Friday, February 6, 2015

Parallelizing the game

From the beginning, the development of the game has had concurrency in mind, with the intent of not only taking advantage of more cores but also increasing the responsiveness of the game. The following is an overview of the architecture of the game and how it takes advantage of task parallelism to fully utilize multi-core processors.

The producer handles game logic and the state of the game, while the consumer displays the information for the player.


The game is divided into two parts: the game state, which maintains the current state of the game and runs its logic, and the client, which produces the graphics and submits input to the game state. This design pattern is called "Producer-Consumer" and poses a few issues with regard to thread-safety.

The producer, the game state, runs stand-alone and is mostly unaware of the graphical client. The game state's job is to handle all the logic and maintain the current state of what is happening: it runs the AI, physics, health, damage, and so on. The client, on the other hand, depends wholly on the game state in order to display information about the game. Now, the easiest way for the client to get data from the game state would be to simply read all the public variables exposed by the game state and use that information. The issue is that while the game state is updating those variables, its data structures are temporarily invalid, so reading them directly is not thread-safe. To avoid this issue, the game state does two things.

First, after each update the game state gathers all the information that may be useful to the graphical client and packages it into a data structure. A mutex is held while this structure is being updated, so it is always thread-safe to read. Each frame, the graphical client checks whether the mutex is free and, if so, copies over the shared pointers to the data structures the game state has provided, then releases the mutex. The game state allocates new objects under fresh shared pointers every update, so the graphical client never has to worry about thread-safety for the objects it holds references to. Additionally, once the client is done with the old objects, they automatically destruct when the client reads in the new shared pointers from the game state.
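This publish-and-copy pattern can be sketched in a few lines. Everything here (the `GameState` and `UnitSnapshot` names, the fields) is illustrative, not the game's actual code; it only shows the mechanism described above: the producer swaps in fresh shared pointers under a mutex, the consumer copies them out with a non-blocking `try_lock`, and old snapshots free themselves when the last reference drops.

```cpp
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical read-only data published for the renderer.
struct UnitSnapshot {
    float x = 0, y = 0;
    int health = 0;
};

class GameState {
public:
    // Producer side: after each update cycle, publish a freshly built
    // snapshot under brand-new shared pointers.
    void publish(std::vector<std::shared_ptr<const UnitSnapshot>> fresh) {
        std::lock_guard<std::mutex> lock(mutex_);
        latest_ = std::move(fresh);
    }

    // Consumer side: copy the shared pointers only if the mutex is free.
    // The snapshots the client previously held destruct automatically
    // once the last shared_ptr to them goes away.
    bool try_read(std::vector<std::shared_ptr<const UnitSnapshot>>& out) {
        std::unique_lock<std::mutex> lock(mutex_, std::try_to_lock);
        if (!lock.owns_lock()) return false;  // skip this frame, retry next
        out = latest_;
        return true;
    }

private:
    std::mutex mutex_;
    std::vector<std::shared_ptr<const UnitSnapshot>> latest_;
};
```

Because the snapshots are immutable once published, the client can hold them across frames without any further synchronization.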

The second method for safely reading data from the game state is the use of atomic types for most of the basic information it holds. Every now and then the client needs information that isn't normally available, so it atomically reads whatever variables it needs directly from the game state. Only one assumption is made: the game state could be changing any of those variables at any time. This means that if something like a position is being changed, the client may end up with an x coordinate from the last update and a y coordinate from the current update cycle. Thankfully these cases are rare and the client handles them just fine, especially since the next read quickly corrects the data. It should be noted that atomic types are not strictly necessary for most of this information if the proper precautions are taken; even so, atomic types are much safer at the cost of a small performance hit.
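The torn-read scenario above can be shown concretely. This is a minimal sketch, assuming a position stored as two independent atomics (the `Unit` struct and function names are invented for illustration): each individual load and store is race-free, but a reader that lands between the two stores sees a mixed position, which the next read corrects.

```cpp
#include <atomic>

// Illustrative sketch: a position stored as two independent atomics.
struct Unit {
    std::atomic<float> x{0.0f};
    std::atomic<float> y{0.0f};
};

// Producer thread, somewhere inside the game state update:
void move(Unit& u, float nx, float ny) {
    u.x.store(nx, std::memory_order_relaxed);
    // A reader running right here sees the new x with the old y:
    // a "torn" position the client tolerates, since the next read
    // (at most one update cycle later) corrects it.
    u.y.store(ny, std::memory_order_relaxed);
}

// Consumer thread, in the graphical client:
void read_position(const Unit& u, float& out_x, float& out_y) {
    out_x = u.x.load(std::memory_order_relaxed);
    out_y = u.y.load(std::memory_order_relaxed);
}
```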

So far, two threads have been covered: the producer (the game state) and the consumer (the graphical client). While this allows the game to use up to two cores, the parallelism doesn't yet scale beyond that. This is where the second means of adding parallelism comes in.

The game state loosely follows something called the "Actor Model", which drastically increases the parallelism of the game. In the game state, almost everything is represented by a single class, the "Unit" class. This Unit class represents the player's character, the monsters, the portals, even the items you can pick up off the ground. Each Unit has variables that define its behavior, including the Unit type (whether it's an item on the ground or a unit that walks around), the AI type (whether it's a melee creature or an item that takes no actions), and so on. The Unit class represents the Actors in the game.

While having Actors is an important part of the Actor Model, it is useless without messages telling the Actors what to do. A basic class called "Request" controls the actions of all Actors. If you tell your character to run to a location, the graphical client sends the Unit a Request to walk there. If a monster's AI wants it to attack you, it sends an Attack Request. Since these requests (the messages to the Units) are only processed once per cycle, they are added to a concurrent queue held by the game state to await processing on the next cycle.
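A request and its queue might look something like the sketch below. The field names and request types are invented for illustration, and while the game itself uses a concurrent queue from Intel's TBB library (mentioned later in the post), a mutex-guarded `std::queue` stands in for it here so the sketch stays dependency-free.

```cpp
#include <mutex>
#include <queue>

// Hypothetical message types; the real game has many more.
enum class RequestType { Walk, Attack };

// A message addressed to one Actor (Unit).
struct Request {
    RequestType type;
    int actor_id;             // which Unit this message is for
    float target_x, target_y; // request-specific payload
};

// Stand-in for a concurrent queue: any thread (client or AI) may
// submit; the game state drains it once per update cycle.
class RequestQueue {
public:
    void submit(Request r) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push(r);
    }
    bool pop(Request& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_.empty()) return false;
        out = pending_.front();
        pending_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::queue<Request> pending_;
};
```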

The Game State utilizes many actors that operate in parallel for each step in the Game State's update.


Here's where task parallelism kicks in and the program scales to as many cores as are available. Every cycle, the game state steps through several stages to update the game. First, it determines the player's position and builds a list of every unit within a certain distance of the player (this is done for efficiency, since units in other maps or far away don't need updating). Next comes the first parallel stage: every Unit is updated concurrently. This update is fairly basic; it determines whether a Unit is dead, calculates its stats for the current cycle based on equipped items and auras, how much light it emits, energy regeneration, and so on. Since this update depends only on the Unit's own variables and no others, every Unit can be processed at the same time in a thread-safe manner.
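Because each Unit's update touches only that Unit's own fields, the range can simply be split across threads with no locking at all. The game hands this off to TBB's `parallel_for_each` (covered later); the sketch below hand-rolls the same idea with plain `std::thread` so the mechanism is visible. The `Unit` fields and the stat math are placeholders.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

struct Unit {
    int health = 100;
    bool dead = false;
    float light = 0.0f;  // illustrative per-cycle derived stat
};

// Stage 1 of the cycle: update every Unit's own stats concurrently.
// No Unit reads or writes any other Unit, so no synchronization is needed.
void update_stats(std::vector<Unit>& units) {
    const size_t workers = std::max(1u, std::thread::hardware_concurrency());
    const size_t chunk = (units.size() + workers - 1) / workers;
    std::vector<std::thread> pool;
    for (size_t w = 0; w < workers; ++w) {
        const size_t begin = w * chunk;
        const size_t end = std::min(units.size(), begin + chunk);
        if (begin >= end) break;
        pool.emplace_back([&units, begin, end] {
            for (size_t i = begin; i < end; ++i) {
                Unit& u = units[i];
                u.dead = (u.health <= 0);
                u.light = u.dead ? 0.0f : 1.0f;  // placeholder stat math
            }
        });
    }
    for (auto& t : pool) t.join();
}
```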

Once all Units' stats are updated, we move to the next parallelized task: the Units' AI. At this point every Unit's stats (including health, energy, and speed) are already updated and available to read. Each Unit's AI is called concurrently, and every AI freely reads the game state to make a decision. Since no Unit is being updated or changed, all data being read is thread-safe. Once an AI has finished processing and made a decision, it creates a Request (the message in the Actor Model) and submits it to the game state for future processing.

At this point, every Unit's stats have been updated (including health and other stats), and every Unit's artificial intelligence has made its decision for this cycle. The final step is to carry out the actions of every Unit's AI. The game state filters all Requests in a very simple manner to make sure they are valid (including limiting each Unit to only one Request per cycle), and then processes all the Requests concurrently. Each Unit has a handler function for every available Request type, and the matching handler is called with the Request object, which carries the specific data (such as which Unit to attack or the direction to cast a fireball).
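The filtering step can be sketched briefly. The post doesn't spell out the exact policy, so this sketch assumes a simple first-request-wins rule per Unit per cycle; the `Request` shape and function name are illustrative.

```cpp
#include <unordered_set>
#include <vector>

struct Request {
    int actor_id;  // which Unit issued / is targeted by this Request
};

// Keep only the first Request seen for each Unit this cycle.
// (First-wins is an assumed policy for illustration.)
std::vector<Request> filter_requests(const std::vector<Request>& all) {
    std::unordered_set<int> seen;
    std::vector<Request> valid;
    for (const Request& r : all)
        if (seen.insert(r.actor_id).second)  // false if Unit already has one
            valid.push_back(r);
    return valid;
}
```

The surviving Requests are then handed out to the concurrent handlers.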

This is the part of the game state cycle that is thread-unsafe if not handled properly. For example, one Request may modify a variable that another Request is reading or also writing. There is also no guarantee of the order in which Requests are processed. If, for example, one Request is for a spell that halves a Unit's life while another deals constant damage, the order in which the two spells resolve changes the end result.

To resolve the issue of two Requests modifying and reading the same variable simultaneously, atomic types are used for any variable that may be modified while a Request is being handled; Request handling is limited to modifying only atomic variables, so thread safety is maintained. As for the seemingly random order of Request handling, thread-safety is maintained regardless, and at an update frequency of 20 Hz (20 updates per second) the unpredictability caused by random request order is minimal. Once networking is implemented, however, this will cause de-synchronization and will have to be accounted for when the clients regularly re-synchronize.
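Both points can be seen in a small sketch (the `Unit` struct and spell functions are invented for illustration). Each operation below is race-free on its own thanks to the atomic health, yet the two spells from the earlier example still produce different totals depending on which resolves first, which is exactly the ordering issue that plain atomics cannot remove.

```cpp
#include <atomic>

// Health is atomic so two Requests resolved on different threads can
// both hit the same Unit without a data race.
struct Unit {
    std::atomic<int> health{100};
};

// Spell 1: constant damage. fetch_sub is a single atomic read-modify-write.
void constant_damage(Unit& u, int amount) {
    u.health.fetch_sub(amount, std::memory_order_relaxed);
}

// Spell 2: halve the Unit's life. There is no atomic "divide", so a
// compare-and-swap loop retries if another thread changed health
// between our read and our write.
void halve_life(Unit& u) {
    int old = u.health.load(std::memory_order_relaxed);
    while (!u.health.compare_exchange_weak(old, old / 2,
                                           std::memory_order_relaxed)) {
        // `old` is reloaded by compare_exchange_weak on failure.
    }
}
```

Starting from 100 health, halving then subtracting 30 leaves 20, while subtracting 30 then halving leaves 35: thread-safe either way, but order-dependent.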

It should be noted that when dealing with multiple cores, cache coherency plays a very large role in the performance of each thread. The overhead of a single mutex can be well over 100 CPU cycles, which makes mutexes expensive operations. Additionally, false sharing, where two cores actively read and write the same cache line (even if they touch two separate variables), can delay each thread by over 100 CPU cycles every time it occurs, because the cores must constantly update their caches.

The good news is that the models used by the game mostly avoid these issues. For the Producer-Consumer model, a temporary "buffer" of data is created (as noted previously) that is read-only and used only by the consumer thread. The buffer is protected by a mutex only once every 50 ms, a negligible overhead, and since no data is written to it after creation, false sharing cannot occur. The consumer (the client) only calls try_lock() once per frame (and only if a new buffer is available), so it never blocks on a mutex held by the game state. As for the game state blocking, the client holds the mutex only long enough to copy a vector of roughly 100 shared pointers, which, combined with the fact that the game state acquires the mutex only every 50 ms, means the chance of blocking is very low; even then, a block costs only a few hundred cycles every 50 ms, which is negligible. While you'd normally avoid any blocking between producer and consumer (using something like a queue or a double buffer), the negligible amount of blocking that does occur means a plain mutex is more than sufficient.

For the second means of parallelism, the Actor Model, two of the three stages have minimal cache coherency issues. The first stage, updating each Unit's stats, limits each thread to modifying only its own Unit. No data is shared between two threads, so no false sharing occurs; additionally, each Unit can modify its data without mutexes. During the second stage, the AI, the game state is in a read-only mode, which again produces no cache coherency or false sharing issues. The final stage, each Unit executing its action, is the only case where performance can suffer. Although no mutexes are used, atomic operations can create similar performance issues (though they are typically faster). The good news is that, due to the nature of the Requests being handled, most of the time Units only touch their own data (walking, for example, affects only a Unit's own variables), which minimizes the chance of two threads sharing the same cache line.

Intel's parallelism library "Threading Building Blocks" provides both concurrent containers and parallel loops.


The three parallelized tasks (updating stats, making decisions with AI, and executing those decisions) all utilize Intel's Threading Building Blocks library. Concurrent data structures from TBB (queues, hash maps, and vectors) are used along with parallel_for_each, which executes the tasks concurrently in an efficiently managed manner. Explicit std::thread objects are used for the game state and the graphical client. The beauty of parallel_for_each is that it automatically manages a thread pool and spreads tasks across it to maximize performance while accounting for the overhead involved in doing so. In an ideal environment with zero threading overhead, the game state could utilize as many cores as a computer has, future-proofing the game and enhancing performance as core counts increase.
