SPINLOCKS AND FLAME GRAPHS

Timevir
Aug 14, 2021
5 min read

Updated: Nov 29, 2022

If you haven’t already read last week’s blog on targeting, you can read it here: https://www.sanctuary-rts.com/post/how-to-choose-a-target-from-10-000-options Warning: Highly technical writing for programmers. Don't worry if you don't understand it all.

One of the major issues with the targeting algorithm being multithreaded was creating a concurrent reusable list in Unity’s Burst. As I mentioned last week, Unity’s nativeList cannot be resized while utilized in parallel. Also the list cannot be used in threads because it is passed by value struct, so each independent thread has its own copy. Operations such as adding to it or resizing only changes the thread’s copy, but not the copies in other threads. This is a problem since we want these to be synchronized. So I had to create a version of this for myself; a list collection that can be copied, and that can be resized across threads. I didn’t want to reallocate the memory every time it is moved, as that is computationally expensive, so I tried making something akin to a segmented list (meaning an array of pointers), but I quickly realized this was not going to work, because it wouldn't be very performant on smaller lists (sizes below 1000), where copying should be fast and the overhead is generally low. So I just decided to make a thread-safe wrapper for the existing UnsafeList collection. Except Burst does not support lock statements usually used in concurrent programs, so now what? Here cometh spinlocks to thy rescue.

For those who don’t know what a spinlock is, I will describe it with an allegorical example. Imagine a one-person bathroom is currently occupied, and you are sitting outside it with somebody else, checking the lock to see if it is free. Unlike a real bathroom, this one has many different doors, and you don’t see the person getting out. The most important part for both of you is being careful to not enter the bathroom at once, just because a door said it was free.

For this we can use the atomic compare exchange instruction. Atomicity means the instruction cannot be executed at once by different threads and return an incorrect result, hence you can use it to lock the bathroom entirely, then use it. Afterwards you need to explicitly unlock it to leave. So I implemented that, and it almost worked, but there were still some race conditions left. But how? Well what I could have done to solve my issues is to always check the lock whenever doing anything adding something to the list, but that would be bad for performance as I am only forced to lock when expanding the list’s capacity, not when adding data. So the basic code looked something like:


nextId = interlockedIncrement(idCounter);
while(lock) wait;
if(nextId == capacity)
{
  Lock();
  Expand();
  Unlock();
}
Add(data);

However, if somehow two threads entered at the exact same time, they would both pass the lock check; the first one would acquire it and work correctly, but the second one could try adding to the list that is being expanded and throw an error. The next attempt would naturally be to check if the capacity is enough, and spin while it is not.


nextId = interlockedIncrement(idCounter);
while(lock) wait;
if(nextId == capacity)
{
  Lock();
  Expand();
  Unlock();
}
while(capacity < nextId) wait;
Add(data);

But if you’re paying close attention, now I effectively just have 2 locks? So instead, I decided that my spinlock would not be the one implemented by compareExchange but one implemented using interlockedIncrement. It is an emergent property.


nextId = interlockedIncrement(idCounter);
while(capacity < nextId) wait;
if(nextId == capacity)
{
  Expand();
}
Add(data);

And everyone lived happily ever after. I mean it worked!

Let’s revisit the FinishAddingTargetsAndDistributeThem function from my previous developer log. With the solutions so far, the function contributes a *large* part of the performance cost. Like 90%.

This is a flame graph, and simply is a way we can look at how our code is executing on the CPU.

Looking at the function “void Unity.Collections.NativeSortExtension.Sort” (on the tallest part of the blob), that is the part where targets are sorted, but at the moment a large part of its stack is checking whether a unit is actually supposed to be targeted.

For example the target units might be air and the source unit can only shoot land units, or there might be units with higher priority, like for example shooting at aircraft, before shooting other units(if you are a flak gunship). Currently the unit categories are represented as a list, so to check if it contains x, you need to iterate through the entire list, which is really slow.

So we can optimize that by one hot encoding it instead. This is a type of hash table, but hash tables are slow compared to raw array access, so we can see it sped up the results on the graph, but not by a lot. Currently most of the time is taken by looping through target categories and seeing if they are in the array, so we need to optimize that.

Another optimization that was missing was the fact that units are supposed to find a target and then stop looping, but right now the issue is that I never accounted for unit hp, so all units were looping through everything trying to find a unit with living hp, meaning it was not working well when we have units being in range of everything (the worst case scenario). Currently we are doing 94973713/(1000*1000) = 94m comparisons. After adding actual hp to units we get: 94m -> 36m comparisons Reducing unit range from 100, to 15 we get: 15m -> 1m comparisons. So the code runs 100x faster if we account for hitpoints and reduce ranges to the amount that’ll be realistic for spammed units. One major flaw to fix still is that the list of targets does not remember where the previous source unit left off, adding additional repeated loops though more units in order to find a target with non 0 hp. Also, when there are few groups containing lots of units, we can reach a situation where groups are extremely uneven. One group might contain 5000 units, then we have 5 more with 500, so the issue is that we may have group level parallelism, but not inside the group itself, which makes for some very uneven jobs and therefore non-ideal distribution.

Recall however that this is the absolute worst case scenario, and this is when all units need to find a target in the same frame, which almost never occurs in real gameplay. Also the group building function is currently single threaded but takes 1.82ms, so in total the performance could still be so much better, but it is good enough for the time being. In my next log, I will focus on how we add features such as lua support. See you soon! BADUMP

SPINLOCKS AND FLAME GRAPHS

Recent Posts

Comments