Deep reinforcement learning for real time strategy games

Daniil Markelov
Aug 23, 2022
10 min read

Updated: Nov 29, 2022

Hi everyone. I’m CubeMD, and today I'd like to tell you about my work on the Sanctuary project. I will try to keep this post entertaining and keep myself from going too deep into details, for those of you who are here for the story and not the class on AI.

I would not be lying if I said that this post is long overdue. I have been working with Enhearten Media for over 6 months now. I joined the team as a volunteer because I loved the idea behind the project and the RTS genre as a whole.

This is an outline of what I am going to cover:

Who I am and what I do
What deep learning and reinforcement learning are
End goals
First experiments with DOTS and ML-Agents
Current work with continuous time
Summary of next steps

About myself

I started learning about how video games work in high school. Back then, I used to host heavily modded servers for games like Team Fortress 2, Left 4 Dead 2, Killing Floor 2, and, of course, Minecraft. I got Supreme Commander way back in 2007, and even though my playstyle resembled a half-dead turtle, I played through all the campaigns several times. The game took a special place in my heart and I carried the warmest of memories until I rediscovered it more than 13 years later.

I started seriously learning game development in 2018. My first game engine was Unreal, but later on I transitioned to Unity. My specialty is the design of reinforcement learning environments with game engines, primarily Unity. I’ve been using the ML-Agents toolkit since it was in early beta and have thoroughly studied the complexities of communicating data between game engines and Python, training parallelization, and the training algorithms themselves.

I experimented with applications of reinforcement learning to various games, and I learned that designing how the agent interacts with the world is a crucial but often overlooked part of reinforcement learning. I make custom sensors, action and reward systems, and lay out the decision-making process to make agents that behave in the desired ways.

What is it all about

Classical AI

Image source: https://chris.farkouh.net/blog/emotional-state-machine

The straightforward approach to AI can be roughly summarized by these 3 steps:

Take a human who is decent at the task that you are trying to solve
Condense the knowledge and behavior (relative to the task) of this human into a rule-based system
Implement this rule-based system, using the language and tools of your choice
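
To make the three steps above a bit more concrete, here is a minimal sketch of what such a rule-based system could look like. Everything in it (the `GameState` fields, the thresholds, the order names) is made up for illustration and has nothing to do with how Uveso's AI or Sanctuary actually work.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GameState:
    # Hypothetical, heavily simplified snapshot of an RTS match.
    mass: int
    army_size: int
    enemy_in_base: bool


@dataclass
class Rule:
    name: str
    condition: Callable[[GameState], bool]
    action: str


# Rules are checked top to bottom; the first one whose condition holds wins.
RULES: List[Rule] = [
    Rule("defend", lambda s: s.enemy_in_base, "pull army back to base"),
    Rule("attack", lambda s: s.army_size >= 20, "attack enemy base"),
    Rule("build army", lambda s: s.mass >= 100, "build a tank"),
]


def decide(state: GameState) -> str:
    for rule in RULES:
        if rule.condition(state):
            return rule.action
    return "gather resources"  # fallback when no rule fires


print(decide(GameState(mass=150, army_size=5, enemy_in_base=False)))
```

New behavior is added by appending another `Rule`, which is the incremental quality that makes this approach quick to get running.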

This approach has been extensively utilized throughout the history of video games, which is why it is called classical AI. My colleague Uveso is working on such a system, and already has it up and running. I suggest taking a look at his blog posts if you are interested in the details of this approach.

This is a method with very important benefits:

It works as you design it - rule-based systems are transparent and easy to debug
It starts working quickly - finishing the whole logic takes time, but the rules can be added incrementally
It is well explored - while these systems are not easy to implement, you can rely on other people's experience

Learning-based AI

In contrast, the learning-based approach falls short on every one of these points:

We don't know why it does what it does, or at least it is not obvious
It takes time to set it up and train. Before that, it is not functional
It is very new. Code and literature for production games are rare

So why bother and what is it after all?

Deep Learning

Or Neural Networks, the buzzword that has infiltrated every topic and discussion in the last eight years. I will try to explain it in just two paragraphs, so bear with me.

Below you can see a typical representation of a neural network that would estimate the price of an apartment based on a few parameters:

Something is passed in as input, something is expected at the output
Initially, the output produced for particular inputs is going to be completely wrong
This picture illustrates an architecture - the path that input data goes through to become the output

During training, we pass inputs into the network and see what output it produces. Then we figure out how wrong it is by comparing it with the actual answer. Using calculus and backpropagation, we intelligently update trainable parameters in a way that would decrease error the most. We repeat this several times for every known example.
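
The training loop described above can be sketched in a few lines of plain Python. This is a deliberately tiny stand-in: a single linear unit (`price = w * area + b`) plays the role of the whole network, and the four apartments are toy numbers I made up so that the right answer is known (`w = 2`, `b = 1`). A real network has many layers and uses backpropagation to get the gradients, but the predict, measure error, nudge parameters cycle is the same.

```python
# Toy examples: (area in hundreds of square meters, price in hundreds of thousands).
# Made-up data generated from price = 2 * area + 1.
examples = [(0.5, 2.0), (1.0, 3.0), (1.5, 4.0), (2.0, 5.0)]


def predict(w, b, area):
    return w * area + b


def mean_squared_error(w, b):
    return sum((predict(w, b, x) - y) ** 2 for x, y in examples) / len(examples)


def train(epochs=2000, learning_rate=0.05):
    w, b = 0.0, 0.0  # untrained parameters: the first outputs are completely wrong
    for _ in range(epochs):
        for x, y in examples:
            error = predict(w, b, x) - y   # how wrong are we on this example?
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


w, b = train()
print(round(w, 3), round(b, 3))
```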

Everything else is just details (important, but details). Each layer does some transformation, and through training we find a transformation that produces the output that we expect to see. I learned this perspective from Alfredo Canziani. There is a great visualization of these transformations in the video below (13:55 - 16:00).

The key to training these networks for any task is having as many examples as possible. You can think of a neural network as learning to fill in the blanks, or to interpolate between the answers it has seen. If the network has not seen a certain breed of dog during its training, you can only hope that it produces the right output. If you could show a very large network all the pictures on the Internet, it would have very few blind spots, just as a well-educated human would.

A neural network essentially learns to approximate an answer to any question through examples. To have an AI that uses it to play a video game, you would need to have examples of what it should do in almost every possible situation and use the described algorithm to train it. This is where reinforcement learning comes in.

Reinforcement learning

We can use our knowledge about the dynamics of the environment to avoid explicitly telling the agent what to do. We can introduce the notion of reward and say that the agent's goal is to collect as much of it as possible before the episode ends (due to either a win or loss). We do it to produce a system that decides what to do based on the notion that there exists a "better" action in every state.

With this system, instead of relying on collecting an astronomical number of examples of good actions, the agent plays the game directly and uses differences in total received rewards to figure out what the better action is in every state.

To be precise, we can say that during training, the goal of the system is to use a reinforcement learning algorithm to find a policy that uses the current state of the agent to produce an action that results in the highest possible expected (average) future reward.

The simplest policy could just be a table that would store how much reward the agent would on average get if it took a certain action in a certain state and followed this policy thereafter. It is called a Q (quality) table, and to figure out the desired action, you just find an action that has the biggest number associated with it in its current state and you are done.
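
Here is a minimal sketch of such a table for a toy problem of my own invention: a corridor of five cells where the agent starts on the left and gets a reward of 1 for reaching the rightmost cell. I fill the table with Q-learning, which is just one of several algorithms that could be used, and the numbers (`GAMMA`, `ALPHA`, 500 episodes) are arbitrary choices for the example.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (0, 1)      # 0 = step left, 1 = step right
GAMMA = 0.9           # discount applied to future reward
ALPHA = 0.5           # learning rate


def step(state, action):
    """Deterministic toy environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore at random; Q-learning can learn from any behavior.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


def best_action(q, state):
    """The table lookup: pick the action with the biggest number in this state."""
    return max(ACTIONS, key=lambda a: q[state][a])


q = train()
print([best_action(q, s) for s in range(N_STATES - 1)])
```

After training, the lookup picks "step right" in every cell, even though nobody ever told the agent which direction is correct.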

Reinforcement learning is a pretty old field, and you might see how it is related to utility AI, Pavlovian conditioning, and many other directions.

But what if we don't want to visit every state-action pair to figure out its value, or have so many state-action pairs that we cannot fit them in the entire universe, let alone my poor PC?

Deep reinforcement learning

Then we approximate the Q using neural networks! Instead of storing state-action pairs in fancy tables, we just train a neural network to approximate desired outputs given certain inputs. The field of deep reinforcement learning is just a combination of techniques that allow us to create agents that use neural networks to estimate values from the field of reinforcement learning.
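
To show the idea of replacing the table with a learned function, here is a small sketch under heavy simplifications. Instead of a neural network I use one straight line per action, which is the simplest possible approximator, and the "known" Q-values are toy numbers I made up for a hypothetical state described by a single number (distance to the enemy base). The point is only that the fitted function also answers for states it never saw.

```python
# Q-values for a few visited states (made-up numbers for illustration).
# Each entry is (distance to enemy base, Q-value of taking that action there).
known = {
    "attack":  [(1.0, 0.9), (3.0, 0.5), (5.0, 0.1)],
    "retreat": [(1.0, 0.2), (3.0, 0.4), (5.0, 0.6)],
}


def fit_line(points, epochs=5000, learning_rate=0.01):
    """Fit q = w * distance + b to the points with plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in points:
            error = (w * x + b) - target
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# One small model per action stands in for the outputs of a network.
models = {action: fit_line(points) for action, points in known.items()}


def q_value(action, distance):
    w, b = models[action]
    return w * distance + b


def best_action(distance):
    return max(models, key=lambda a: q_value(a, distance))


# Distances 2.0 and 4.0 never appeared in the data above.
print(best_action(2.0), best_action(4.0))
```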

Of course, it is much more nuanced than that. It introduces a lot of instabilities and mathematical disasters, but it allows for a system that does not rely on handcrafted behavior and directly looks for a solution to the task (if the reward aligns with it). It also allows for interpolation - meaning that we should behave similarly in states that look alike but are not exactly the same.

I would like to note that everything above is an oversimplification of my own understanding of these complex topics. The world's brightest minds have spent their lives mastering these topics, developing software that requires minimal knowledge to achieve maximum results. There are different flavors to everything that I described, and this actively developing field builds on decades of work with every new publication.

What does it all mean and what do we actually want?

RTS games are hard. Really hard. While it is certainly possible to hard-code concrete behavior for a classical AI, it becomes increasingly difficult to account for all possible situations when you have hundreds of unit types, reclaim, an economy to manage, thousands of units on the battlefield, and many more parameters that you need to consider.

On top of that, with classical AI, we assume that we are able to figure out the best course of action in advance. That's a bold claim considering that players spend thousands of hours mastering micro and macro management and still find ways to exploit each other's strategies. While it is certainly appealing to aim at super-human performance, AI has to be fun to play against. It also has to be different from run to run, focusing on different strategies from the very beginning of the game.

And we are not talking about a game like Civilization, where by the end of the game you spend the majority of your time waiting for an AI to make its move. We need all of this to work in real time with a minimal performance budget, ideally asynchronously.

Interesting gameplay requires flexibility, and this makes it appealing to just teach an agent what needs to be done, instead of coming up with all the rules ourselves. To achieve that, the agent needs to play thousands of games and learn from scratch. There are techniques that allow us to reuse the data from human gameplay, but ultimately the agent learns from direct interaction with its environment. This is where classical AI can serve very well as an opponent for the agent.

Using the same system of perception and action, it is possible to create agents to achieve different goals. Instead of only serving as an opponent, AI could be trained to take care of chores like reclaiming or managing the whole economy while the player is busy with the higher-level task of battling an opponent. Thankfully, Sanctuary's extensive support for modding would easily allow for that.

The goal is to create a system that enables efficient perception and interaction with the environment. After that, we can proceed to train an agent to serve as an opponent for the human player or incorporate it into the gameplay in different ways. The amount of compute power available on the average consumer PC nowadays is astonishing, and this power can be used to allow for previously unseen gameplay. Learning-based artificial intelligence can shine very brightly in a great, complex environment like Sanctuary.

What has been done already

Performance is among our top priorities, and to figure out whether ML-Agents would be able to support the scale of battles in Sanctuary, we first tested out the experimental DOTS implementation with a simple game.

To push it to its limit, every agent was individually controlled. The goal of the agents was to survive as long as possible and collect blue orbs while avoiding any projectiles. There were two teams, each using a separate policy.

Even though the goal of the test was just to see the performance, the agents were trained to resemble some sort of flocking behavior. They were trained for just a couple of hours on consumer-grade hardware.

The crude, out-of-the-box implementation was able to support a couple thousand agents while being bottlenecked by a prototype perception system.

What else has been done already

My most recent work has been focused on the way that agents are making decisions. Usually, the agent is asked to make a decision with a certain frequency. So let's say every 0.1 seconds the agent can choose its desired action and the environment is going to react accordingly.

This system fits the paradigm of continuous control of a physics body pretty well, but it wastes precious compute power by asking the agent the same question over and over while the agent just wants to wait for an appropriate moment to issue its next command. This would be especially noticeable in an RTS game, where agents often need to wait for units to arrive at a destination or for a building to be completed.

This is what I spent my summer on, and if you like this post, I encourage you to read about my findings in this article on Medium.

Long story short, I recreated the small Dino game in Unity and trained agents with different fixed decision frequencies to solve it. The catch is that one type of agent was also able to output a delay before performing its action. This allowed for much lower decision frequencies and more stable training.
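
To illustrate why an action with a delay means fewer decisions, here is a toy simulation. It is not the code from my experiments: both policies below are hand-scripted instead of learned, and the tick rate, obstacle positions, and function names are all made up for the example. Each decision returns an action plus the number of ticks to wait before performing it, and the agent is only asked again after the action has been performed.

```python
TOTAL_TICKS = 100            # hypothetical episode length, 1 tick = 0.1 s
OBSTACLES = (25, 60, 90)     # ticks at which the dino has to jump


def run_episode(policy, total_ticks=TOTAL_TICKS):
    decisions = 0
    executed = []            # (tick, action) pairs actually applied to the game
    tick = 0
    while tick < total_ticks:
        action, delay = policy(tick)
        decisions += 1
        tick += delay        # the game keeps running; the agent is not queried
        if tick < total_ticks and action != "noop":
            executed.append((tick, action))
        tick += 1            # the next decision happens on the following tick
    return decisions, executed


def fixed_rate_policy(tick):
    """Asked every tick; answers 'noop' almost all the time."""
    return ("jump" if tick in OBSTACLES else "noop"), 0


def delay_policy(tick):
    """Asked rarely; schedules the next jump and waits until then."""
    upcoming = [t for t in OBSTACLES if t >= tick]
    if upcoming:
        return "jump", upcoming[0] - tick
    return "noop", TOTAL_TICKS   # nothing left to do in this episode


print(run_episode(fixed_rate_policy)[0], run_episode(delay_policy)[0])
```

Both policies perform the same three jumps, but the first one is queried 100 times and the second one only 4 times.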

While this does not solve all of the issues in the way that the agent interacts with the environment, it allows for more freedom and performance, as we are making fewer decisions per second. This work required me to dig through a lot of papers on the topic, and I now have more aces up my sleeve to squeeze out even more performance.

What is next

To train the agent, we need the environment to behave in a way that resembles the actual gameplay. As Sanctuary continues its development, I am shifting my focus to experiments on the actual game. For training to work, the agent needs to be able to:

Perceive the world. This includes all sorts of data about units, terrain, economy, etc.
Take action in the world. This means selecting units and giving out orders.
Receive rewards according to the current performance. This is not limited to winning and losing and should be used to shape the agent's behavior.
Restart the game. During training, the agent lives through a Groundhog Day, endlessly exercising before facing the real world.
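
The four requirements above map quite naturally onto a small interface. The sketch below is purely hypothetical: the class names, the `reset`/`step` methods, and the toy game are my own illustration and are not the Sanctuary or ML-Agents API. The observation covers perceiving, the order covers acting, the returned number is the reward, and `reset` is the restart.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple


class TrainingEnvironment(ABC):
    """Hypothetical interface an RTS game could expose for training."""

    @abstractmethod
    def reset(self) -> Dict[str, float]:
        """Restart the game and return the first observation."""

    @abstractmethod
    def step(self, order: str) -> Tuple[Dict[str, float], float, bool]:
        """Apply an order; return (observation, reward, episode_finished)."""


class ToySkirmish(TrainingEnvironment):
    """Tiny stand-in game: build units, then attack once the army is big enough."""

    def reset(self):
        self.army = 0
        self.turn = 0
        return self._observe()

    def _observe(self):
        return {"army": float(self.army), "turn": float(self.turn)}

    def step(self, order):
        self.turn += 1
        reward, done = 0.0, False
        if order == "build":
            self.army += 1
            reward = 0.1                  # shaping reward, not only win or loss
        elif order == "attack":
            done = True
            reward = 1.0 if self.army >= 3 else -1.0
        if self.turn >= 10:               # time limit ends the episode
            done = True
        return self._observe(), reward, done


def play(env, orders):
    """Run one episode with a fixed list of orders and return the total reward."""
    env.reset()
    total = 0.0
    for order in orders:
        _, reward, done = env.step(order)
        total += reward
        if done:
            break
    return total


print(play(ToySkirmish(), ["build", "build", "build", "attack"]))
```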

Of course, I am using the success stories of OpenAI Five in Dota 2 and DeepMind's AlphaStar in StarCraft II as references. According to the researchers of both projects, the majority of their time was spent developing decision, perception, and action systems.

While the perception system needs to be complex to capture all the required information to make an educated decision, I believe that agents could utilize a much simpler controller than the ones described in the papers. The controller could resemble an imaginary mouse that the agent could learn to use to select units and give out orders in a similar way to how humans do it. This would also solve the problem of "unfairness" when AI is able to control all units at the same time, though I am less concerned about this issue at the moment. I believe that if an opponent is playing without obvious cheats like being able to see through the fog of war or having a resource multiplier, it is a worthy opponent.

Such a controller would also enable us to easily use replay data to "warm up" the agent by forcing it to imitate the real player's behavior and only then proceed to train in the actual environment. Researchers state that they were able to train competitive agents by only using replay data in both games in a reasonable amount of time.

The amount of compute power used to train these models is insane, but I would like to remind you that these companies tried their best to achieve super-human performance. I believe that with pretraining and other tricks, we are going to be able to create human-like AI that would be fun to play with and hopefully be a welcome guest in public lobbies, causing havoc and destruction on the field of battle.

I am incredibly thankful to Enhearten Media for letting me work with their game and hope that I will be able to deliver on my promises.

Thank you for reading this article to the end. It certainly was a long one.