Beating Atari using Neuroevolution, TensorFlow and Ray’s RLlib

Daan Klijn
Jan 11, 2021 · 6 min read


In 2017, Uber AI Labs showed that Deep Neuroevolution can learn to play various Atari 2600 games and can even beat other state-of-the-art methods while doing so. The form of Neuroevolution that Uber used was a simple Genetic Algorithm with only very basic mutation operators. It truly amazes me that such a simple setup can compete in such a complex environment, so in this blog post I set out to reproduce it.

TensorFlow and RLlib

I chose TensorFlow as it still seems to be the most popular Deep Learning framework, and because there is no TensorFlow reproduction of this paper yet. RLlib is a popular Reinforcement Learning library built on top of Ray that advertises itself as a “Scalable Reinforcement Learning Framework”. Out of the box, RLlib offers many features that most RL experiments need, along with a broad range of algorithms and environments. Training PPO on the CartPole environment takes only a single line of code and some imports.

from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
tune.run(PPOTrainer, config={"env": "CartPole-v0"})

Genetic Algorithms in RLlib

Unfortunately, at this point in time RLlib does not include Genetic Algorithms among its built-in algorithms, so we have to write our own implementation. But hey, that’s more fun for us, right?

The model

We start by implementing the Neural Network that Uber used. The architecture was taken from the classic DQN Atari paper that DeepMind published in 2015. The network consists of three convolutional layers with 32, 64, and 64 channels, followed by a hidden layer of 512 nodes, and each of these layers uses a ReLU activation function. Below we define a class for this model that also contains some logic to mutate the weights by adding Gaussian noise to all of them.
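A minimal sketch of such a model class, assuming TensorFlow’s Keras API and an 84×84×4 stacked-frame observation; the mutation power value and the get_action helper are illustrative choices on my part:

import numpy as np
import tensorflow as tf

class Model:
    """DQN-style network (three conv layers + a 512-unit dense layer) with a simple Gaussian mutation."""
    def __init__(self, num_actions, mutation_power=0.002):
        self.mutation_power = mutation_power
        self.model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu", input_shape=(84, 84, 4)),
            tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
            tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.Dense(num_actions),
        ])

    def get_weights(self):
        return self.model.get_weights()

    def set_weights(self, weights):
        self.model.set_weights(weights)

    def mutate(self):
        # Add Gaussian noise to every weight tensor.
        mutated = [(w + np.random.normal(scale=self.mutation_power, size=w.shape)).astype(w.dtype)
                   for w in self.model.get_weights()]
        self.model.set_weights(mutated)

    def get_action(self, observation):
        # Greedy action: index of the highest output logit.
        obs = np.asarray(observation, dtype=np.float32)[None]
        logits = self.model(obs)
        return int(np.argmax(logits[0]))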

The main reason Genetic Algorithms perform so well on this task is that they scale easily across multiple workers while adding almost no overhead. To set up this scalable approach in RLlib, we need to implement two classes: a Trainer class and a Worker class. The Trainer can be seen as the orchestrator of all the (distributed) workers of our algorithm.

The Worker class

Below we start with the implementation of the Worker class. The class is quite straightforward and only contains an __init__ and an evaluate method. The __init__ method creates the initial model and sets up the Atari environment using the provided env_creator function. The evaluate method evaluates a set of weights for a single episode of the game: first it sets the weights of the model and mutates them, then it plays the game until it finishes or the maximum number of time steps is reached, and finally it returns the results of the episode to the Trainer.
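A sketch of such a worker, assuming the Model class sketched above and a Ray remote actor; the max_timesteps default and the exact return values are illustrative:

import ray

@ray.remote
class Worker:
    def __init__(self, env_creator, max_timesteps=5000):
        self.env = env_creator()
        self.model = Model(self.env.action_space.n)
        self.max_timesteps = max_timesteps

    def evaluate(self, weights=None):
        # Start from the elite weights and mutate them; if None is passed,
        # the model's randomly initialized weights act as the initial weights.
        if weights is not None:
            self.model.set_weights(weights)
            self.model.mutate()

        # Play a single episode.
        observation = self.env.reset()
        total_reward, timesteps, done = 0.0, 0, False
        while not done and timesteps < self.max_timesteps:
            action = self.model.get_action(observation)
            observation, reward, done, _ = self.env.step(action)
            total_reward += reward
            timesteps += 1

        # Return the (possibly mutated) weights together with the episode result.
        return self.model.get_weights(), total_reward, timesteps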

The Trainer class

Next we create the Trainer class that orchestrates all the workers. Again, this class is quite simple and contains only an __init__ and a step method. The __init__ method is called during initialization and creates the workers and some initial properties, such as the number of episodes. The step method represents a single step of our training process; for Neuroevolution, each step is actually a new generation. In each step we do the following (a sketch follows the list):

  1. Divide all the weights of the elites over the workers and ask them to mutate and evaluate the weights. If there are no elites yet, None is passed to the worker and the worker will generate the initial weights.
  2. Retrieve the results from the workers using ray.get.
  3. Determine which weights resulted in the highest score and use those to set the current elite_weights.
  4. Compute some statistics and return them so that Ray can log them.
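A rough sketch of how this could look, assuming the Worker actor above; the number of workers, the number of elites, and the round-robin assignment of elites to workers are illustrative choices, and to run it with tune.run as in the next section it would need to be wrapped as (or subclass) a Ray Tune Trainable:

import numpy as np
import ray

class Trainer:
    def __init__(self, env_creator, num_workers=32, num_elites=10):
        self.workers = [Worker.remote(env_creator) for _ in range(num_workers)]
        self.num_elites = num_elites
        self.elite_weights = []
        self.episodes_total = 0

    def step(self):
        # 1. Spread the elite weights over the workers (round-robin here for simplicity);
        #    None triggers the worker's random initial weights.
        weight_refs = []
        for i, worker in enumerate(self.workers):
            elite = self.elite_weights[i % len(self.elite_weights)] if self.elite_weights else None
            weight_refs.append(worker.evaluate.remote(elite))

        # 2. Retrieve the results from the workers.
        results = ray.get(weight_refs)
        rewards = np.array([reward for _, reward, _ in results])

        # 3. Keep the weights that scored highest as the new elites.
        elite_ids = rewards.argsort()[-self.num_elites:]
        self.elite_weights = [results[i][0] for i in elite_ids]

        # 4. Return some statistics so they can be logged.
        self.episodes_total += len(results)
        return {
            "episode_reward_max": float(rewards.max()),
            "episode_reward_mean": float(rewards.mean()),
            "episode_reward_min": float(rewards.min()),
            "episodes_total": self.episodes_total,
        }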

Running the experiment

Finally, we define the main.py that we call to run our Genetic Algorithm. In this file we first create a function that sets up the environment, in this case ‘Frostbite’. We use some of gym’s Atari wrappers to scale and stack the frames, and we also add a random number of NOOP (do nothing) actions at the start of each game to introduce some randomness, which makes the learned agent more robust. Furthermore, we define our config and call tune.run to run the experiment. We also use the WandbLogger to track the experiment; it logs everything to Weights and Biases, a great experiment tracking framework that is free to use for personal projects.
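As a rough sketch, the main.py could look like the following; the wrapper choices, config keys, and Weights and Biases project name are assumptions on my part, and the noop_max argument of gym’s AtariPreprocessing wrapper is what injects the random NOOP actions:

import gym
from ray import tune
from ray.tune.integration.wandb import WandbLogger
from ray.tune.logger import DEFAULT_LOGGERS

def env_creator():
    # Frostbite with standard Atari preprocessing: up to 30 random NOOPs at reset,
    # 84x84 grayscale frames, and a stack of 4 frames as the model input.
    # (Note: FrameStack stacks frames channels-first, so the observations may need
    # transposing to channels-last for the model sketched earlier.)
    env = gym.make("FrostbiteNoFrameskip-v4")
    env = gym.wrappers.AtariPreprocessing(env, noop_max=30)
    env = gym.wrappers.FrameStack(env, 4)
    return env

tune.run(
    Trainer,  # the Trainer from the previous section, wrapped as a Tune trainable
    config={
        "env_creator": env_creator,
        "num_workers": 32,
        # Credentials and project for the Weights and Biases logger.
        "wandb": {"project": "deep-neuroevolution", "api_key_file": "~/.wandb_key"},
    },
    loggers=DEFAULT_LOGGERS + (WandbLogger,),
)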

The results

Uber AI let their Genetic Algorithm play for 1 billion game frames using a distributed cluster with 720 cores, which took about one hour. Uber also implemented some logic for efficiently communicating the weights between the nodes, which made it easy to distribute the algorithm. For simplicity’s sake I did not implement that, so distributing the algorithm is harder here. As I also don’t mind waiting a bit, I just used a single e2-highcpu-32 instance on Google Cloud Platform to run the experiment. This instance has 32 cores and 32 GB of RAM and was able to process 500 million frames and 120 generations in about 12 hours. After this, the Genetic Algorithm reached a high score of 3,290, which comes quite close to the 4,536 that Uber reported after 1 billion frames. Below, the minimum, median, and maximum performance can be found for all generations.

Below are two recordings of the best agent playing its worst and best game out of 32 games. Note that there is quite a big difference between its minimum and maximum performance, which is probably due to a lack of robustness towards the added NOOP randomness. We also see that it makes some dumb mistakes, like jumping into the water while standing next to a completed igloo that could have taken it to the next level. Clearly, there is still a lot of room for improvement.

Best policy achieving its maximum score.
Best policy achieving its minimum score.

Another reproduction of the Deep Neuroevolution paper showed that Deep Neuroevolution does not perform well on other games, like Breakout. The author stated that this might be due to the low number of pixels that change between frames in Breakout. As these changes are small, mutations might not be able to pick them up. To test this hypothesis I briefly ran the same experiment on Pong, which also has only a few pixels changing between frames.

Best agent playing Pong

We can see that we are able to evolve a decent Pong-playing agent, but it seems to get stuck in a local optimum. After 80M frames it evolves a very defensive policy that first loses one point and then manages to trick the game into some kind of loop, where it keeps playing the same moves until the game finishes. To get out of this local optimum, one could try one of the following things:

  1. Penalize repeating a sequence of moves.
  2. Add randomness to the action selection or add some noise to the inputs.
  3. Increase the mutation power.
  4. Implement some exploration strategy.

Although the training got stuck, this does show that it’s possible to train Deep Neuroevolution on a game with minimal pixel changes between frames.

Conclusion

We implemented Uber’s Deep Neuroevolution Genetic Algorithm and have seen that it is able to play Frostbite and Pong. As this is only a very simple Genetic Algorithm, I believe much more is possible with this form of Deep Neuroevolution, and I hope to dive into that in a future blog post.

Thanks for reading!
~Daan

All code can be found here.
