:bomb: Bomberman deep reinforcement learning challenge in PyTorch

Pommerman :bomb:

A PyTorch-based reinforcement learning solution for the Pommerman competitions, done as an exam project in course 02456 Deep Learning at DTU (Technical University of Denmark).

Our agent

Our agent always starts in the upper-left corner.

Playing against 3 random agents


Playing against 1 simple and 2 random agents


Playing against 3 simple agents


:wrench: Requirements

In addition to PyTorch (https://pytorch.org) and the usual data science packages (numpy, matplotlib), this project requires the Pommerman playground (https://github.com/MultiAgentLearning/playground) to be installed in your Python environment. The A2C script additionally depends on the colorama package, which helps render the game in a terminal (perfect for running on remote servers).
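A quick way to confirm that the environment is ready is to check that each dependency can be imported. The snippet below is a minimal sketch, not part of the project: the helper name `missing_packages` is hypothetical, and the list assumes the playground installs under the import name `pommerman`.

```python
import importlib.util

# Import names of the dependencies listed above.
# "pommerman" is the import name assumed for the Pommerman playground.
REQUIRED = ["torch", "numpy", "matplotlib", "pommerman", "colorama"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks the packages up and does not import them, so it stays fast even when PyTorch is installed.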

:exclamation: DISCLAIMER

Because this project uses the torch.multiprocessing package, which is not compatible with Jupyter notebooks, the files needed to reproduce our results are supplied as individual Python files. The guide for running these files is given below.

:chart_with_upwards_trend: Imitation learning

To start the imitation learning, first place the log_simpleAgents_sequence_observe.py file in the path playground\pommerman\cli. Then place the file AA_RUN_LOG_SCRIPT.py in the playground folder and run it.

Observations from 10,000 games will now be collected and logged to three files in the pommerman folder. Once logging is complete, run train_rnn_cnn.py to generate the trained imitation model.

Once the actor has been trained, the critic must also observe some games in order to learn to reward correctly before being allowed to affect the model. We do this by placing A3C_v10_cnn_lstm_train-critic.py and sharedAdam.py in the playground folder and running the A3C_v10_cnn_lstm_train-critic.py file.
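The file placement described in this section can be scripted instead of done by hand. The following is a minimal sketch under stated assumptions: the helper `stage_files` and the `PLACEMENT` table are hypothetical, `source_dir` is assumed to be wherever your copy of this project's files lives, and `playground_dir` is assumed to be your local checkout of the Pommerman playground.

```python
import shutil
from pathlib import Path

# Destination of each file, relative to the playground folder,
# following the placement steps described in this section.
PLACEMENT = {
    "log_simpleAgents_sequence_observe.py": Path("pommerman") / "cli",
    "AA_RUN_LOG_SCRIPT.py": Path("."),
    "A3C_v10_cnn_lstm_train-critic.py": Path("."),
    "sharedAdam.py": Path("."),
}


def stage_files(source_dir, playground_dir):
    """Copy the imitation learning scripts into the playground checkout."""
    source_dir, playground_dir = Path(source_dir), Path(playground_dir)
    copied = []
    for name, relative_dir in PLACEMENT.items():
        target_dir = playground_dir / relative_dir
        if not target_dir.is_dir():
            # Fail loudly rather than creating folders in the wrong place.
            raise FileNotFoundError(f"Expected folder not found: {target_dir}")
        copied.append(Path(shutil.copy(source_dir / name, target_dir / name)))
    return copied
```

For example, `stage_files(".", "../playground")` would copy the four files from the current directory into a sibling `playground` checkout; both paths are placeholders to adjust for your setup.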

:chart_with_upwards_trend: A3C Model

To train the A3C model, place the A3C_v10_cnn_lstm.py and sharedAdam.py files in the pommerman folder and run A3C_v10_cnn_lstm.py. Inside the file you can specify a filename that is used to save the checkpoint once the model has been trained. The same filename is used to load the checkpoint again if you wish to train further. The parameter MAX_EP specifies how many episodes to run before saving the checkpoint and terminating.
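The save-and-resume behaviour described above can be illustrated with a small sketch. This is not the code from A3C_v10_cnn_lstm.py: the filename, the checkpoint layout (a dictionary with `model` and `episode` keys), and the helper names are assumptions made for illustration, and the training step itself is left as a placeholder.

```python
import os

import torch
import torch.nn as nn

CHECKPOINT = "a3c_checkpoint.pt"  # hypothetical filename
MAX_EP = 3                        # tiny value, for illustration only


def load_or_init(model, path):
    """Restore weights and the episode counter if a checkpoint exists."""
    if os.path.exists(path):
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state["model"])
        return state["episode"]
    return 0


def save_checkpoint(model, episode, path):
    """Store the weights together with the number of episodes trained."""
    torch.save({"model": model.state_dict(), "episode": episode}, path)


def train(model, path=CHECKPOINT, max_ep=MAX_EP):
    """Run `max_ep` more episodes, then save the checkpoint and stop."""
    start = load_or_init(model, path)
    for episode in range(start, start + max_ep):
        pass  # placeholder: one training episode would run here
    save_checkpoint(model, start + max_ep, path)
    return start + max_ep
```

Calling `train` a second time with the same path picks up the stored weights and continues the episode count, which mirrors the "train further" workflow.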

:chart_with_downwards_trend: A2C Model

To generate the convrnn-s.weights weights file (refreshed every 300 episodes):

python A2C/main.py train

To see how your agent plays (loads the convrnn-s.weights weights file and can be used while the training is running):

python A2C/main.py eval

During training, the current gamma, running reward, action statistics, and loss are printed after each episode. Fully training this model (40,000 episodes) takes around 48 hours on a modern 10-core CPU with a single 1080 Ti GPU. Additionally, a training.txt file is generated with the main statistics for each trained episode.
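For readers unfamiliar with the printed quantities, the sketch below shows how a discount factor gamma and a running reward are commonly computed in actor-critic training. It is an illustration under assumptions, not the code in A2C/main.py: the function names and the smoothing factor of 0.99 are hypothetical, and the project may compute these values differently.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running_return = 0.0
    # Walk backwards so each step can reuse the return of the next step.
    for reward in reversed(rewards):
        running_return = reward + gamma * running_return
        returns.append(running_return)
    returns.reverse()
    return returns


def update_running_reward(running, episode_reward, smoothing=0.99):
    """Exponentially smoothed reward across episodes (smoothing is assumed)."""
    return smoothing * running + (1.0 - smoothing) * episode_reward
```

A gamma close to 1 makes the agent value a late win almost as much as an immediate one, while a smaller gamma emphasises short-term rewards.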

:hammer: Model

The full model used for this project is shown in the image below.


:bar_chart: Main results

The following figure shows that, after 40,000 episodes, A2C performs better than A3C.


Finally, the reward for our architecture is shown below.


:page_facing_up: Paper

See our paper for detailed information about the project.

:bust_in_silhouette: Credits

  • Special thanks to @dimatter for providing the computational resources :heart: