ga-openai-gym
ga-openai-gym copied to clipboard
Usage of genetic algorithms to train a neural network in multiple OpenAI gym environments.
Usage of genetic algorithms to train a neural network in the multiple OpenAI gym environments.
Random Carracing | GA Carracing |
---|---|
![]() |
![]() |
Random BipedalWalker-v2 | GA BipedalWalker-v2 | Small BipedalWalker-v2* |
---|---|---|
![]() |
![]() |
![]() |
* -- see more here
Random Cartpole-v0 | GA Cartpole-v0 |
---|---|
![]() |
![]() |
GA SlimeVolley-v0 |
---|
![]() |
The Slime Volleyball environment turned out to be quite hard for proposed method. The interesting fact to observe is that trained agent (on the right) tries to imitate movement of its opponent. |
Table of contents
- Overview
- Usage
- Details
-
Extra experiments
- Model compression and knowledge distillation
- Multitask learning
- Environments
Overview
The project aims to train neural networks using genetic algorithms. Instead of minimalizing the cost function using common optimizers such as: SGD or Adam the simple GA was used. The algorithm was trying to alter the weights and biases of the neural network to achieve the best score of the fitness function. The profound description of genetic algorithms as well as the used environments are shown below.
Usage
Preliminary
Create virtual environment (python3 -m venv name-of-venv && source name-of-venv/bin/activate
) and install all necessary packages (pip install -r requirements.txt
).
Test pretrained model
The pretrained models are located in models/{name_of_environment}
directory. For example to check how the model performs in BipedalWalker environment, specify name of the
pretrained model in tests/bipedalwalker/testing_model_bipedalwalker.py
script (also one have to adjust the model architecture) and run that script.
Train neural models
Due to my limited computing resources, in training neural models I have used Spell platform (I really recommend it for smaller project)
spell run "python scripts/spell/bipedalwalker_mlp_spell.py" --pip-req requirements.txt
For Carracing environment (need to open virtual display server)
spell run --apt python-dev --apt cmake --apt zlib1g-dev --apt libjpeg-dev \
--apt xvfb --apt ffmpeg --apt xorg-dev --apt python-opengl --apt libboost-all-dev \
--apt libsdl2-dev --apt swig \
"Xvfb :1 -screen 0 1024x768x24 -ac +extension GLX +render -noreset &> xvfb.log ; export DISPLAY=:1 ; python scripts/spell/carracing_conv_spell.py" \
--pip-req requirements.txt
Details
The question arises, since the main objective of neural network is to minimalize the cost function, why can't we use the genetic algorithms instead of SGD or Adam optimizer? The approach, used in this project, assumes application of simple GA in order to train the neural networks in the multiple OpenAI Gym environments.
From neural network to genotype
![]() |
![]() |
![]() |
---|
The figure on the left hand-side depicts the steps in order to turn the weights and biases of the neural network into the genotype. Firstly, the matrices are flattened and the biases vectors are concatenated to them. After one episode of the genetic algorithm, the genotype (vector) is converted into the input neural network architecture (the figure on the right hand-side).
Genetic Algorithm
The vast number of genetic algorithms are constructed using 3 major operations: selection, crossover and mutation. In those experiments I checked many different types of the mentioned algorithms.
- Setup -- As in many various generic algorithms at the beginning the population of individuals (neural networks) are created. The weights and biases of them are initialized randomly.
- Selection -- For each individual the fitness function are calculated. I tested both the ranking and the roulette wheel selection and the former method worked significantly better. As the result, the two individuals are selected with the highest fitness score.
- Crossover -- The 'parents' (two neural networks) are decomposed into the flat vectors and then the simple or the BLX-alpha crossover is performed.
- Mutation -- Each made child has a chance to mutate i.e., alter the weights or biases. I found out that bigger mutation (uniform distribution with large range) is vital to achieve greater results.
- Repeat -- If the sum of fitness score of the created children is greater than theirs parents, the children go to the next generation. The process starts from the beginning (step 2.) until the number of the generations reached the limit or the fitness score was satisfied.
Fitness function
The calculation of the fitness function varies, it depends on the environment. However, the basic example to compute is as follows:
env = gym.make("your environment")
def compute_fitness_function(env, model, n_episodes: int):
obs = env.reset()
fitness = 0
for n in n_episodes:
action = model(obs)
obs, reward, done, info = env.step(action)
fitness += reward
if done:
break
return fitness
![]() |
---|
Mean value of the fitness function for BipedalWalker-v2 problem |
The fitness function in this example is non monotonic and the variance for the generation above 1000 is sizeable. These situations were common during training.
Extra experiments
Model compression and knowledge distillation
Pruning trained neural network
Can we reduce the sizes of the neural network while keeping good performance in the particular environment? The first step was to gather the training data for the smaller (student) model i.e. I took the best model (from BipedalWalker environment) and ran it multiple times whilst saving both the features (observations from environment) and the labels (models' actions). The initial neural model architecture was shrinkage significantly, from 24-20-12-12-4
to 4-4-4
, interestingly, preserving the ability to complete the task.
Multitask learning
![]() |
![]() |
---|---|
Example scheme for training the agent in two environments | Example model architecture for multitask learning |
I used Hard parameters sharing approach to train both the CartPole and BipedalWalker agents. The figures above depict how the neural model architecture looks like.
I also checked how the trained model will behave, if I fed it with different input values, that is:
- Both agents receive correct observations from environments
- CartPole agent receives random noise from uniform distribution, while BipedalWalker agent receives correct observations
- CartPole agent receives correct observations, while BipedalWalker agent receives random noise from uniform distribution
- Both agents receive random noise from uniform distribution.
The results are shown below. To conclude, the both model inputs are important, but not equally. It is clearly visible that for 2. option, despite noised observations in CartPole, the BipedalWalker agent performs reasonable well.
![]() |
![]() |
---|---|
1. BipedalWalker | 1. CartPole |
![]() |
![]() |
2. BipedalWalker | 2. CartPole |
![]() |
![]() |
3. BipedalWalker | 3.CartPole |
![]() |
![]() |
4. BipedalWalker | 4. CartPole |
Environments
CarRacing-v0 |
Architecture |
---|---|
Observation space: - RGB 96x96 pixels frame (with all the control bar on the bottom of the screen) |
|
Actions(continues): - sterring angle (s = [-1, 1]) - throttle (t = [0, 1]) - brake (b = [0, 1]) |
|
Reward - (-0.1) every frame and +1000/N for every track tile visited, where N is the total number of tiles in track |
|
Genetic algorithm parameters - population size: 50 - generation: 2000 - mutation rate: 0.4 - crossover rate: 0.8 |
BipedalWalker-v2 |
Architecture of neural network |
---|---|
Environment (continuous) - 24 observations (hull_angle, vel_x, vel_y and many more) |
|
Actions (continues): - Hip_1 and Hip_2 (Torque / Velocity) - Knee_1 and Knee_2 (Torque / Velocity) |
|
A reward is given for moving forward, total of 300+ points up to the far end. If the robot falls, it gets -100 | |
Episode Termination The episode ends when the robot body touches ground or the robot reaches the far right side of the environment |
|
Genetic algorithm parameters - population size: 50 - generation: 2000 - mutation rate: 0.3 - crossover rate: 0.9 |
Cartpole-v0 |
Architecture of the NN |
---|---|
Environment (continuous) - 4 observations (Cart Position and Velocity, Pole Angle and Velocity at Tip) |
|
Actions (discrete): - Push a cart to the left or right |
|
Reward - 1 for every step taken, including the termination step |
|
Episode Termination - Pole Angle is more than ±12° - Cart Position is more than ±2.4 (center of the cart reaches the edge of the display) - Episode length is greater than 200 |
|
Genetic algorithm parameters - population size: 100 - generation: 20 - mutation rate: 0.4 - crossover rate: 0.9 |
SlimeVolley-v0 |
Architecture of the neural network |
---|---|
Environment (continuous) - 12-dim vector with position and velocity of each agents and the ball |
![]() |
Actions (discrete): - 3-dim vector (go left, right or jump) |
|
Reward For that task I designed custom fitness function: |
|
Episode Termination The episode ends when either the agent losses all five lives, or after 3000 timesteps have passed |
|
Genetic algorithm parameters - Population: 30 - Number of generations: 2000 (the best model at iteration 1824) - Mutation rate: 0.1 - Crossover rate: 0.8 |