
List of Trained Models

hildebrandt-carl opened this issue 2 years ago · 12 comments

Hi, thank you for this amazing work.

I was wondering if there is a list of pre-trained models that we could use? I am mostly working on highway-v0 but any models would be appreciated.

Kind regards, Carl

hildebrandt-carl avatar Jul 14 '21 20:07 hildebrandt-carl

Hi! That is a great idea. It requires a bit of work: running a well-chosen, stable version of an RL library so that results can be reproduced and the models can be loaded again. Stable-baselines is probably the best choice.
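As a rough illustration (not a tested recipe, and the paths here are placeholders), the workflow would mostly amount to saving the trained policy and reloading it with the library's load classmethod, with the library versions pinned alongside the model:

import gym
import highway_env
from stable_baselines import DQN

# Train on highway-v0 and save the resulting policy so it can be shared
env = gym.make("highway-v0")
model = DQN('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=int(1e5))
model.save("trained_models/dqn_highway_v0")

# Anyone with the same library versions can then reload and run it
model = DQN.load("trained_models/dqn_highway_v0", env=env)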

I do not really have the time to run these training experiments right now, but I'll think about it when I do.

In the meantime, you are welcome to open a pull request if you do have some trained models that you are willing to share :)

eleurent avatar Jul 23 '21 10:07 eleurent

Thanks for the feedback. I am glad you like the idea and am more than happy to run the training experiments.

However, I am having problems training the models. I wrote some code to train a DQN on the highway-v0 environment using the stable_baselines package, and then ran a hyperparameter search over several of the hyperparameters. Specifically, I varied:

  • gamma
  • learning_rate
  • buffer_size
  • learning_starts
  • exploration_fraction
  • batch_size

However, I had no luck finding a model that is even remotely as successful as the example in the readme.md. I am a little lost as to where I am going wrong. I have attached the code I am using for training below, followed by a rough sketch of the sweep loop. Any suggestions will be very much appreciated (and implemented and tested promptly :P).

import gym
import highway_env
import numpy as np
from stable_baselines import DQN

# Create the environment
env = gym.make("highway-v0")
obs = env.reset()

# Create the model
model = DQN('MlpPolicy', env,
            gamma=0.8,
            learning_rate=5e-4,
            buffer_size=40*1000,
            learning_starts=200,
            exploration_fraction=0.6,
            batch_size=128,
            verbose=1,
            tensorboard_log="logs/")

# Train and save the model
model.learn(total_timesteps=int(1e5))
model.save("dqn_model")

# Reload it (DQN.load is a classmethod, so the result must be reassigned)
model = DQN.load("dqn_model", env=env)

# Run the algorithm
done = False
obs = env.reset()
while not done:
    # Predict
    action, _states = model.predict(obs)
    # Get reward
    obs, reward, done, info = env.step(action)
    # Render
    env.render()

env.close()
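For completeness, the search itself was just a grid over the parameters listed above; a rough sketch of that sweep (the value grids here are illustrative, not the exact ones I used) looks like this:

import itertools

import gym
import highway_env
from stable_baselines import DQN

# Illustrative value grids for the swept hyperparameters
param_grid = {
    "gamma": [0.8, 0.95],
    "learning_rate": [5e-4, 1e-4],
    "buffer_size": [15 * 1000, 40 * 1000],
    "learning_starts": [200, 1000],
    "exploration_fraction": [0.3, 0.6],
    "batch_size": [32, 128],
}

keys = list(param_grid)
for i, values in enumerate(itertools.product(*param_grid.values())):
    params = dict(zip(keys, values))
    env = gym.make("highway-v0")
    model = DQN('MlpPolicy', env, verbose=0,
                tensorboard_log="logs/sweep_{}/".format(i), **params)
    model.learn(total_timesteps=int(5e4))
    model.save("output/dqn_models/models/sweep_{}".format(i))
    env.close()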

hildebrandt-carl avatar Jul 23 '21 11:07 hildebrandt-carl

Hi @hildebrandt-carl Please note that the gif in the readme is not representative of all episodes obtained with the final policy; some of them still result in collisions (this can be improved by using a more suitable observation and/or model). However, you should still be able to obtain reasonable behaviours like the ones showcased in the gif (slowing down behind a car, attempting to change lane and overtake, etc.) with this standard configuration. Your hyperparameters look alright to me; I'll try to run your script and investigate whether anything goes wrong.

eleurent avatar Jul 26 '21 12:07 eleurent

Hi @eleurent,

I trained 10 models using the code shown above. The only major difference was that I trained them for half the number of time-steps (5e4). However, looking at the TensorBoard logs this seemed reasonable, as the reward had started to level out (the curves are viewed with smoothing set to 0.9).

[Figure: TensorBoard reward curves for the 10 training runs]

I used the following code to run the trained models:

import gym
import argparse
import highway_env

import numpy as np
from tqdm import tqdm
from stable_baselines import DQN

parser = argparse.ArgumentParser()
parser.add_argument('--model_name', type=str, default="output", help="The save name of the run")
parser.add_argument('--episodes', type=int, default=1, help="The number of episodes you want to run")
parser.add_argument('--render', action='store_true')
args = parser.parse_args()

# Create the environment
env = gym.make("highway-v0")
obs = env.reset()

# Load the trained model (DQN.load is a classmethod and restores the saved
# hyperparameters, so there is no need to re-create the model here)
model = DQN.load('output/dqn_models/models/' + str(args.model_name), env=env)

# Init a crash counter
crash_counter = []
total_runs = args.episodes

for e_count in tqdm(range(total_runs), desc="Episode"): 

    # Run the algorithm
    done = False
    obs = env.reset()
    while not done:
        # Predict
        action, _states = model.predict(obs)
        # Get reward
        obs, reward, done, info = env.step(action)
        if args.render:
            # Render
            env.render()

    if info["crashed"] == True:
        crash_counter.append(1)
    else:
        crash_counter.append(0)

env.close()
print("The results:")
print("Crash array: {}".format(crash_counter))
print("Crash percentage: {}".format(np.sum(crash_counter) / len(crash_counter)))

Using this code I ran two experiments:

Experiment 1) Question: What behaviors do the models display under visual inspection? Command: python3 run_dqn.py --render --episodes 1 --model_name model0 Result: I noticed that each model fell into one of four categories [Note: each gif is shown at 2x playback speed]:

  1. It accelerated and never changed lanes, eventually crashing. (gif: straight)
  2. It switched lanes until it was on the outer edge of the road, and then accelerated. (gif: change_lane)
  3. It just idled and never changed lanes. (gif: idle)
  4. It changed lanes randomly all the time. (No recording of this one, unfortunately.)

Experiment 2) Question: How often does the model crash before the end of the episode, over 100 runs? Command: python3 run_dqn.py --episodes 100 --model_name model0 Result: Over all 10 models: minimum crash rate 71%, maximum crash rate 100%, mean crash rate 91%, median crash rate 94%.

I have attached the model files as a zip if anyone wants to replicate these results: models.zip

I am going to repeat this with more time-steps during training, just to confirm that it's not a matter of training longer. Other than that, I am pretty stumped. I am more than happy to explore more directions in my own time; I am just hoping for any advice or directions you might recommend.

Carl

hildebrandt-carl avatar Aug 01 '21 16:08 hildebrandt-carl

I had a similar behaviour; I believe going idle is some sort of local maximum.

My understanding of why it happens is as follows:

Gamma is low, so far-future rewards are heavily discounted.

If we look at the rewards, there is only a marginal difference between the rewards gathered by the different actions.

I believe you can change the rewards and see.
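For example, the reward weights can be changed through the environment config (a rough sketch; the key names below assume a recent highway-env version, so check env.config for the ones available to you):

import gym
import highway_env

env = gym.make("highway-v0")
# Make collisions much more costly relative to the speed/lane terms
env.configure({
    "collision_reward": -5,
    "high_speed_reward": 0.4,
    "right_lane_reward": 0.1,
    "lane_change_reward": 0,
})
obs = env.reset()
print(env.config)  # inspect the full reward configuration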

PhanindraParashar avatar Aug 09 '21 12:08 PhanindraParashar

Hi @hildebrandt-carl, sorry for the delay

I have run some experiments and I do reproduce some of your results (left lane + idle behaviour until crash), so there may be a regression somewhere.

For comparison, here is what I had back in May 2021 (sorry for the 1 fps video). https://user-images.githubusercontent.com/1706935/130582672-d5752fd3-a181-48a2-abe2-401c5a91cd76.mp4

I'll try to investigate this matter in the coming days.

eleurent avatar Aug 24 '21 08:08 eleurent

And just to answer some of your comments:

@hildebrandt-carl : The only major difference was that I only trained them for half the number of time-steps (5e4). However looking at the tensorboard logs this seemed reasonable as the reward had started to level out

That may be misleading: the exploration schedule is scaled by the total number of timesteps, so this leveling off can also be due to a decrease in exploration rather than to reaching a local optimum.
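Concretely, DQN in stable-baselines anneals epsilon linearly over exploration_fraction * total_timesteps steps; a rough sketch (not the library's exact implementation) shows that the same training step is far more exploratory under a 1e5 budget than under a 5e4 one:

# Rough sketch of a linear epsilon schedule like the one used by DQN
def epsilon(step, total_timesteps, exploration_fraction=0.6,
            initial_eps=1.0, final_eps=0.02):
    schedule_steps = exploration_fraction * total_timesteps
    frac = min(step / schedule_steps, 1.0)
    return initial_eps + frac * (final_eps - initial_eps)

print(epsilon(4e4, total_timesteps=5e4))  # ~0.02: schedule already finished
print(epsilon(4e4, total_timesteps=1e5))  # ~0.35: still exploring heavily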

@PhanindraParashar Gamma us lower so far rewards are discounted. If we look into the rewards, there is a marginal difference between rewards gathered by other actions.

With gamma=0.8, we optimise over a horizon of about 10 actions, which represents 10 s of driving and is enough to plan an overtake / slow-down maneuver to avoid collisions. The reward difference mostly comes from the crashed (r = 0) vs non-crashed (r >= 0.5) status, which is enough to learn collision avoidance. These values of reward and discount have worked properly in the past, so I don't think this is the proper cause.
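As a quick illustration of that horizon (one action per second at the default policy frequency):

# Weight of a reward k actions ahead under gamma = 0.8
gamma = 0.8
for k in (1, 5, 10, 20):
    print(k, round(gamma ** k, 3))
# 1 0.8
# 5 0.328
# 10 0.107
# 20 0.012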

eleurent avatar Aug 24 '21 08:08 eleurent

EDIT: for good measure, I tried to train a DQN model using my own rl-agents repository rather than stable-baselines. Here is the policy that I get after 1500 episodes (of max duration 20 steps, i.e. at most 3e4 timesteps in total, 26 minutes of training):

evaluate configs/HighwayEnv/env.json configs/HighwayEnv/agents/DQNAgent/dueling_ddqn.json --train --episodes=1500 --verbose --no-display

https://user-images.githubusercontent.com/1706935/130597405-bd8e15ad-b25f-4cbf-a907-63a82682d10c.mp4

https://user-images.githubusercontent.com/1706935/130597447-3c792ddf-e9b4-4448-a0e2-1170e37584fc.mp4

https://user-images.githubusercontent.com/1706935/130597495-73222fa1-64d3-4eb1-b593-0084f456a9e6.mp4

So I think that the issue is related to stable-baselines rather than highway-env. There was also a major release of stable-baselines3 recently, so something may have changed in the codebase, such as the default values of some hyperparameters. I'll keep investigating in that direction.
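One quick way to check that (plain standard-library introspection, nothing specific to highway-env) is to print the constructor defaults of the installed DQN class in each setup and compare them:

import inspect
from stable_baselines3 import DQN  # or: from stable_baselines import DQN

# Print every keyword argument of DQN.__init__ together with its default value
for name, param in inspect.signature(DQN.__init__).parameters.items():
    if param.default is not inspect.Parameter.empty:
        print(name, "=", param.default)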

eleurent avatar Aug 24 '21 10:08 eleurent

Hi @eleurent,

Thanks for the detailed response! Sorry, I was busy with a paper deadline. The results you get with your own rl-agents repository look much better! I will try a few experiments over the weekend on my side and get back to you with more findings. I am hoping to wrap this up with a few trained models and a write-up or example on how to use them, for anyone else running into this issue.

hildebrandt-carl avatar Sep 02 '21 17:09 hildebrandt-carl

Hi @hildebrandt-carl, no worries! This morning I started investigating the hyperparameter differences between SB3 and my own rl-agents, and after fixing a few of them I get much better results: see below (after just 2e4 steps; training takes about 20 min).

[Image: training results after the hyperparameter fixes]

So I suggest you upgrade highway-env to its latest version, and then use the scripts in https://github.com/eleurent/highway-env/tree/master/scripts as a starting point.
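For reference, a training run with SB3 along those lines boils down to something like the sketch below; the hyperparameter values here are only indicative, so use the ones from the scripts linked above:

import gym
import highway_env
from stable_baselines3 import DQN

env = gym.make("highway-v0")
model = DQN('MlpPolicy', env,
            policy_kwargs=dict(net_arch=[256, 256]),
            learning_rate=5e-4,
            buffer_size=15000,
            learning_starts=200,
            batch_size=32,
            gamma=0.8,
            train_freq=1,
            gradient_steps=1,
            target_update_interval=50,
            exploration_fraction=0.7,
            verbose=1,
            tensorboard_log="highway_dqn/")
model.learn(int(2e4))
model.save("highway_dqn/model")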

eleurent avatar Sep 03 '21 11:09 eleurent

I was looking at some hierarchical methods like Option-Critic, PPOC, and others. I think that if we divide the task into several hierarchical levels, we could speed up the learning process:

  • 1st level learns to pick a lane
  • 2nd level learns to pick a waypoint to reach
  • 3rd level (the controller) learns to execute it

I am still working on that (a rough sketch of the idea is below). @eleurent, what's your opinion on this?
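All class and method names in this sketch are hypothetical; nothing here exists in highway-env or rl-agents, it just illustrates the decomposition:

# Hypothetical three-level hierarchy (names are made up for illustration)
class LanePicker:
    def pick_lane(self, obs):
        # High-level policy: decide which lane to be in
        raise NotImplementedError

class WaypointPicker:
    def pick_waypoint(self, obs, target_lane):
        # Mid-level policy: pick a point to reach in the target lane
        raise NotImplementedError

class Controller:
    def act(self, obs, waypoint):
        # Low-level controller: steer/accelerate towards the waypoint
        raise NotImplementedError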

PhanindraParashar avatar Sep 03 '21 12:09 PhanindraParashar

@PhanindraParashar that looks like a great idea, but I think you should open a separate discussion for it :)

eleurent avatar Sep 03 '21 12:09 eleurent

I had this problem once, and I think the more important aspect, besides the influence of hyperparameters, is the choice of state space. The default value, Kinematics, may not be optimal; I was able to get better results for most hyperparameter settings after changing it to OccupancyGrid. @hildebrandt-carl
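For example, switching the observation is a config change (a sketch; the grid parameters are illustrative and the keys assume a recent highway-env version):

import gym
import highway_env

env = gym.make("highway-v0")
# Replace the default Kinematics observation with an occupancy grid
env.configure({
    "observation": {
        "type": "OccupancyGrid",
        "features": ["presence", "vx", "vy"],
        "grid_size": [[-27.5, 27.5], [-27.5, 27.5]],
        "grid_step": [5, 5],
        "absolute": False,
    }
})
obs = env.reset()
print(obs.shape)  # (n_features, n_cells_x, n_cells_y)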

SEUCGX avatar Jan 15 '23 02:01 SEUCGX