AutonomousDrivingCookbook

DistributedRL training - Loss value is very high and not coming down

Open · kalum84 opened this issue 6 years ago · 1 comment

Problem description

The loss values are very high and do not come down over time.

Problem details

We are trying to create a racing environment and use reinforcement learning to train a model to race, so we started from this example. We wanted to test how much time it takes to train a model and what speed it can reach. I used the same parameters as in the example, except for the following one:

   max_epoch_runtime_sec = 30

I also didn't change the code. I have attached the output file from one agent. Please help me troubleshoot what the issue is.
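
One way to judge whether the loss is genuinely flat rather than just noisy is to smooth the per-batch loss values with a moving average before looking at the trend. A minimal sketch, assuming the losses have already been pulled out of the agent log into a plain Python list (the parsing step is omitted here because it depends on the exact log format):

    # Smooth a sequence of per-batch loss values with a simple moving
    # average so the trend is easier to see. `losses` is assumed to be a
    # plain list of floats extracted from the agent output file.
    def moving_average(losses, window=50):
        smoothed = []
        for i in range(len(losses)):
            chunk = losses[max(0, i - window + 1):i + 1]
            smoothed.append(sum(chunk) / len(chunk))
        return smoothed

    # Example: compare the first and last smoothed values to check whether
    # the loss is actually trending down over the run.
    # smoothed = moving_average(losses)
    # print(smoothed[0], smoothed[-1])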

Experiment/Environment details

I started from the existing weights and trained on Azure with 6 NV6 machines: 5 agents and the trainer. While the job was running, I restarted the agents after some time (after 12 hours), then ran the training for another 20 hours.

agent1.txt

kalum84 · Jan 23 '19 14:01

We discussed a bit offline, but this paper might be of interest to you.

The algorithm as written does not scale infinitely. Try 3 or 4 machines.

Also, the model will overfit - there is no concept of early stopping in the code. Try checking back on it after an hour or an hour and a half.
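
A rough sketch of what that manual cut-off could look like - a wall-clock limit plus periodic checkpointing in place of real early stopping. The three helpers are hypothetical placeholders, not functions from the DistributedRL code:

    import time

    def train_one_iteration():       # placeholder for one trainer update
        pass

    def evaluate_reward():           # placeholder for a short evaluation run
        return 0.0

    def save_weights(path):          # placeholder for persisting model weights
        pass

    MAX_TRAIN_SECONDS = 90 * 60          # stop after ~1.5 hours, as suggested above
    CHECKPOINT_EVERY_SECONDS = 10 * 60   # evaluate and snapshot every 10 minutes

    start = time.time()
    last_checkpoint = start
    best_reward = float('-inf')

    while time.time() - start < MAX_TRAIN_SECONDS:
        train_one_iteration()
        if time.time() - last_checkpoint >= CHECKPOINT_EVERY_SECONDS:
            reward = evaluate_reward()
            if reward > best_reward:          # keep only the best-performing snapshot
                best_reward = reward
                save_weights('best_model.json')
            last_checkpoint = time.time()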

mitchellspryn · Jan 30 '19 07:01