EfficientZero
Training is really slow
First of all, congratulations on the great work!
I've been trying to train an agent to play Breakout, and training is really slow. This is confusing to me since, according to the paper, a full training run of 100k steps should take about 7 hours. My experience has been different:
Running time:
- ~4k steps every 8 hours
Hardware:
- 4 GPUs (Quadro RTX 6000)
- 80 CPUs (4 GB RAM per CPU)
Running command:
python main.py --env atari
--case BreakoutNoFrameskip-v4
--opr train
--amp_type torch_amp
--num_gpus 4
--num_cpus 80
--cpu_actor 5
--gpu_actor 13
--seed 2917
--force
--use_priority
--use_max_priority
--debug
--p_mcts_num 1
Do you have any ideas or advice on how we can optimize the runtime?
@YeWR
It seems you could try more CPU and GPU actors, e.g. --cpu_actor 14 --gpu_actor 20. Since you have 4 RTX 6000s and each RTX 6000 has more than 20 GB of memory, I think the original bash file train.sh should be runnable on your machine.
Hope this helps :)
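For reference, a minimal sketch of an adjusted command, keeping the flags from the original run and only changing the actor counts as suggested above (these values are a starting point, not tuned defaults, and may need adjusting for your machine):

# Same flags as the original run; only --cpu_actor and --gpu_actor are changed
# per the suggestion above.
python main.py --env atari \
  --case BreakoutNoFrameskip-v4 \
  --opr train \
  --amp_type torch_amp \
  --num_gpus 4 \
  --num_cpus 80 \
  --cpu_actor 14 \
  --gpu_actor 20 \
  --seed 2917 \
  --force \
  --use_priority \
  --use_max_priority \
  --debug \
  --p_mcts_num 1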