EfficientZero
Training is really slow
First of all, congratulations on the great work!
I've been trying to train an agent to play Breakout, and training is really slow. This is confusing to me since, according to the paper, a full training run of 100k steps should take about 7 hours. My experience has been different:
Running time:
- ~4k steps every 8 hours
Hardware:
- 4 GPUs (Quadro RTX 6000)
- 80 CPUs (4 GB RAM per CPU)
Running command:
python main.py --env atari
--case BreakoutNoFrameskip-v4
--opr train
--amp_type torch_amp
--num_gpus 4
--num_cpus 80
--cpu_actor 5
--gpu_actor 13
--seed 2917
--force
--use_priority
--use_max_priority
--debug
--p_mcts_num 1
Do you have any ideas or advice on how we can optimize the runtime?
@YeWR
It seems you could try more CPU and GPU actors, e.g. --cpu_actor 14 --gpu_actor 20. Since you have 4 RTX 6000s and each RTX 6000 has more than 20 GB of memory, I think the original bash file train.sh should be runnable on your machine.
Hope this helps :)
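For reference, a minimal sketch of an adjusted command, keeping the flags from the original run and only changing the actor counts as suggested above (these values are a starting point, not tuned defaults, and may need adjusting for your machine):

# Same flags as the original run; only --cpu_actor and --gpu_actor are changed
# per the suggestion above.
python main.py --env atari \
  --case BreakoutNoFrameskip-v4 \
  --opr train \
  --amp_type torch_amp \
  --num_gpus 4 \
  --num_cpus 80 \
  --cpu_actor 14 \
  --gpu_actor 20 \
  --seed 2917 \
  --force \
  --use_priority \
  --use_max_priority \
  --debug \
  --p_mcts_num 1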