Add distributed training with `accelerate`
What this does
Enables launching multi-GPU training with HuggingFace `accelerate` via `python -m accelerate.commands.launch`, and updates evaluation to handle mixed precision.
See the accelerate docs: https://huggingface.co/docs/accelerate/index. You can now launch scripts on multiple GPUs using Data Parallelism, which allows each GPU to process a batch synchronously.
That means you can choose to either:
- use `python lerobot/scripts/train.py` as before, without breaking backward compatibility, or
- use `python -m accelerate.commands.launch lerobot/scripts/train.py` and pass in the desired setup configuration parameters, e.g. `--num_processes=2 --mixed_precision=fp16`.
The number of processes `n` has to be taken into account when setting `training.offline_steps`, since it refers to the number of global steps: during one global step, `n` batches are processed in parallel and seen by the network (`n` forward passes and one synchronized gradient update).
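For reference, this is roughly the integration pattern `accelerate` expects in a training loop (a minimal, self-contained sketch with a dummy model and dataset rather than the actual `train.py` changes; only the `Accelerator` calls are the real API):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Accelerator picks up --num_processes / --mixed_precision from the accelerate launcher.
accelerator = Accelerator()

# Dummy regression task just to show the wiring; lerobot's policy and dataset go here instead.
dataset = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare() wraps the model for DDP, shards the dataloader across processes,
# and moves everything to the right device.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for step, (x, y) in enumerate(dataloader):
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # handles gradient scaling when mixed precision is enabled
    optimizer.step()
    optimizer.zero_grad()
    # One global step = num_processes batches processed in parallel,
    # i.e. num_processes * batch_size samples seen by the network.
```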
How it was tested
`python lerobot/scripts/train.py` still works the same.
Launched runs on Aloha transfer cube
- 2 GPUs
```bash
python -m accelerate.commands.launch --num_processes=2 lerobot/scripts/train.py \
    hydra.job.name=base_distributed_aloha_transfer_cube \
    hydra.run.dir=/fsx/marina_barannikov/outputs/distributed_training_base/aloha_transfer_cube \
    dataset_repo_id=lerobot/aloha_sim_transfer_cube_human \
    policy=act \
    env=aloha env.task=AlohaTransferCube-v0 \
    training.offline_steps=50000 \
    training.accelerate.enable=true training.accelerate.num_processes=2 \
    training.eval_freq=10000 eval.n_episodes=50 eval.use_async_envs=true eval.batch_size=50 \
    wandb.enable=true
```
See run: https://wandb.ai/marinabar/lerobot/runs/yjwia33c?nw=nwuserm1bn
- 74% success rate after 50K global steps, which is the baseline success rate at 100K steps
- after 50K steps, 797K samples are seen, compared to 398K samples when training on a single GPU (see the quick arithmetic below); training took 120 minutes
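As a quick sanity check on those numbers, the samples seen are roughly `global_steps * num_processes * batch_size` (batch size 8 is an assumption here, taken from the default ACT config):

```python
steps = 50_000
batch_size = 8  # assumed default ACT batch size

samples_2_gpus = steps * 2 * batch_size  # 800,000 ~= the 797K reported above
samples_1_gpu = steps * 1 * batch_size   # 400,000 ~= the 398K reported for 1 GPU
```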
- FP16 half precision
```bash
python -m accelerate.commands.launch --num_processes=1 --mixed_precision=fp16 lerobot/scripts/train.py \
    hydra.job.name=fp16_distributed_aloha_transfer_cube \
    hydra.run.dir=/fsx/marina_barannikov/outputs/distributed_training_base/fp16_aloha_transfer_cube \
    dataset_repo_id=lerobot/aloha_sim_transfer_cube_human \
    policy=act \
    env=aloha env.task=AlohaTransferCube-v0 \
    training.offline_steps=50000 \
    training.eval_freq=10000 eval.n_episodes=50 eval.use_async_envs=true eval.batch_size=50 \
    wandb.enable=true
```
See run: https://wandb.ai/marinabar/lerobot/runs/seanp0c2?nw=nwuserm1bn
- 78% success rate after 70K steps
- training took 120 minutes, which is about twice as fast as full precision
How to check out & try? (for the reviewer)
```bash
python -m accelerate.commands.launch --num_processes=2 lerobot/scripts/train.py \
    training.offline_steps=5000
```
```bash
python -m accelerate.commands.launch --mixed_precision=fp16 lerobot/scripts/eval.py \
    --out-dir outputs/accelerate_eval/fp16 -p lerobot/diffusion_pusht eval.n_episodes=10 eval.use_async_envs=false eval.batch_size=10
```
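For the mixed-precision part of eval, the usual pattern is to run the policy's forward pass inside accelerate's autocast context (a hedged sketch of that pattern, not the exact `eval.py` changes; the linear `policy` and random `observation` are placeholders):

```python
import torch
from accelerate import Accelerator

# --mixed_precision=fp16 passed to the launcher makes autocast() run in fp16.
accelerator = Accelerator()

policy = torch.nn.Linear(16, 4)  # placeholder for the loaded policy
policy = accelerator.prepare(policy)
policy.eval()

observation = torch.randn(1, 16, device=accelerator.device)  # placeholder observation batch
with torch.no_grad(), accelerator.autocast():
    action = policy(observation)  # forward pass runs in half precision where safe
```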
I tried eval'ing with AMP on your branch and don't get a speedup vs no AMP. But on main I do get a speedup.
Specifically:
On main this takes 1:04:
```bash
python lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true
```
On main this takes 0:54:
```bash
python lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true use_amp=True
```
On yours this takes 1:05:
```bash
python lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true
```
On yours this takes 1:05:
```bash
python -m accelerate.commands.launch --mixed_precision=fp16 lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true
```
Any immediate plans to support multi-GPU training?
Thank you so much for the PR! However, we're closing this as we recently added multi-GPU training support with accelerate: https://github.com/huggingface/lerobot/pull/2154