
Add distributed training with `accelerate`

marinabar opened this pull request

What this does

Enables launching multi-GPU training with Hugging Face accelerate and updates evaluation to handle mixed precision when launched with python -m accelerate.commands.launch.

See https://huggingface.co/docs/accelerate/index. You can now launch the training script on multiple GPUs using data parallelism, which lets each GPU process its own batch synchronously.

That means you can choose to either

  • use python lerobot/scripts/train.py without breaking backward compatibility, or
  • use python -m accelerate.commands.launch lerobot/scripts/train.py and pass in the desired setup configuration parameters, e.g. --num_processes=2, --mixed_precision=fp16

The number of processes n has to be taken into account when setting training.offline_steps, since this refers to the number of global steps. During one global step, n batches are processed in parallel and seen by the network: there are n forward passes and one backward pass.
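As a rough worked example (the per-process batch size of 8 is an assumption for illustration; the actual value comes from the training config):

num_processes = 2            # --num_processes
batch_size_per_process = 8   # assumed per-GPU batch size (illustrative)
offline_steps = 50_000       # training.offline_steps (global steps)

samples_per_global_step = num_processes * batch_size_per_process   # 16
total_samples_seen = offline_steps * samples_per_global_step       # 800_000
print(total_samples_seen)  # roughly matches the ~797K samples reported below for the 2-GPU run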

The integration is built around accelerate's Accelerator class.
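For context, a minimal sketch of the usual Accelerator pattern (with a toy model standing in for the policy; this is not the exact code from the PR):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy stand-ins for the policy, optimizer and dataloader (illustrative only).
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)

# Process count and mixed precision come from the launcher flags,
# e.g. python -m accelerate.commands.launch --num_processes=2 --mixed_precision=fp16 ...
accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    loss = nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward(); handles fp16 gradient scaling
    optimizer.step()
    optimizer.zero_grad()

# Only the main process should log metrics or save checkpoints.
if accelerator.is_main_process:
    print(f"final loss: {loss.item():.4f}")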

How it was tested

python lerobot/scripts/train.py still works the same

Launched runs on Aloha transfer cube

  • 2 GPUs
python -m accelerate.commands.launch --num_processes=2 lerobot/scripts/train.py \
 hydra.job.name=base_distributed_aloha_transfer_cube \
 hydra.run.dir=/fsx/marina_barannikov/outputs/distributed_training_base/aloha_transfer_cube \
 dataset_repo_id=lerobot/aloha_sim_transfer_cube_human \
 policy=act \
 env=aloha env.task=AlohaTransferCube-v0 \
 training.offline_steps=50000 \
 training.accelerate.enable=true training.accelerate.num_processes=2 \
 training.eval_freq=10000 eval.n_episodes=50 eval.use_async_envs=true eval.batch_size=50 \
 wandb.enable=true

See run: https://wandb.ai/marinabar/lerobot/runs/yjwia33c?nw=nwuserm1bn

  • 74% success rate after 50K global steps, which matches the baseline success rate at 100K steps

  • after 50K steps, 797K samples have been seen, compared to 398K samples when training on a single GPU; training took 120 minutes

  • FP16 (half precision)

python -m accelerate.commands.launch --num_processes=1 --mixed_precision=fp16 lerobot/scripts/train.py \
 hydra.job.name=fp16_distributed_aloha_transfer_cube \
 hydra.run.dir=/fsx/marina_barannikov/outputs/distributed_training_base/fp16_aloha_transfer_cube \
 dataset_repo_id=lerobot/aloha_sim_transfer_cube_human \
 policy=act \
 env=aloha env.task=AlohaTransferCube-v0 \
 training.offline_steps=50000 \
 training.eval_freq=10000 eval.n_episodes=50 eval.use_async_envs=true eval.batch_size=50 \
 wandb.enable=true

https://wandb.ai/marinabar/lerobot/runs/seanp0c2?nw=nwuserm1bn

  • 78% success rate after 70K steps
  • Training took 120 minutes, which is about twice as fast as full precision

How to check out & try? (for the reviewer)

python -m accelerate.commands.launch --num_processes=2 lerobot/scripts/train.py \
 training.offline_steps=5000
python -m accelerate.commands.launch --mixed_precision=fp16 lerobot/scripts/eval.py \
--out-dir outputs/accelerate_eval/fp16 -p lerobot/diffusion_pusht eval.n_episodes=10 eval.use_async_envs=false eval.batch_size=10

marinabar · Jul 11 '24

I tried eval'ing with AMP on your branch and don't get a speedup vs no AMP. But on main I do get a speedup.

Specifically:

On main this takes 1:04: python lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true

On main this takes 0:54: python lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true use_amp=True

On yours this takes 1:05: python lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true

On yours this takes 1:05: python -m accelerate.commands.launch --mixed_precision=fp16 lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true
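For reference, mixed precision only affects inference time if the policy's forward pass actually runs under an autocast context (and on hardware with fast fp16 kernels); a minimal sketch with a generic torch module, not lerobot's actual eval code:

import torch
from torch import nn

model = nn.Linear(10, 1).eval()
device_type = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
model = model.to(device_type)
x = torch.randn(4, 10, device=device_type)

# The forward pass has to run inside autocast for low-precision kernels to be used.
with torch.inference_mode(), torch.autocast(device_type=device_type, dtype=dtype):
    y = model(x)
print(y.dtype)  # torch.float16 on GPU (torch.bfloat16 on CPU) when autocast-eligible ops ran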

alexander-soare · Jul 24 '24

Any immediate plans to support multi-GPU training?

youliangtan · Apr 29 '25

Thank you so much for the PR! However, closing this as we recently added support for multi-GPU training with accelerate: https://github.com/huggingface/lerobot/pull/2154

jadechoghari · Oct 17 '25