Add distributed training with `accelerate`
What this does
Enables launching multi-GPU training with HuggingFace `accelerate` via `python -m accelerate.commands.launch`, and updates evaluation to handle mixed precision.
See the accelerate docs: https://huggingface.co/docs/accelerate/index. You can now launch scripts on multiple GPUs using Data Parallelism, which allows each GPU to process a batch synchronously.
That means you can choose to either:
- use `python lerobot/scripts/train.py` as before, without breaking backward compatibility, or
- use `python -m accelerate.commands.launch lerobot/scripts/train.py` and pass in the desired setup configuration parameters, e.g. `--num_processes=2 --mixed_precision=fp16`.
The number of processes `n` has to be taken into account when setting `training.offline_steps`, since it refers to the number of global steps: during one global step, `n` batches are processed in parallel and seen by the network (`n` forward passes and one synchronized gradient update).
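For reference, this is roughly the integration pattern `accelerate` expects in a training loop (a minimal, self-contained sketch with a dummy model and dataset rather than the actual `train.py` changes; only the `Accelerator` calls are the real API):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Accelerator picks up --num_processes / --mixed_precision from the accelerate launcher.
accelerator = Accelerator()

# Dummy regression task just to show the wiring; lerobot's policy and dataset go here instead.
dataset = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare() wraps the model for DDP, shards the dataloader across processes,
# and moves everything to the right device.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for step, (x, y) in enumerate(dataloader):
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # handles gradient scaling when mixed precision is enabled
    optimizer.step()
    optimizer.zero_grad()
    # One global step = num_processes batches processed in parallel,
    # i.e. num_processes * batch_size samples seen by the network.
```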
How it was tested
`python lerobot/scripts/train.py` still works the same.
Launched runs on Aloha transfer cube
- 2 GPUs
```bash
python -m accelerate.commands.launch --num_processes=2 lerobot/scripts/train.py \
    hydra.job.name=base_distributed_aloha_transfer_cube \
    hydra.run.dir=/fsx/marina_barannikov/outputs/distributed_training_base/aloha_transfer_cube \
    dataset_repo_id=lerobot/aloha_sim_transfer_cube_human \
    policy=act \
    env=aloha env.task=AlohaTransferCube-v0 \
    training.offline_steps=50000 \
    training.accelerate.enable=true training.accelerate.num_processes=2 \
    training.eval_freq=10000 eval.n_episodes=50 eval.use_async_envs=true eval.batch_size=50 \
    wandb.enable=true
```
See run: https://wandb.ai/marinabar/lerobot/runs/yjwia33c?nw=nwuserm1bn
- 74% success rate after 50K global steps, which is the baseline success rate at 100K steps
- after 50K steps, 797K samples are seen, compared to 398K samples when training on a single GPU (see the quick arithmetic below); training took 120 minutes
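As a quick sanity check on those numbers, the samples seen are roughly `global_steps * num_processes * batch_size` (batch size 8 is an assumption here, taken from the default ACT config):

```python
steps = 50_000
batch_size = 8  # assumed default ACT batch size

samples_2_gpus = steps * 2 * batch_size  # 800,000 ~= the 797K reported above
samples_1_gpu = steps * 1 * batch_size   # 400,000 ~= the 398K reported for 1 GPU
```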
- FP16 half precision
```bash
python -m accelerate.commands.launch --num_processes=1 --mixed_precision=fp16 lerobot/scripts/train.py \
    hydra.job.name=fp16_distributed_aloha_transfer_cube \
    hydra.run.dir=/fsx/marina_barannikov/outputs/distributed_training_base/fp16_aloha_transfer_cube \
    dataset_repo_id=lerobot/aloha_sim_transfer_cube_human \
    policy=act \
    env=aloha env.task=AlohaTransferCube-v0 \
    training.offline_steps=50000 \
    training.eval_freq=10000 eval.n_episodes=50 eval.use_async_envs=true eval.batch_size=50 \
    wandb.enable=true
```
See run: https://wandb.ai/marinabar/lerobot/runs/seanp0c2?nw=nwuserm1bn
- 78% success rate after 70K steps
- training took 120 minutes, which is about twice as fast as full precision
How to check out & try? (for the reviewer)
```bash
python -m accelerate.commands.launch --num_processes=2 lerobot/scripts/train.py \
    training.offline_steps=5000
```
```bash
python -m accelerate.commands.launch --mixed_precision=fp16 lerobot/scripts/eval.py \
    --out-dir outputs/accelerate_eval/fp16 -p lerobot/diffusion_pusht eval.n_episodes=10 eval.use_async_envs=false eval.batch_size=10
```
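For the mixed-precision part of eval, the usual pattern is to run the policy's forward pass inside accelerate's autocast context (a hedged sketch of that pattern, not the exact `eval.py` changes; the linear `policy` and random `observation` are placeholders):

```python
import torch
from accelerate import Accelerator

# --mixed_precision=fp16 passed to the launcher makes autocast() run in fp16.
accelerator = Accelerator()

policy = torch.nn.Linear(16, 4)  # placeholder for the loaded policy
policy = accelerator.prepare(policy)
policy.eval()

observation = torch.randn(1, 16, device=accelerator.device)  # placeholder observation batch
with torch.no_grad(), accelerator.autocast():
    action = policy(observation)  # forward pass runs in half precision where safe
```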
I tried eval'ing with AMP on your branch and don't get a speedup vs no AMP. But on main I do get a speedup.
Specifically:
On main this takes 1:04:
```bash
python lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true
```
On main this takes 0:54:
```bash
python lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true use_amp=True
```
On yours this takes 1:05:
```bash
python lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true
```
On yours this takes 1:05:
```bash
python -m accelerate.commands.launch --mixed_precision=fp16 lerobot/scripts/eval.py -p lerobot/diffusion_pusht policy.num_inference_steps=100 +policy.noise_scheduler_type=DDIM eval.use_async_envs=true
```
Any immediate plans to support multi-GPU training?
Thank you so much for the PR! However, we're closing this as we recently added multi-GPU training support with accelerate: https://github.com/huggingface/lerobot/pull/2154