
Support multi-GPU training with accelerate

Open mshukor opened this issue 10 months ago • 3 comments

What this does

This PR adds support for training on multiple GPUs using the accelerate library.

How it was tested

Launching training on aloha sim with multiple GPUs and obtaining similar scores.

Example (this requires installing accelerate first):

pip install accelerate

POLICY=act

ENV=aloha
TASK=AlohaTransferCube-v0
REPO_ID=lerobot/aloha_sim_transfer_cube_human
DATASET_NAME=aloha_sim_transfer_cube_human

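# GPUS must be set before TASK_NAME, which interpolates it.
GPUS=2
TASK_NAME=lerobot_${DATASET_NAME}_${POLICY}_gpus${GPUS}
# WORK is expected to be set in the environment already.
TRAIN_DIR=$WORK/logs/lerobot/$TASK_NAME
echo "$TRAIN_DIR"

PORT=29502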
OFFLINE_STEPS=100000
EVAL_FREQ=1000
BATCH_SIZE=8
EVAL_BATCH_SIZE=10
SAVE_FREQ=10000

export MUJOCO_GL=egl

python -m accelerate.commands.launch --num_processes=$GPUS --mixed_precision=fp16 --main_process_port=$PORT lerobot/scripts/train.py \
     --policy.type=$POLICY  \
     --dataset.repo_id=$REPO_ID \
     --env.type=$ENV \
     --env.task=$TASK \
     --output_dir=$TRAIN_DIR \
     --batch_size=$BATCH_SIZE \
     --steps=$OFFLINE_STEPS \
     --eval_freq=$EVAL_FREQ \
     --save_freq=$SAVE_FREQ \
     --eval.batch_size=$EVAL_BATCH_SIZE \
     --eval.n_episodes=$EVAL_BATCH_SIZE
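
As a rough sanity check on the settings above, it can help to work out how many samples each optimizer step consumes. The sketch below assumes the training script lets accelerate shard the dataloader in its default mode, where every process draws its own batch of `--batch_size` samples; that is an assumption about this PR, not something stated in it. The function names are hypothetical helpers, not part of lerobot or accelerate.

```python
def global_batch_size(per_process_batch_size: int, num_processes: int) -> int:
    """Samples consumed per optimizer step across all processes.

    Assumption: each process draws its own batch of `per_process_batch_size`
    samples (batches are not split across processes).
    """
    if per_process_batch_size <= 0 or num_processes <= 0:
        raise ValueError("batch size and process count must be positive")
    return per_process_batch_size * num_processes


def samples_seen(steps: int, per_process_batch_size: int, num_processes: int) -> int:
    """Total samples consumed after `steps` optimizer steps."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return steps * global_batch_size(per_process_batch_size, num_processes)


if __name__ == "__main__":
    # Values taken from the script above: BATCH_SIZE=8, GPUS=2, OFFLINE_STEPS=100000.
    print("global batch size:", global_batch_size(8, 2))
    print("samples seen:", samples_seen(100_000, 8, 2))
```

Under that assumption, a 2-GPU run with `BATCH_SIZE=8` sees twice as many samples per step as a single-GPU run with the same flag, which is worth keeping in mind when comparing scores between the two.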

Feb 26 '25 16:02 mshukor

@bot /style

Feb 27 '25 15:02 qgallouedec

Style fixes have been applied. View the workflow run here.

Feb 27 '25 15:02 github-actions[bot]

"[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 2 has a total capacity of 31.74 GiB of which 43.38 MiB is free. Process 3850605 has 31.69 GiB memory in use. Of the allocated memory 31.08 GiB is allocated by PyTorch, and 94.10 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"

How can I resolve this out-of-memory error?

Jun 19 '25 07:06 zzzmy-all
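
One way to read the error message quoted above is to pull out its figures and compare them. The sketch below is a hypothetical helper (not part of PyTorch, accelerate, or lerobot) that parses only the message text and reports how much of the GPU is held by live tensors versus how much is reserved but unallocated, which is the quantity the `expandable_segments` hint targets.

```python
import re

# Text copied from the error message quoted above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 64.00 MiB. "
    "GPU 2 has a total capacity of 31.74 GiB of which 43.38 MiB is free. "
    "Process 3850605 has 31.69 GiB memory in use. "
    "Of the allocated memory 31.08 GiB is allocated by PyTorch, "
    "and 94.10 MiB is reserved by PyTorch but unallocated."
)

_UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# One regex per figure; each captures (number, unit).
_PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "capacity": r"total capacity of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
}


def parse_oom(message: str) -> dict[str, float]:
    """Return the memory figures from an OOM message, all converted to MiB."""
    stats = {}
    for name, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find {name!r} in message")
        value, unit = match.groups()
        stats[name] = float(value) * _UNIT_TO_MIB[unit]
    return stats


if __name__ == "__main__":
    stats = parse_oom(OOM_MESSAGE)
    allocated_fraction = stats["allocated"] / stats["capacity"]
    headroom = stats["free"] + stats["reserved_unallocated"]
    print(f"allocated by PyTorch: {allocated_fraction:.1%} of capacity")
    print(f"free + reserved-but-unallocated: {headroom:.2f} MiB")
```

For this message, roughly 98% of the card is held by allocated tensors and the combined free plus reserved-but-unallocated memory is about 137 MiB. That suggests fragmentation is probably not the main cause here, so lowering the per-process `--batch_size` is more likely to help than allocator settings alone, although this reading is an inference from the message rather than a confirmed diagnosis.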

Thank you so much for the PR. However, we are closing this because we recently added support for multi-GPU training with accelerate: https://github.com/huggingface/lerobot/pull/2154

Oct 17 '25 11:10 jadechoghari