Support multi-GPU training with accelerate
What this does
This PR adds support for training on multiple GPUs using the accelerate library.
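Conceptually, the change follows the usual accelerate pattern: wrap the training loop with an Accelerator, let prepare() place modules and shard the dataloader across processes, and route the backward pass through the Accelerator. The sketch below illustrates that pattern only; the model, dataset, and hyperparameters are placeholders, not lerobot's actual code.

import torch
import torch.nn.functional as F
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()  # picks up num_processes / mixed_precision from the launcher

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))  # placeholder data
dataloader = DataLoader(dataset, batch_size=8)  # per-process batch size

# prepare() moves model/optimizer to each process's device and shards the dataloader
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward(); handles gradient sync and loss scaling
    optimizer.step()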
How it was tested
Launched training on the aloha sim environment with multiple GPUs and obtained scores similar to single-GPU training.
Example (this requires installing accelerate):
pip install accelerate
POLICY=act
ENV=aloha
TASK=AlohaTransferCube-v0
REPO_ID=lerobot/aloha_sim_transfer_cube_human
DATASET_NAME=aloha_sim_transfer_cube_human
GPUS=2  # must be set before TASK_NAME, which interpolates it
TASK_NAME=lerobot_${DATASET_NAME}_${POLICY}_gpus${GPUS}
TRAIN_DIR=$WORK/logs/lerobot/$TASK_NAME  # $WORK is your own log/scratch root
echo $TRAIN_DIR
PORT=29502
OFFLINE_STEPS=100000
EVAL_FREQ=1000
BATCH_SIZE=8  # per-process batch size; the effective batch is typically GPUS x this
EVAL_BATCH_SIZE=10
SAVE_FREQ=10000
export MUJOCO_GL=egl  # headless MuJoCo rendering for eval rollouts
python -m accelerate.commands.launch --num_processes=$GPUS --mixed_precision=fp16 --main_process_port=$PORT lerobot/scripts/train.py \
--policy.type=$POLICY \
--dataset.repo_id=$REPO_ID \
--env.type=$ENV \
--env.task=$TASK \
--output_dir=$TRAIN_DIR \
--batch_size=$BATCH_SIZE \
--steps=$OFFLINE_STEPS \
--eval_freq=$EVAL_FREQ \
--save_freq=$SAVE_FREQ \
--eval.batch_size=$EVAL_BATCH_SIZE \
--eval.n_episodes=$EVAL_BATCH_SIZE
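As a usage note, the same launch can presumably also be written with accelerate's console script, which wraps the same accelerate.commands.launch module (a sketch; the flags mirror the command above):

accelerate launch --num_processes=$GPUS --mixed_precision=fp16 --main_process_port=$PORT \
  lerobot/scripts/train.py --policy.type=$POLICY  # ...same remaining train.py arguments as above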
@bot /style
Style fixes have been applied.
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 2 has a total capacity of 31.74 GiB of which 43.38 MiB is free. Process 3850605 has 31.69 GiB memory in use. Of the allocated memory 31.08 GiB is allocated by PyTorch, and 94.10 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

How can I solve this out-of-memory error?
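For reference, the traceback's own suggestion, together with the most direct lever (a smaller per-process batch size), would be applied before launching like this (a sketch only; neither step is guaranteed to make the model fit on a 32 GiB card):

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True  # allocator setting suggested by the error message itself
BATCH_SIZE=4  # illustrative: halve the per-process batch size from the example above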
Thank you so much for the PR! However, we're closing this one, as we recently added multi-GPU training support with accelerate: https://github.com/huggingface/lerobot/pull/2154