[BUG]: bug in training rm with ddp strategy with single machine multi-GPUs!

Open xHansonx opened this issue 2 years ago • 3 comments

🐛 Describe the bug

Code:

torchrun --standalone --nproc_per_node=1 train_reward_model.py --dataset Dahoas/rm-static --subset ../../../datasets/Dahoas_rm-static --max_len 512 --model gpt2 --pretrain ../../../gpt2/gpt2-small --lora_rank 0 --max_epochs 1 --batch_size 1 --loss_fn log_sig --test True --need_optim_ckpt True --strategy ddp --save_path rm_ckpt.pt

Error:

[Two screenshots of the error output; images not included]

Environment

No response

xHansonx avatar Apr 04 '23 02:04 xHansonx

Can I know your environment settings such as your machine type as well as torch, Python versions?

JThh avatar Apr 05 '23 19:04 JThh

> Can I know your environment settings such as your machine type as well as torch, Python versions?

------------ Environment ------------
Colossal-AI version: 0.2.8
PyTorch version: 1.12.1
System CUDA version: 11.3
CUDA version required by PyTorch: 11.3

xHansonx avatar Apr 06 '23 01:04 xHansonx

Sorry for getting to your question late. May I know why you are setting nproc_per_node=1 when you have multiple GPUs on the machine and the strategy is set to ddp?

JThh avatar Apr 24 '23 15:04 JThh
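
To make the question above concrete, here is a minimal sketch (not taken from the training script) of how one might check, from inside a worker, how many processes torchrun started on the node versus how many GPUs are visible. The helper name `launch_summary` and its return keys are hypothetical; the only assumptions about the tooling are that torchrun exports the `LOCAL_WORLD_SIZE` and `LOCAL_RANK` environment variables to each worker and that `torch.cuda.device_count()` reports the visible GPUs.

```python
import os


def launch_summary(env=None, visible_gpus=None):
    """Compare processes launched on this node with the GPUs visible to it.

    `launch_summary` is a hypothetical helper written for illustration only.
    """
    env = os.environ if env is None else env
    # torchrun exports these to every worker; default to a single process
    # when the script is started without a launcher.
    local_world_size = int(env.get("LOCAL_WORLD_SIZE", 1))
    local_rank = int(env.get("LOCAL_RANK", 0))

    if visible_gpus is None:
        try:
            import torch  # imported lazily so the helper works without torch
            visible_gpus = torch.cuda.device_count()
        except ImportError:
            visible_gpus = 0

    return {
        "local_rank": local_rank,
        "processes_on_node": local_world_size,
        "visible_gpus": visible_gpus,
        # True when some GPUs would sit idle under one-process-per-GPU DDP
        "gpus_unused": visible_gpus > local_world_size,
    }
```

With `--nproc_per_node=1` on a machine with several GPUs, this sketch would report `gpus_unused` as true, since DDP is normally run with one process per GPU. Whether that mismatch is related to the reported error is not established in this thread.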

Thanks for reporting. This is now covered by #4023.

cwher avatar Jun 29 '23 07:06 cwher