Open-Assistant
RuntimeError: Timed out initializing process group in store based barrier on rank 2
I am trying to run pretraining of LLaMA 30B. Here is the command I am running:
deepspeed trainer_sft.py --configs defaults llama-30b-pretrain pretrain --cache_dir $DATA_PATH --output_dir $MODEL_PATH/llama-30b-pre --deepspeed
After the model was loaded, it hung for a long time (about 30 minutes, I think, since PyTorch's default timeout is 30 minutes). Then this error was raised:
RuntimeError: Timed out initializing process group in store based barrier on rank 2 # same error on all ranks
Any solutions?
We have not seen this error during our training runs. Could you try smaller/different models first? Are you using the latest version of deepspeed? Which GPU and cuda version are you using? Do you have access to a different machine on which you could cross-check?
@andreaskoepf
Yes, I am using the latest versions of the code and deepspeed, at least as of last week.
I am using 8x A100 (80 GB) with CUDA 11.7.
I have tried reducing the pretrain datasets here (keeping only alpaca_gpt4), and with that it runs successfully, so I don't think the model is the cause.
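Since the run succeeds with a smaller dataset mix, one possible explanation (an assumption, not a confirmed diagnosis) is that some ranks are still preparing the full datasets when the 30-minute barrier timeout expires. Below is a minimal sketch of raising that timeout through the `timeout` argument of `torch.distributed.init_process_group`. The helper names `pg_timeout` and `init_process_group_with_timeout` are hypothetical, and where `trainer_sft.py` actually initializes the process group is not shown in this thread, so this would need to be adapted to wherever that happens.

```python
from datetime import timedelta


def pg_timeout(hours: float) -> timedelta:
    """Build the timeout value to pass to the process group (hypothetical helper)."""
    return timedelta(hours=hours)


def init_process_group_with_timeout(hours: float = 2.0, backend: str = "nccl") -> None:
    """Initialize torch.distributed with a longer timeout than the 30-minute
    default mentioned above (hypothetical helper).

    Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are already set in the
    environment by the launcher, and that nothing else has initialized the
    process group yet.
    """
    # Imported lazily so the helper above can be used without torch installed.
    import torch.distributed as dist

    dist.init_process_group(backend=backend, timeout=pg_timeout(hours))
```

If the longer timeout only delays the same error, the hang is more likely a communication problem between ranks than slow data preparation.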