Open-Assistant RuntimeError: Timed out initializing process group in store based barrier on rank 2

Open SingL3 opened this issue 2 years ago • 2 comments

I am trying to pretrain LLaMA 30B. Here is the command I am running:

deepspeed trainer_sft.py --configs defaults llama-30b-pretrain pretrain --cache_dir $DATA_PATH --output_dir $MODEL_PATH/llama-30b-pre --deepspeed

After the model was loaded, it hung for a long time (I think 30 minutes, since PyTorch's default timeout is 30 minutes). Then this error was raised:

RuntimeError: Timed out initializing process group in store based barrier on rank 2 # same for all ranks

Any solutions?

Aug 02 '23 02:08 SingL3

We have not seen this error during our training runs. Could you try smaller/different models first? Are you using the latest version of deepspeed? Which GPU and cuda version are you using? Do you have access to a different machine on which you could cross-check?

Aug 08 '23 09:08 andreaskoepf

@andreaskoepf Yes, at least the latest versions as of last week, including deepspeed. I am using 8x A100 (80G) with CUDA 11.7. I tried reducing the pretrain datasets (keeping only alpaca_gpt4) and the run succeeded, so I don't think the model is the cause.

Aug 08 '23 11:08 SingL3
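
Since the run succeeds with a smaller dataset, one thing that may be worth trying is raising the process-group timeout above the 30 minutes mentioned above, on the assumption that some ranks are slow rather than truly hung. The sketch below is not taken from trainer_sft.py: the PG_TIMEOUT_MINUTES environment variable and both helper names are hypothetical, and it assumes the script is free to call torch.distributed.init_process_group itself before anything else initializes the group. The launcher is expected to have set the usual rank and master-address environment variables.

```python
import os
from datetime import timedelta


def pg_timeout(default_minutes: int = 30) -> timedelta:
    """Return the process-group timeout.

    Reads the hypothetical PG_TIMEOUT_MINUTES environment variable and
    falls back to default_minutes when it is not set.
    """
    minutes = int(os.environ.get("PG_TIMEOUT_MINUTES", default_minutes))
    if minutes <= 0:
        raise ValueError("timeout must be a positive number of minutes")
    return timedelta(minutes=minutes)


def init_distributed_with_timeout() -> None:
    """Initialize the default process group with the configured timeout."""
    # Imported lazily so pg_timeout() can be used without PyTorch installed.
    import torch.distributed as dist

    # Only the first initialization can set the timeout; skip if some other
    # component has already created the default group.
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl", timeout=pg_timeout())
```

With this in place, something like `PG_TIMEOUT_MINUTES=120` in the environment would give every rank two hours to reach the barrier. If the script builds Hugging Face TrainingArguments, its ddp_timeout field (in seconds) serves a similar purpose; whether this repository's config exposes it is not confirmed here. A longer timeout only helps if the slow ranks eventually arrive; it will not fix a genuine hang.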
