MeZO
Getting a RuntimeError after training with mezo
Hello, Thank you for sharing your work! I'm getting the error below after training with the mezo.sh script:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
The problem persists when I use multiple GPUs. Thanks!
Hi,
It looks like an error related to multiprocessing. Can you report your PyTorch/Transformers versions here?
Hello, Thanks for the reply. I first used PyTorch 1.13 + Transformers 4.29.2 and then updated to PyTorch 2.0.1 + Transformers 4.29.2, and the issue persists. If I make the number of eval steps larger than the number of fine-tuning steps, so that evaluation only happens at the end of training, the error disappears, but the model evaluated at the end performs poorly (~50% on the SST-2 task, which is random performance).
Also, I forgot to mention that the error was encountered in the large-models case. Thanks!
Hi,
Do you mind posting a full error log? Also, have you tried using just a single GPU?
Hi, Here is the full error log:
Traceback (most recent call last):
File "../mezo/large_models/run.py", line 533, in <module>
main()
File "../mezo/large_models/run.py", line 498, in main
framework.train(train_samples, dev_samples if dev_samples is not None else eval_samples)
File "../mezo/large_models/run.py", line 433, in train
trainer.train(resume_from_checkpoint=last_checkpoint)
File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "../mezo/large_models/trainer.py", line 660, in _inner_training_loop
dist.barrier()
File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3327, in barrier
default_pg = _get_default_group()
File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 707, in _get_default_group
raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Yes, I tried both single and multi GPU.
It's weird, because with a single GPU it shouldn't use the distributed training part of torch at all.
This line
File "../mezo/large_models/trainer.py", line 660, in _inner_training_loop
will only be triggered when args.local_rank != -1, which means you are using multiple GPUs.
Can you make sure you are not using multiple GPUs and not using torchrun, accelerate, or srun?
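For reference, here is a small diagnostic sketch (my own, not part of the MeZO code) you can run with the same launcher/environment as mezo.sh to see whether something is silently putting the run into distributed mode:

# Diagnostic sketch: check whether the launcher set distributed-related
# environment variables and whether a default process group exists.
import os
import torch.distributed as dist

print("LOCAL_RANK env var:", os.environ.get("LOCAL_RANK"))   # set by torchrun/accelerate
print("WORLD_SIZE env var:", os.environ.get("WORLD_SIZE"))
print("torch.distributed available:", dist.is_available())
print("default process group initialized:",
      dist.is_available() and dist.is_initialized())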
This error happens when using Transformers > 4.28. Starting from 4.29, accelerate is required by transformers, and somehow that makes local_rank no longer -1, which the trainer.py included in MeZO relies on to differentiate between single- and multi-GPU runs. I suggest updating trainer.py to accommodate the new behavior of Transformers 4.29 and newer.
Transformers 4.28 does not give any errors.
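A minimal sketch of the kind of guard that could accommodate this, assuming the goal is simply to skip the barrier when no default process group exists (this is an illustration, not the actual MeZO patch):

import torch.distributed as dist

def barrier_if_distributed():
    # Only synchronize when a default process group actually exists,
    # i.e. when the script was launched with torchrun/accelerate and
    # init_process_group() has been called.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()

Replacing the bare dist.barrier() call in _inner_training_loop with such a guard would avoid the crash on single-GPU runs; pinning Transformers to 4.28, as noted above, is the other workaround.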