
Getting a RuntimeError after training with mezo

Open · sowmaster opened this issue 2 years ago · 6 comments

Hello, thank you for sharing your work! I'm getting the error below after training with the mezo.sh script:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

The problem persists when I use multiple GPUs. Thanks!

sowmaster commented on Jul 01 '23

Hi,

It looks like an error related to multiprocessing. Can you report the pytorch/transformers versions here?

gaotianyu1350 commented on Jul 01 '23

Hello, thanks for the reply. I first used pytorch 1.13 + transformers 4.29.2, then updated to pytorch 2.0.1 + transformers 4.29.2, and the issue persists. If I make the number of eval steps larger than the number of fine-tuning steps, so that evaluation is only done at the end of training, the error disappears, but the model evaluated at the end has poor performance (~50% on the SST2 task, which is random performance).

Also, I forgot to say that the error was encountered in the large-models case. Thanks!

sowmaster commented on Jul 01 '23
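For context, the workaround described above just pushes step-based evaluation past the end of training. A minimal sketch using the standard HuggingFace TrainingArguments fields; whether mezo.sh exposes exactly these names is an assumption.

# Illustrative only: if eval_steps > max_steps, no mid-training evaluation
# ever fires, so the code path that raised the error above is not reached
# until the final evaluation after training.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    max_steps=20_000,             # total fine-tuning steps
    evaluation_strategy="steps",
    eval_steps=100_000,           # larger than max_steps -> no mid-training eval
)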

Hi,

Do you mind posting a full error log? Also, have you tried just using a single GPU?

gaotianyu1350 commented on Jul 02 '23

Hi, here is the full error log:

Traceback (most recent call last):
  File "../mezo/large_models/run.py", line 533, in <module>
    main()
  File "../mezo/large_models/run.py", line 498, in main
    framework.train(train_samples, dev_samples if dev_samples is not None else eval_samples)
  File "../mezo/large_models/run.py", line 433, in train
    trainer.train(resume_from_checkpoint=last_checkpoint) 
  File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "../mezo/large_models/trainer.py", line 660, in _inner_training_loop
    dist.barrier()
  File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3327, in barrier
    default_pg = _get_default_group()
  File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 707, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. 

Yes, I tried both single- and multi-GPU.

sowmaster commented on Jul 04 '23

It's weird because with a single GPU it shouldn't use the distributed training part of torch at all.

This line

  File "../mezo/large_models/trainer.py", line 660, in _inner_training_loop

will only be triggered when args.local_rank != -1, which means you are using multiple GPUs.

Can you make sure you are not using multiple GPUs and not using torchrun, accelerate, or srun?

gaotianyu1350 commented on Jul 04 '23
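One quick way to check this on the failing setup (a diagnostic sketch, not part of MeZO; it only inspects what the environment and torch.distributed report):

import os
import torch.distributed as dist

# Unset or "-1" for a plain single-GPU launch; set to a rank by torchrun/accelerate.
print("LOCAL_RANK env:", os.environ.get("LOCAL_RANK"))
print("dist available:", dist.is_available())
# False means there is no default process group, so dist.barrier() would raise
# exactly the RuntimeError shown in the traceback above.
print("dist initialized:", dist.is_initialized())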

This error happens when using Transformers > 4.28. Starting from 4.29, Accelerate is required by Transformers, and somehow it makes local_rank no longer -1, which is what the trainer.py included in MeZO relies on to differentiate between single- and multi-GPU runs. I suggest updating trainer.py to accommodate the new behavior of Transformers 4.29 or newer.

Transformers 4.28 does not give any errors.

hibagus commented on Feb 06 '24
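One possible way to adapt trainer.py, sketched under the assumption that the problematic call is the dist.barrier() shown in the traceback: gate the barrier on whether a process group actually exists instead of relying on args.local_rank alone.

import torch.distributed as dist

def maybe_barrier():
    # Hypothetical helper, not MeZO's actual code: synchronize workers only
    # when torch.distributed has really been initialized, so a single-GPU run
    # under Transformers >= 4.29 (where Accelerate can change local_rank)
    # does not hit the RuntimeError above.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()

Alternatively, pinning Transformers to 4.28.x, as noted above, sidesteps the Accelerate-driven change in local_rank.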