UniSRec
How to pretrain on multiple GPUs?
Interested in the pretraining process of UniSRec, I followed the instructions in README.md to pretrain on multiple GPUs. All the code was downloaded from this GitHub repository without modification.
I ran the code with:
CUDA_VISIBLE_DEVICES=0,1,2,3 python ddp_pretrain.py
After a long wait, however, it failed with the error:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
How can I deal with it?
Have you solved it?
Sorry for the late reply! I guess it's a version mismatch of PyTorch or something. Could you please share the versions of python, torch, cudatoolkit, and recbole in your environment? That would be really helpful for debugging. Thanks! @louhangyu @cherrylambo
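A quick way to collect them from the same environment is a snippet like the following (just a convenience sketch; it assumes recbole exposes __version__, which recent releases do):

import sys

import torch
import recbole

print('python      ', sys.version.split()[0])
print('torch       ', torch.__version__)
print('cudatoolkit ', torch.version.cuda)        # CUDA version this torch build was compiled against
print('recbole     ', recbole.__version__)
print('visible GPUs', torch.cuda.device_count())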
python==3.9.7, pytorch==1.11.0, cudatoolkit==11.3.1, recbole==1.1.1; the machine is an A100.
Thanks! I'll try to reproduce the bug and get back to you as soon as I can.
Hi, yupeng. I think the RuntimeError: Default process group has not been initialized, please make sure to call init_process_group error is likely caused by the fact that the torch.distributed.init_process_group function is called inside the _build_distribute method, which is in turn called from the __init__ method of the DDPPretrainTrainer class.
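For reference, any call into torch.distributed made before the default process group exists fails with exactly this message. A minimal illustration of the failure mode (not the UniSRec code, just a single-process sketch):

import torch
import torch.distributed as dist

model = torch.nn.Linear(4, 2)
assert not dist.is_initialized()  # no process group has been created yet

# Wrapping the model (or calling dist.get_rank(), dist.barrier(), etc.)
# before dist.init_process_group(...) has run raises:
# RuntimeError: Default process group has not been initialized,
# please make sure to call init_process_group.
ddp_model = torch.nn.parallel.DistributedDataParallel(model)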
Maybe you can try to move the initialization of the process group outside the class and before creating an instance of the DDPPretrainTrainer class, like:
import torch.distributed as dist


def pretrain(rank, world_size, dataset, **kwargs):
    # Initialize the process group outside the class.
    # With the default init_method ("env://"), MASTER_ADDR and MASTER_PORT
    # must be set in the environment of every process.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    ...
    # trainer loading and initialization
    trainer = DDPPretrainTrainer(config, model)
    # model pre-training
    trainer.pretrain(pretrain_data, show_progress=(rank == 0))
    dist.destroy_process_group()
    return config['model'], config['dataset']
By moving the process group initialization outside the class and before creating an instance of the DDPPretrainTrainer class, the updated code ensures that the process group is properly initialized before any distributed operations are performed. I think it may resolve this error.
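In case it is useful, a per-rank pretrain function like the one above is usually launched with one process per GPU. Below is only a hedged sketch of what the entry point could look like; the actual ddp_pretrain.py may organize this differently, and the MASTER_ADDR/MASTER_PORT values and the dataset name are placeholders:

import os

import torch
import torch.multiprocessing as mp


if __name__ == '__main__':
    # One process per visible GPU; mp.spawn passes the process index
    # (the rank) as the first argument to pretrain().
    world_size = torch.cuda.device_count()
    os.environ.setdefault('MASTER_ADDR', 'localhost')   # placeholder rendezvous address
    os.environ.setdefault('MASTER_PORT', '12355')       # placeholder rendezvous port
    mp.spawn(
        pretrain,
        args=(world_size, 'your_pretrain_dataset'),     # placeholder dataset name
        nprocs=world_size,
        join=True,
    )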
That said, I haven't run into this problem myself. I hope it helps.