
How to pretrain on multiple GPUs?

Open cherrylambo opened this issue 1 year ago • 5 comments

Interested in the pretraining process of UniSRec, I followed the instructions in README.md to pretrain on multiple GPUs. All the code was downloaded from this GitHub repository correctly.

I ran the code with:

CUDA_VISIBLE_DEVICES=0,1,2,3 python ddp_pretrain.py

After a long wait, however, it failed with the error:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

How can I deal with it?

cherrylambo avatar Dec 15 '23 02:12 cherrylambo

Have you solved it?

louhangyu avatar Mar 12 '24 03:03 louhangyu

Sorry for the late reply! I guess it's a version mismatch of PyTorch or something similar. Could you please share the versions of python, torch, cudatoolkit, and recbole in your environment? That would be really helpful for debugging. Thanks! @louhangyu @cherrylambo
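
In case it is useful, here is a quick snippet for collecting those versions (a sketch of mine, assuming recbole exposes __version__ as RecBole 1.x does):

import sys
import torch
import recbole

print("python     :", sys.version.split()[0])
print("torch      :", torch.__version__)
print("cudatoolkit:", torch.version.cuda)   # CUDA version torch was built against
print("recbole    :", recbole.__version__)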

hyp1231 avatar Mar 19 '24 07:03 hyp1231

python==3.9.7, pytorch==1.11.0, cudatoolkit==11.3.1, recbole==1.1.1. The machine is an A100.

louhangyu avatar Mar 19 '24 08:03 louhangyu

python==3.9.7, pytorch==1.11.0, cudatoolkit==11.3.1, recbole==1.1.1. The machine is an A100.

Thanks! I'll try to reproduce the bug and get back to you as soon as I can.

hyp1231 avatar Mar 19 '24 08:03 hyp1231

Hi, Yupeng. I think the RuntimeError: Default process group has not been initialized, please make sure to call init_process_group error is likely caused by torch.distributed.init_process_group being called inside the _build_distribute method, which is invoked from the __init__ method of the DDPPretrainTrainer class, so the default process group may not exist yet when earlier distributed operations run.

Maybe you can try moving the initialization of the process group outside the class, before creating an instance of the DDPPretrainTrainer class, like:

import torch.distributed as dist


def pretrain(rank, world_size, dataset, **kwargs):
    # Initialize the process group outside the class
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    ...

    # trainer loading and initialization
    trainer = DDPPretrainTrainer(config, model)

    # model pre-training
    trainer.pretrain(pretrain_data, show_progress=(rank == 0))

    dist.destroy_process_group()

    return config['model'], config['dataset']

Moving the process group initialization outside the class, before the DDPPretrainTrainer instance is created, ensures the group is properly initialized before any distributed operations are performed. I think this may resolve the error.
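
For completeness, here is a minimal launch sketch of my own (not taken from the repo) showing how such a pretrain function could be spawned across the GPUs selected by CUDA_VISIBLE_DEVICES; the master address/port and the dataset name are placeholders:

import os
import torch
import torch.multiprocessing as mp

if __name__ == '__main__':
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")    # any free port
    world_size = torch.cuda.device_count()           # 4 with CUDA_VISIBLE_DEVICES=0,1,2,3
    # mp.spawn passes each process's rank as the first positional argument
    mp.spawn(pretrain, args=(world_size, "FHCKM"),   # placeholder dataset name
             nprocs=world_size, join=True)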

However, I haven't run into this problem myself either. I hope it helps.
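
P.S. A quick sanity check (my own suggestion, not code from the repo) to confirm the group really is initialized right before the trainer is created:

import torch.distributed as dist

# fail fast if init_process_group was never called in this process
assert dist.is_available() and dist.is_initialized(), \
    "call dist.init_process_group(...) before creating DDPPretrainTrainer"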

HeyWeCome avatar Mar 29 '24 03:03 HeyWeCome