RecBole Multi-GPU training is getting stuck in testing phase or throwing EOFError: Ran out of input

Open diesel248 opened this issue 1 year ago • 2 comments

Describe the bug

Distributed training either gets stuck in the testing phase after loading the saved model or throws EOFError: Ran out of input when I run the following command from source:

python run_recbole.py --model=SASRec --loss_type=BPR --dataset=ml-100k --nproc=2 --gpu_id=0,1

Desktop:

OS: Linux
RecBole: 1.1.1
Python: 3.9.13
PyTorch: 1.12.1
cudatoolkit: 11.3.1

Jul 07 '23 16:07 diesel248

Hello @diesel248! I tried the same command as you but could not reproduce the problem. I recommend downloading our latest code from GitHub and following our documentation to try again.

Jul 08 '23 03:07 zhengbw0324

@zhengbw0324 I am facing the same issue. I have a single machine with 4 GPUs and I am using the same command with --nproc=4 --gpu_id='0,1,2,3'. Is there something I am missing? If I use nproc=1, training runs on only 1 GPU.

Nov 08 '23 22:11 christopheralex
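
For anyone debugging the EOFError half of this report: in general, Python's pickle module raises "EOFError: Ran out of input" when it is asked to read an empty file or stream. The thread does not confirm that this is what happens in RecBole, so the following is only a minimal sketch, assuming the error comes from reading a file that has not been fully written yet. The helper names load_pickled and demo_empty_file_error are hypothetical and not part of RecBole or PyTorch.

```python
import os
import pickle
import tempfile


def load_pickled(path):
    """Load a pickled object, failing with a clearer message on an empty file.

    Hypothetical helper for illustration only; not a RecBole function.
    """
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is empty; it may not have been fully written yet")
    with open(path, "rb") as f:
        return pickle.load(f)


def demo_empty_file_error():
    """Return the message pickle raises when asked to read an empty file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        empty_path = tmp.name  # file is created but nothing is written to it
    try:
        with open(empty_path, "rb") as f:
            pickle.load(f)  # raises EOFError on empty input
    except EOFError as exc:
        return str(exc)
    finally:
        os.remove(empty_path)
    return None


if __name__ == "__main__":
    print(demo_empty_file_error())
```

If a check like this reports an empty file in a multi-process run, one thing worth verifying is whether every process waits until the file has been completely written before trying to load it.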
