Multi-GPU training is getting stuck in testing phase or throwing EOFError: Ran out of input

Open diesel248 opened this issue 1 year ago • 2 comments

Describe the bug

Distributed training either gets stuck in the testing phase after loading the saved model, or throws EOFError: Ran out of input, when running the following command from source:

python run_recbole.py --model=SASRec --loss_type=BPR --dataset=ml-100k --nproc=2 --gpu_id=0,1

Desktop (please complete the following information):

  • OS: Linux
  • RecBole: 1.1.1
  • Python: 3.9.13
  • PyTorch: 1.12.1
  • cudatoolkit: 11.3.1
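
A note for anyone debugging this: in plain Python, the message EOFError: Ran out of input is what pickle.load raises when it is given an empty file. Whether that is what happens in this run is an assumption, not something confirmed in this thread (for example, one process might read a checkpoint file before another has finished writing it). The sketch below uses only the standard library to reproduce the message, and adds a hypothetical helper, load_pickle_or_explain, that reports an empty or unreadable file instead of raising. Neither function is part of RecBole.

```python
import os
import pickle
import tempfile


def empty_file_error_message():
    """Unpickle an empty temporary file and return the error text it produces."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name  # file is created empty and closed on leaving the block
    try:
        with open(path, "rb") as f:
            pickle.load(f)
    except EOFError as exc:
        return str(exc)
    finally:
        os.remove(path)
    return None


def load_pickle_or_explain(path):
    """Hypothetical helper (not a RecBole API).

    Returns (obj, None) on success, or (None, reason) if the file is
    empty or cannot be unpickled.
    """
    if os.path.getsize(path) == 0:
        return None, "file is empty (possibly still being written)"
    try:
        with open(path, "rb") as f:
            return pickle.load(f), None
    except (EOFError, pickle.UnpicklingError) as exc:
        return None, f"file could not be unpickled: {exc}"


if __name__ == "__main__":
    print(empty_file_error_message())
```

If the saved model file turns out to be empty or incomplete at the moment the error appears, that would point toward a timing problem between processes rather than a problem with the model itself.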

diesel248, Jul 07 '23 16:07

Hello @diesel248! I ran the same command as you, but I could not reproduce the problem. We recommend downloading our latest code from GitHub and following our documentation to try again.

zhengbw0324, Jul 08 '23 03:07

@zhengbw0324 I am facing the same issue. I have a single machine with 4 GPUs, and I am using the same command with --nproc=4 --gpu_id='0,1,2,3'. Is there something I am missing? If I use nproc=1, training runs on only 1 GPU.

christopheralex, Nov 08 '23 22:11