RecBole
Multi-GPU training is getting stuck in testing phase or throwing EOFError: Ran out of input
Describe the bug
Distributed training either hangs in the testing phase after loading the saved model, or throws EOFError: Ran out of input. To reproduce, run the following command from source:
python run_recbole.py --model=SASRec --loss_type=BPR --dataset=ml-100k --nproc=2 --gpu_id=0,1
Desktop (please complete the following information):
- OS: Linux
- RecBole: 1.1.1
- Python: 3.9.13
- PyTorch: 1.12.1
- cudatoolkit: 11.3.1
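For context on the error text: "EOFError: Ran out of input" is the message Python's pickle module raises when asked to load from an empty file. One possible cause in a multi-process run is a process reading a checkpoint that is still empty or not yet written, but that is an assumption, not something confirmed in this thread. The sketch below only reproduces the message with the standard library; the file names and the try_load helper are hypothetical and are not part of RecBole.

```python
import os
import pickle
import tempfile


def try_load(path):
    """Attempt to unpickle the file at `path`; return (ok, message)."""
    try:
        with open(path, "rb") as f:
            pickle.load(f)
        return True, "loaded"
    except EOFError as exc:
        # pickle raises EOFError when the stream ends before any object is read
        return False, f"EOFError: {exc}"


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a checkpoint that exists but has no content yet
    empty_path = os.path.join(tmp, "empty.pth")
    open(empty_path, "wb").close()
    empty_result = try_load(empty_path)

    # A fully written file loads without error
    full_path = os.path.join(tmp, "full.pth")
    with open(full_path, "wb") as f:
        pickle.dump({"epoch": 1}, f)
    full_result = try_load(full_path)

print(empty_result)  # (False, 'EOFError: Ran out of input')
print(full_result)   # (True, 'loaded')
```

If the saved model file on disk has a size of zero bytes at the moment the hang or error occurs, that would be consistent with this explanation; if it does not, the cause is likely elsewhere.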
Hello @diesel248! I ran the same command as you but could not reproduce the problem. We recommend downloading our latest code from GitHub and following our documentation to try again.
@zhengbw0324 I am facing the same issue. I have a single machine with 4 GPUs, and I am using the same command with --nproc=4 --gpu_id='0,1,2,3'. Is there something I am missing? If I use nproc=1, training runs on only 1 GPU.