transfer-learning-conv-ai

invalid device ordinal

Open gdet opened this issue 5 years ago • 2 comments

Hello,

I followed the steps in your article and installed PyTorch with CUDA like this:

   pip3 install torch torchvision

I have Python 3.7, torch 1.1.0, and Ubuntu 18.04. When I try to run this command:

  python -m torch.distributed.launch --nproc_per_node=8 ./train.py

I get this error:

  WARNING:./train.py:Running process 2
  THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1573049306803/work/torch/csrc/cuda/Module.cpp line=37 error=101 : invalid device ordinal
  Traceback (most recent call last):
    File "./train.py", line 267, in <module>
      train()
    File "./train.py", line 147, in train
      torch.cuda.set_device(args.local_rank)
    File "/home/hatzimin/.conda/envs/maria_env/lib/python3.7/site-packages/torch/cuda/__init__.py", line 300, in set_device
      torch._C._cuda_setDevice(device)

I searched for the error but haven't managed to find a solution. If I run python ./train.py on its own, I get no error.

Thank you

gdet, Jan 09 '20 13:01

How many GPUs do you have on your machine? nproc_per_node needs to equal the number of GPUs on your machine.

sshleifer, Jan 15 '20 18:01
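
The traceback in the question fails at torch.cuda.set_device(args.local_rank), which is consistent with the launcher starting more processes than there are visible GPUs. A minimal sketch of a check that could run before that call is below. The helper names validate_local_rank and visible_gpu_count are hypothetical and not part of train.py; only torch.cuda.device_count() is a real PyTorch call.

```python
def validate_local_rank(local_rank, device_count):
    """Return local_rank if it names a visible GPU, otherwise raise ValueError.

    Hypothetical helper: valid device ordinals are 0 .. device_count - 1.
    """
    if not 0 <= local_rank < device_count:
        raise ValueError(
            f"local_rank {local_rank} is out of range for "
            f"{device_count} visible GPU(s)"
        )
    return local_rank


def visible_gpu_count():
    """Number of GPUs PyTorch can see; 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    # device_count() also returns 0 when CUDA is unavailable.
    return torch.cuda.device_count()
```

With 4 visible GPUs, validate_local_rank(3, 4) passes, while validate_local_rank(4, 4) raises a ValueError that names the rank and the GPU count, which may be easier to act on than "invalid device ordinal".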

I have four. I had changed the number from 8 to 4 but one of them was already used so I got this error. Thank you!

gdet, Jan 20 '20 13:01
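
The thread does not say how the busy GPU was handled. One common approach, sketched here as an assumption, is to hide it with the CUDA_VISIBLE_DEVICES environment variable and start one process per remaining GPU. The helper build_launch and the GPU ids in the usage note are hypothetical; the launch command itself is the one from the question.

```python
import os
import sys


def build_launch(free_gpu_ids, script="./train.py"):
    """Build (env, argv) for launching one process per free GPU.

    Hypothetical helper: free_gpu_ids lists the physical GPU ids to use.
    """
    env = dict(os.environ)
    # Only the listed GPUs are visible to the child processes;
    # they are renumbered 0 .. len(free_gpu_ids) - 1.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in free_gpu_ids)
    argv = [
        sys.executable,
        "-m",
        "torch.distributed.launch",
        f"--nproc_per_node={len(free_gpu_ids)}",
        script,
    ]
    return env, argv
```

For example, if GPU 2 were the busy one, env, argv = build_launch([0, 1, 3]) followed by subprocess.run(argv, env=env, check=True) would start three processes on the other three GPUs.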