cuda gpu device Error

Open parkjh688 opened this issue 4 years ago • 18 comments

Hi.

I have one GPU in my computer, but I got this error. I'm new to PyTorch, so I don't understand what it means.

Traceback (most recent call last):
  File "main.py", line 177, in <module>
    train_logger, train_batch_logger)
  File "/home/eden/Real-time-GesRec/train.py", line 34, in train_epoch
    outputs = model(inputs)
  File "/home/eden/anaconda3/envs/gesrec/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/eden/anaconda3/envs/gesrec/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

parkjh688 avatar Aug 06 '19 05:08 parkjh688

I found the cause of the error.

When I print t.device and self.src_device_obj in torch's data_parallel.py, I get cpu for t.device and cuda:0 for self.src_device_obj.

I guess the model was built for the CPU. Can you tell me how to switch it from CPU to GPU?
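
The same check can be run from user code without editing torch's data_parallel.py. A minimal sketch, assuming an ordinary nn.Module; devices_in_use is a hypothetical helper name (not part of PyTorch or this repo), and the toy model only stands in for the real network:

```python
import torch
from torch import nn


def devices_in_use(model: nn.Module) -> set:
    """Return the device strings that hold the model's parameters and buffers."""
    tensors = list(model.parameters()) + list(model.buffers())
    return {str(t.device) for t in tensors}


# Toy model standing in for the real network (an assumption for illustration).
toy = nn.Sequential(nn.Linear(4, 2), nn.BatchNorm1d(2))
found = devices_in_use(toy)
print(found)
```

A freshly constructed module lives on the CPU until it is moved, so seeing both cpu and cuda:0 in the result would point to a part of the model that was never moved.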

parkjh688 avatar Aug 08 '19 05:08 parkjh688

The models are actually built for the GPU. Which version of torch are you using? Are you sure PyTorch is using the GPU? You can check as described in https://stackoverflow.com/a/48152675/6400484
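
For reference, a short sketch of that kind of check using standard torch.cuda calls. The variable names are only illustrative:

```python
import torch

available = torch.cuda.is_available()  # True when PyTorch can use a CUDA device
count = torch.cuda.device_count()      # number of GPUs visible to PyTorch
print("torch", torch.__version__, "| cuda available:", available, "| devices:", count)

# Only query device names when a GPU is actually visible.
names = [torch.cuda.get_device_name(i) for i in range(count)] if available else []
print(names)
```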

ahmetgunduz avatar Aug 08 '19 21:08 ahmetgunduz

Yes, I have a GPU, and I checked it again using that link.

parkjh688 avatar Aug 09 '19 05:08 parkjh688

It seems to be a PyTorch bug. Please check this solution: https://discuss.pytorch.org/t/bug-in-dataparallel-only-works-if-the-dataset-device-is-cuda-0/28634/18

ahmetgunduz avatar Aug 09 '19 07:08 ahmetgunduz

Hi,

@ahmetgunduz I am facing the same issue too. Is there a solution for this?

I checked that torch can detect the CUDA device (one GPU in my case), and that looks fine. I am using torch version 1.2.

I am using the following config just to try the offline test on Jester.

#!/bin/bash
python offline_test.py \
--root_path ~/ \
--video_path /home/karthik/Desktop/Data/Jester/20bn-jester-v1 \
--annotation_path Desktop/Project/Real-time-GesRec/annotation_Jester/jester.json \
--result_path Desktop/Project/Real-time-GesRec/results \
--resume_path Desktop/Project/Real-time-GesRec/pre-trained-models/jester_resnext_101_RGB_32.pth \
--dataset jester \
--sample_duration 32 \
--learning_rate 0.01 \
--model resnext \
--model_depth 101 \
--batch_size 1 \
--n_classes 27 \
--n_finetune_classes 27 \
--modality RGB \
--n_threads 8 \
--checkpoint 1 \
--train_crop random \
--n_val_samples 1 \
--test_subset val \
--n_epochs 100

@parkjh688 were you able to solve the issue?

Thanks in advance.

Karthik-Bhaskar avatar Aug 27 '19 09:08 Karthik-Bhaskar

@Karthik-Bhaskar just to check, can you please add the --no_cuda parameter as well and see whether it works on the CPU?

ahmetgunduz avatar Aug 27 '19 13:08 ahmetgunduz

Do I need to give the --no_cuda parameter a value such as True or False?

Or should I just include it without a value, like this:

#!/bin/bash
python offline_test.py \
--root_path ~/ \
--video_path /home/karthik/Desktop/Data/Jester/20bn-jester-v1 \
--annotation_path Desktop/Project/Real-time-GesRec/annotation_Jester/jester.json \
--result_path Desktop/Project/Real-time-GesRec/results \
--resume_path Desktop/Project/Real-time-GesRec/pre-trained-models/jester_resnext_101_RGB_32.pth \
--dataset jester \
--sample_duration 32 \
--learning_rate 0.01 \
--model resnext \
--model_depth 101 \
--batch_size 1 \
--n_classes 27 \
--n_finetune_classes 27 \
--modality RGB \
--n_threads 8 \
--checkpoint 1 \
--train_crop random \
--n_val_samples 1 \
--test_subset val \
--n_epochs 100 \
--no_cuda

I tried executing with the above parameters and ran into RuntimeError: Error(s) in loading state_dict for ResNeXt
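
The full error message is not shown here, so the cause is not confirmed. One common cause of this error is a checkpoint that was saved from an nn.DataParallel model, whose keys carry a "module." prefix, being loaded into a model that is not wrapped. A minimal sketch of stripping that prefix, assuming this is what happened; strip_module_prefix is a hypothetical helper, and a plain dict with made-up key names stands in for a real state_dict:

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Plain values stand in for tensors here; the key names are made up.
saved = OrderedDict([("module.conv1.weight", 1), ("module.fc.bias", 2)])
cleaned = strip_module_prefix(saved)
print(list(cleaned))
```

With a real checkpoint, the cleaned dict would be what gets passed to model.load_state_dict.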

Please tell me if it's the wrong way to add that parameter.

Thanks.

Karthik-Bhaskar avatar Aug 27 '19 14:08 Karthik-Bhaskar

Everything looks fine, actually. The way you passed the no_cuda parameter is right. Honestly, I have no clue about the error. It may be due to the torch version: the repo was last updated for PyTorch 1.0.1.post2, so you could try downgrading your PyTorch version.
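
On the flag syntax: a switch that takes no value is typically declared with argparse's store_true action. A minimal sketch, assuming the option is defined that way; the parser below is written for illustration and is not copied from the repo:

```python
import argparse

# Assumed definition for illustration; the repository's real option parser may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--no_cuda", action="store_true", help="run on the CPU")
parser.add_argument("--batch_size", type=int, default=1)

with_flag = parser.parse_args(["--batch_size", "1", "--no_cuda"])
without_flag = parser.parse_args(["--batch_size", "1"])
print(with_flag.no_cuda, without_flag.no_cuda)
```

With a store_true option, passing the flag sets it to True and leaving it out keeps the default of False, so no explicit True or False value is needed on the command line.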

ahmetgunduz avatar Aug 27 '19 20:08 ahmetgunduz

I downgraded PyTorch to 1.0.1.post2, but the issue remains the same. Can you please let me know if I need to use a particular version of any package or library? Currently, I am using Python 3.6 and CUDA 10.

Karthik-Bhaskar avatar Aug 28 '19 09:08 Karthik-Bhaskar

Python 3.7.3 and CUDA 10 are the versions I am currently using.

ahmetgunduz avatar Aug 31 '19 13:08 ahmetgunduz

Dear @parkjh688 and @Karthik-Bhaskar, did you find any solution for this?

ahmetgunduz avatar Sep 12 '19 21:09 ahmetgunduz

@ahmetgunduz Unfortunately not yet. Next week I will try running this code on another machine with different CUDA and cuDNN versions to check whether this is a CUDA problem. My guess is that it is a CUDA version problem.

parkjh688 avatar Sep 14 '19 01:09 parkjh688

@parkjh688 That is great! Looking forward to seeing the outcome...

ahmetgunduz avatar Sep 15 '19 22:09 ahmetgunduz

model, parameters = generate_model(opt)
model = model.cuda()

Add the model = model.cuda() line after the generate_model call, as shown above.
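
Why this can help, as a sketch: calling .cuda() on the outer model moves every parameter and buffer it contains, including a layer that was swapped in after the model was first moved. Whether that is the exact cause here is an assumption; the toy model below only stands in for the network returned by generate_model(opt):

```python
import torch
from torch import nn

# Toy stand-in for the network returned by generate_model(opt) (an assumption).
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

use_cuda = torch.cuda.is_available()
if use_cuda:
    model = nn.DataParallel(model.cuda())
    # A layer swapped in after wrapping starts out on the CPU ...
    model.module[2] = nn.Linear(4, 3)
    # ... so move the whole model again before the forward pass.
    model = model.cuda()

devices = {p.device.type for p in model.parameters()}
print(devices)
```

On a machine without a visible GPU the sketch skips the CUDA branch and everything stays on the CPU.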

xiaomingnio avatar Dec 21 '19 03:12 xiaomingnio

@Karthik-Bhaskar were you able to solve the issue (RuntimeError: Error(s) in loading state_dict for ResNeXt)? Thanks.

MrXuf avatar May 22 '20 02:05 MrXuf

The codebase has been updated. Could you please pull the repo and recheck?

ahmetgunduz avatar May 23 '20 14:05 ahmetgunduz

@MrXuf No, I could not resolve it. Please recheck with the updated codebase, as @ahmetgunduz said above.

Karthik-Bhaskar avatar May 23 '20 14:05 Karthik-Bhaskar

Oh! Thank you for your email. I had the same problem, and it bothered me for a few days. I will recheck with the latest code.

MrXuf avatar May 24 '20 13:05 MrXuf