3D-ResNets-PyTorch

RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generated/../THCReduceAll.cuh:339 terminate called after throwing an instance of 'std::runtime_error' what(): cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generic/THCStorage.c:184

Open vateye opened this issue 6 years ago • 15 comments

dataset loading [0/3570]
dataset loading [1000/3570]
dataset loading [2000/3570]
dataset loading [3000/3570]
dataset loading [0/1530]
dataset loading [1000/1530]
run train at epoch 1
Epoch: [1][1/112] Time 4.807 (4.807) Data 2.836 (2.836) Loss 3.9053 (3.9053) Acc 0.000 (0.000)
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:101: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [26,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generated/../THCReduceAll.cuh line=339 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 137, in <module>
    train_logger, train_batch_logger)
  File "/media/ole/Document/Ubuntu/Research/3D-ResNets-PyTorch/train.py", line 31, in train_epoch
    acc = calculate_accuracy(outputs, targets)
  File "/media/ole/Document/Ubuntu/Research/3D-ResNets-PyTorch/utils.py", line 58, in calculate_accuracy
    n_correct_elems = correct.sum().data[0]
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generated/../THCReduceAll.cuh:339
terminate called after throwing an instance of 'std::runtime_error'
  what(): cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generic/THCStorage.c:184

vateye avatar Mar 26 '18 05:03 vateye

Could you tell me which command you ran when the error occurred?

kenshohara avatar Apr 03 '18 01:04 kenshohara

@vateye please check your n_finetune_classes value; a mismatch there might be what causes the error.
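One way to run that check is to compare the label list in the annotation file against the value passed as --n_finetune_classes. The following is only a sketch: it assumes the annotation JSON has a top-level "labels" list, and check_class_count is a hypothetical helper, not part of the repository.

```python
import json


def check_class_count(annotation_text, n_finetune_classes):
    """Return a list of problems found in the label list; empty means OK.

    Assumption: the annotation JSON has a top-level "labels" list.
    """
    labels = json.loads(annotation_text)["labels"]
    problems = []
    if len(labels) != n_finetune_classes:
        problems.append(
            "annotation lists %d labels but n_finetune_classes is %d"
            % (len(labels), n_finetune_classes))
    if "" in labels:
        problems.append("annotation contains an empty-string label")
    return problems


# Toy annotation with one stray empty label (illustrative data only).
example = json.dumps({"labels": ["", "brush_hair", "cartwheel"], "database": {}})
for problem in check_class_count(example, 2):
    print(problem)
```

For a real file you would read its contents and pass them in as annotation_text.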

hareeshdevarakonda avatar May 31 '18 06:05 hareeshdevarakonda

@vateye for hmdb51, which pretrained model did you use to get 64.7% accuracy?

I am using resnext-101-kinetics-hmdb51_split1.pth, but I am not getting that accuracy.

Please let me know how you did this.

hareeshdevarakonda avatar May 31 '18 13:05 hareeshdevarakonda

I am getting a similar error:

dataset loading [0/3570]
dataset loading [1000/3570]
dataset loading [2000/3570]
dataset loading [3000/3570]
run train at epoch 1
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generated/../THCReduceAll.cuh line=339 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "src/3D-ResNets-PyTorch/main.py", line 152, in <module>
    train_logger, train_batch_logger)
  File "/home/tgillis/D3M/124_157_HMDB51/124_157_HMDB51_solution/src/3D-ResNets-PyTorch/train.py", line 31, in train_epoch
    acc = calculate_accuracy(outputs, targets)
  File "/home/tgillis/D3M/124_157_HMDB51/124_157_HMDB51_solution/src/3D-ResNets-PyTorch/utils.py", line 58, in calculate_accuracy
    n_correct_elems = correct.float().sum().data[0]
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generated/../THCReduceAll.cuh:339
terminate called after throwing an instance of 'std::runtime_error'
  what(): cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCStorage.c:184

This is from running:

python src/3D-ResNets-PyTorch/main.py --root_path src/3D-ResNets-PyTorch \
--video_path data/hmdb51/jpg/ --annotation_path data/hmdb51/hmdb51_2.json \
--result_path data/hmdb51/fold2/results --dataset hmdb51 --model resnet --model_depth 50 \
--resnet_shortcut B --n_classes 400 --n_finetune_classes 51 --pretrain_path models/resnet-50-kinetics.pth \
--ft_begin_index 4 --batch_size 128 --n_threads 4 --checkpoint 5 --no_val --test --test_subset val

I also get a similar error when using hmdb51_3.json:

dataset loading [0/3570]
dataset loading [1000/3570]
dataset loading [2000/3570]
dataset loading [3000/3570]
run train at epoch 1
/opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THCUNN/ClassNLLCriterion.cu:101: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [18,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THCUNN/ClassNLLCriterion.cu:101: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [19,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generated/../THCReduceAll.cuh line=339 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "src/3D-ResNets-PyTorch/main.py", line 152, in <module>
    train_logger, train_batch_logger)
  File "/home/tgillis/D3M/124_157_HMDB51/124_157_HMDB51_solution/src/3D-ResNets-PyTorch/train.py", line 31, in train_epoch
    acc = calculate_accuracy(outputs, targets)
  File "/home/tgillis/D3M/124_157_HMDB51/124_157_HMDB51_solution/src/3D-ResNets-PyTorch/utils.py", line 58, in calculate_accuracy
    n_correct_elems = correct.float().sum().data[0]
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generated/../THCReduceAll.cuh:339
terminate called after throwing an instance of 'std::runtime_error'
  what(): cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCStorage.c:184

However if I change the annotation_path to point to hmdb51_1.json, it trains fine.

tegillis avatar Jul 10 '18 13:07 tegillis

When I run the same commands above without CUDA I get this using hmdb51_2.json and hmdb51_3.json:

dataset loading [0/3570]
dataset loading [1000/3570]
dataset loading [2000/3570]
dataset loading [3000/3570]
run train at epoch 1
Traceback (most recent call last):
  File "src/3D-ResNets-PyTorch/main.py", line 152, in <module>
    train_logger, train_batch_logger)
  File "/home/tgillis/D3M/124_157_HMDB51/124_157_HMDB51_solution/src/3D-ResNets-PyTorch/train.py", line 30, in train_epoch
    loss = criterion(outputs, targets)
  File "/home/tgillis/anaconda3/envs/3D_resnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tgillis/anaconda3/envs/3D_resnet/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 679, in forward
    self.ignore_index, self.reduce)
  File "/home/tgillis/anaconda3/envs/3D_resnet/lib/python3.6/site-packages/torch/nn/functional.py", line 1161, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, size_average, ignore_index, reduce)
  File "/home/tgillis/anaconda3/envs/3D_resnet/lib/python3.6/site-packages/torch/nn/functional.py", line 1052, in nll_loss
    return torch._C._nn.nll_loss(input, target, weight, size_average, ignore_index, reduce)
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed. at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THNN/generic/ClassNLLCriterion.c:87
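Note that the CPU traceback points at the loss call on line 30 of train.py, while the GPU traceback points at the accuracy call on line 31. CUDA kernels run asynchronously, so a device-side assert is often reported at a later, unrelated line. Running on CPU, or setting the CUDA_LAUNCH_BLOCKING=1 environment variable, makes the failing call easier to locate. To see which targets trip the assertion, a range check before the loss can help. This is a rough sketch on plain Python lists; find_bad_targets is a hypothetical helper, and with tensors you could pass targets.tolist().

```python
def find_bad_targets(targets, n_classes):
    """Return (position, value) pairs for targets outside [0, n_classes)."""
    return [(i, t) for i, t in enumerate(targets) if not 0 <= t < n_classes]


# Illustrative batch: 51 is out of range for a 51-class output layer,
# whose valid class indices are 0..50.
batch_targets = [0, 17, 51, 3]
bad = find_bad_targets(batch_targets, 51)
if bad:
    print("out-of-range targets:", bad)
```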

tegillis avatar Jul 10 '18 15:07 tegillis

I figured out the issue: get_labels() in hmdb51_json.py was adding an extra empty-string label, which came from the annotation jsons generated for the previous folds.

Adding the following at line 41, inside get_labels() in hmdb51_json.py, and then regenerating the annotation jsons should fix this:

if name[-4:] == 'json':
    continue
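To illustrate why a leftover json produces an empty label, here is a small sketch of the name-to-label expression on a few file names. The names are assumptions made for illustration (HMDB51 split files of the form <class>_test_split<N>.txt, plus a previously generated hmdb51_1.json in the same directory), and label_from_name is a hypothetical helper.

```python
def label_from_name(name):
    # Drop the last two underscore-separated parts, e.g. "test" and "split1.txt".
    return '_'.join(name.split('_')[:-2])


names = [
    'brush_hair_test_split1.txt',
    'cartwheel_test_split1.txt',
    'hmdb51_1.json',  # left over from generating fold 1
]

# 'hmdb51_1.json' has only two parts, so nothing is left and the label is ''.
labels = sorted(set(label_from_name(n) for n in names))
print(labels)

# Skipping names that end in 'json' removes the empty label.
filtered = sorted(set(label_from_name(n) for n in names if n[-4:] != 'json'))
print(filtered)
```

Because the empty string sorts first, every real class index shifts up by one, and the last class lands outside the range the loss accepts.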

tegillis avatar Jul 10 '18 15:07 tegillis

@tegillis excuse me, I want to train this code on the hmdb51 dataset, but I ran into some problems. Have you trained it?

FelixZhang7 avatar Aug 26 '18 08:08 FelixZhang7

@FelixZhang7 I was able to train it on a subset of the dataset, 600 training videos across 5 classes

tegillis avatar Aug 28 '18 15:08 tegillis

@tegillis could you please tell me how you trained it? I ran:

python main.py --root_path ~/3D-ResNets-PyTorch --video_path data/hmdb51_videos/jpg --annotation_path hmdb51_1.json --result_path results --dataset hmdb51 --model resnet --model_depth 34 --n_classes 51 --batch_size 128 --n_threads 16 --checkpoint 20

but I got 0 accuracy and 0 loss in train.log and train_batch.log.

FelixZhang7 avatar Aug 29 '18 01:08 FelixZhang7

@FelixZhang7 I'm not sure, I fine-tuned one of the pretrained models. This is what I ran:

python src/3D-ResNets-PyTorch/main.py --root_path src/3D-ResNets-PyTorch \
--video_path data/hmdb51/jpg-tmp/ --dataset hmdb51 --model resnet --model_depth 18 \
--resnet_shortcut A --n_classes 400 --n_finetune_classes 7 --pretrain_path models/resnet-18-kinetics.pth \
--ft_begin_index 4 --batch_size 32 --n_threads 8 --checkpoint 5 --test --test_subset val \
--learning_rate 0.001 --weight_decay 0.00005 --n_epochs 20 --no_val \
--annotation_path data/hmdb51/d3m_1.json --result_path data/hmdb51/d3m_results

tegillis avatar Aug 30 '18 20:08 tegillis

@tegillis I want to use this code to train on my own dataset. I arranged my dataset in the hmdb51 format, but the author does not seem to give instructions for training on hmdb51, so I am confused.

FelixZhang7 avatar Aug 31 '18 07:08 FelixZhang7

@FelixZhang7 you're going to have to modify the pre-processing scripts to work with your dataset.

tegillis avatar Sep 19 '18 14:09 tegillis

You can change targets = Variable(targets) to targets = Variable(targets) - 1 in train.py.
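Subtracting 1 shifts every label, so it is only safe when the targets really are 1-based; it also leaves any stray label in the annotation file in place. A guarded version of the shift might look like the sketch below. It works on plain Python lists, and shift_one_based is a hypothetical helper, not code from the repository.

```python
def shift_one_based(targets, n_classes):
    """Map 1-based labels (1..n_classes) to 0-based (0..n_classes-1).

    Raises ValueError instead of silently producing wrong labels when the
    input is not actually 1-based.
    """
    if min(targets) < 1 or max(targets) > n_classes:
        raise ValueError("targets are not in 1..%d" % n_classes)
    return [t - 1 for t in targets]


print(shift_one_based([1, 26, 51], 51))
```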

ghost avatar Jul 30 '19 09:07 ghost

It seems the get_labels function in hmdb51_json produces an empty string as a class.

Changing the function as follows works:

def get_labels(csv_dir_path):
    labels = []
    for file_path in csv_dir_path.iterdir():
        label = '_'.join(file_path.name.split('_')[:-2])
        if label != '':
            labels.append(label)
    return sorted(list(set(labels)))
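As a quick check of this version, the sketch below runs it against a temporary directory. The file names are made up for illustration (two split files and one leftover json), and csv_dir_path is assumed to be a pathlib.Path, since the function calls iterdir() on it.

```python
import tempfile
from pathlib import Path


def get_labels(csv_dir_path):
    labels = []
    for file_path in csv_dir_path.iterdir():
        label = '_'.join(file_path.name.split('_')[:-2])
        # Names with fewer than three parts (e.g. hmdb51_1.json) give ''.
        if label != '':
            labels.append(label)
    return sorted(set(labels))


with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp)
    # Hypothetical directory contents, created empty just for this check.
    for name in ('brush_hair_test_split1.txt',
                 'cartwheel_test_split1.txt',
                 'hmdb51_1.json'):
        (split_dir / name).touch()
    found = get_labels(split_dir)

print(found)
```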

erinchen824 avatar Jul 20 '20 07:07 erinchen824

Thank you very much.

TETEYJDA avatar Jan 19 '22 03:01 TETEYJDA