structure_knowledge_distillation
training fails with "RuntimeError: cuda runtime error (11) : invalid argument at THCGeneral.cpp:405"
RTX 2080 Ti
python 3.7.7 hcff3b4d_5
cuda100 1.0 0 pytorch
pytorch 0.4.1 py37_py36_py35_py27__9.0.176_7.1.2_2 pytorch
torchvision 0.2.1 py_2 pytorch
CUDA Version 10.2.89
cudnn 7.6.4
I have successfully run:
sh run_test.sh
but after trying:
sh run_train_val.sh
I got the error (details below)
I have tried the following tips below but the same error remains.
RuntimeError: cuda runtime error (11) : invalid argument at THCGeneral.cpp:405 #1566
https://github.com/fastai/fastai/issues/1566
conda install -c pytorch cuda100
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=383 error=11 : invalid argument #21154
https://github.com/pytorch/pytorch/issues/21154
Didn't work for me. RTX2080, Cuda 10, Pytorch 1.3. :(
any ideas?
thanks again for all your help
(structure_knowledge_distillation) user@voyager% sh run_train_val.sh
INFO D_att_ckpt_path : ./ckpt/save_path/Att_discriminator
INFO D_ckpt_path : ./ckpt/save_path/Distriminator
INFO D_resume : True
INFO S_ckpt_path : ./ckpt/save_path/Student
INFO S_resume : True
INFO T_ckpt_path : ./ckpt/Teacher/CS_scenes_38413_0.7832174615268139.pth
INFO adv_conv_dim : 64
INFO adv_loss_type : wgan-gp
INFO batch_size : 8
INFO best_mean_IU : 0.0
INFO classes_num : 19
INFO data_dir : ./data/cityscapes
INFO data_list : ./dataset/list/cityscapes/train.lst
INFO data_set : cityscape
INFO device : cuda
INFO epoch_nums : 1
INFO gpu : 0
INFO gpu_num : 1
INFO ho : True
INFO ignore_label : 255
INFO imsize_for_adv : 65
INFO input_size : 512,512
INFO is_student_load_imgnet : True
INFO is_training : False
INFO lambda_d : 0.1
INFO lambda_gp : 10.0
INFO lambda_pa : 0.5
INFO lambda_pi : 10.0
INFO last_step : 0
INFO log_path : ./ckpt/log/save_path
INFO lr_d : 0.0004
INFO lr_g : 0.01
INFO momentum : 0.9
INFO num_steps : 40000
INFO pa : True
INFO parallel : False
INFO pi : True
INFO pool_scale : 0.5
INFO power : 0.9
INFO preprocess_GAN_mode : 1
INFO random_mirror : True
INFO random_scale : True
INFO recurrence : 1
INFO save_name : save_path
INFO snapshot_dir : ./snapshots/
INFO start_epoch : 0
INFO student_pretrain_model_imgnet : ./dataset/resnet18-imagenet.pth
INFO weight_decay : 0.0005
321300 images are loaded!
500 images are loaded!
INFO ------------
INFO => load./dataset/resnet18-imagenet.pth
INFO ------------
INFO student_model: Number of params: 13.07M
INFO ------------
INFO => no teacher ckpt find
INFO ------------
INFO teacher_model: Number of params: 70.44M
INFO ------------
INFO => checkpoint './ckpt/save_path/Distriminator/model_best.pth.tar' does not exit
INFO ------------
INFO D_model: Number of params: 3.20M
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
File "train_and_eval.py", line 25, in <module>
model.optimize_parameters()
File "/home/user/work/projects/structure_knowledge_distillation/networks/kd_model.py", line 168, in optimize_parameters
self.forward()
File "/home/user/work/projects/structure_knowledge_distillation/networks/kd_model.py", line 122, in forward
self.preds_T = self.parallel_teacher.eval()(self.images, parallel=args.parallel)
File "/home/user/anaconda3/envs/structure_knowledge_distillation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/work/projects/structure_knowledge_distillation/utils/parallel.py", line 106, in forward
return super().forward(inputs, **kwargs)
File "/home/user/anaconda3/envs/structure_knowledge_distillation/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/user/anaconda3/envs/structure_knowledge_distillation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/work/projects/structure_knowledge_distillation/networks/pspnet_combine.py", line 177, in forward
x = self.relu1(self.bn1(self.conv1(x)))
File "/home/user/anaconda3/envs/structure_knowledge_distillation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/anaconda3/envs/structure_knowledge_distillation/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp:663
(structure_knowledge_distillation) user@voyager%
torch 0.4.1 does not match CUDA 10 on the RTX 2080. You need to either upgrade torch or downgrade CUDA to 9.0, although CUDA 9.0 may fail on the RTX 2080.
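For anyone hitting the same error, here is a rough sketch of how the mismatch could be checked before training. It compares the CUDA version the torch binary was built with against the minimum CUDA release that supports the GPU's compute capability. The table and the helper names (`MIN_CUDA_FOR_CAPABILITY`, `parse_version`, `build_supports_gpu`, `report`) are my own and not part of this repo; the table only lists a few architectures. My reading of the build string in the environment above (`9.0.176_7.1.2`) is that the installed pytorch 0.4.1 was built against CUDA 9.0.176, which is an assumption on my side.

```python
# Hypothetical helper, not part of structure_knowledge_distillation.
# Minimum CUDA toolkit release that can target each compute capability.
# Only a few architectures are listed here; extend as needed.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 1): (8, 0),   # Pascal, e.g. GTX 1080 Ti
    (7, 0): (9, 0),   # Volta, e.g. V100
    (7, 5): (10, 0),  # Turing, e.g. RTX 2080 / RTX 2080 Ti
}


def parse_version(text):
    """Turn a version string such as '9.0.176' into (major, minor)."""
    parts = text.split(".")
    return int(parts[0]), int(parts[1])


def build_supports_gpu(build_cuda, capability):
    """Return True if a binary built with `build_cuda` can target `capability`."""
    required = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if required is None:
        raise KeyError("compute capability %r is not in the table" % (capability,))
    return parse_version(build_cuda) >= required


def report():
    """Describe the local setup; needs torch and a visible GPU to say anything useful."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "torch cannot see a CUDA device"
    build_cuda = torch.version.cuda  # CUDA version torch was compiled with
    if build_cuda is None:
        return "this torch build has no CUDA support"
    capability = torch.cuda.get_device_capability(0)
    try:
        ok = build_supports_gpu(build_cuda, capability)
    except KeyError:
        return "compute capability %r is not in the table" % (capability,)
    verdict = "supports" if ok else "does NOT support"
    return "torch built with CUDA %s %s compute capability %d.%d" % (
        build_cuda, verdict, capability[0], capability[1])


if __name__ == "__main__":
    print(report())
```

With the versions from this thread, `build_supports_gpu("9.0.176", (7, 5))` is false, which is consistent with the reply above: the conda `cuda100` package and the system CUDA 10.2 do not change what the torch 0.4.1 binary itself was compiled against.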
Hi, I just got exactly the same issue here. However, I am using the cuda-9.0-pytorch-0.4.1 docker with python=3.5 (following the instructions). Do you have any idea about that?