TransFG
How to fix the RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx`
Thanks for your work and for sharing your code!
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port 89898 train.py --dataset CUB_200_2011 --split overlap --num_steps 10000 --fp16 --name sample_run
When I train on two GPUs (2 × 1080 Ti) with the command above, the error below occurs. My configuration is CUDA 11.1, PyTorch 1.8.1, torchvision 0.9.1, Python 3.8.3.
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
Training (X / X Steps) (loss=X.X): 0%|| 0/749 [00:00<?, ?it/s]Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
Training (X / X Steps) (loss=X.X): 0%|| 0/749 [00:42<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 400, in <module>
main()
File "train.py", line 397, in main
train(args, model)
File "train.py", line 226, in train
loss, logits = model(x, y)
File "/home/lirunze/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lirunze/anaconda3/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 560, in forward
result = self.module(*inputs, **kwargs)
File "/home/lirunze/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lirunze/anaconda3/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/amp/_initialize.py", line 196, in new_fwd
output = old_fwd(*applier(args, input_caster),
File "/home/lirunze/xh/project/git/trans-fg_-i2-t/models/modeling.py", line 305, in forward
part_logits = self.part_head(part_tokens[:, 0])
File "/home/lirunze/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lirunze/anaconda3/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File "/home/lirunze/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
Could you help analyze this problem? Thank you!
How did you solve this problem?
The PyTorch version is too new. Please use PyTorch 1.7.1 or 1.5.1, as specified by the author.
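A quick way to check whether an environment matches the versions mentioned in this reply is sketched below. It is only an illustration, not part of the TransFG repository: the helper names `parse_version` and `is_known_good` and the `KNOWN_GOOD` set are hypothetical, and the set contains only the two versions named in this thread.

```python
import re

# Versions named in this thread as working; an assumption, not an official list.
KNOWN_GOOD = {(1, 5, 1), (1, 7, 1)}


def parse_version(version: str) -> tuple:
    """Turn a version string such as '1.8.1+cu111' into (1, 8, 1)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_known_good(version: str) -> bool:
    """Return True if the version is one of those named in this thread."""
    return parse_version(version) in KNOWN_GOOD


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        status = "matches" if is_known_good(torch.__version__) else "does not match"
        print(f"PyTorch {torch.__version__} {status} the versions named in this thread.")
```

With the configuration reported above (PyTorch 1.8.1), `is_known_good` returns `False`, which agrees with the suggestion to downgrade.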
Your email has been received~