s2anet

Hello, I would like to ask about gradient checking for deformable convolution

Open HsLOL opened this issue 4 years ago • 1 comments

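Hello, I wrote a gradient-check script for deformable convolution, modeled on the gradcheck.py script in roi pool. The forward pass of the deformable convolution produces output, but when execution reaches gradcheck(), it reports that 40000.00 GiB must be allocated on the GPU and then fails with an out-of-memory error. My code is below. What causes this? Is it a problem with my code, or with the source code?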

import os.path as osp
import torch
from torch.autograd import gradcheck

from DCN_cuda_extension.deform_conv import DeformConv

device = 3
offset = torch.ones(2, 72, 80, 80, requires_grad=True).cuda(device=device)
cls_feat = torch.ones(2, 256, 80, 80, requires_grad=True).cuda(device=device)

############ deform conv output
kernel_size = 3
dcn = DeformConv(in_channels=256,
                 out_channels=256,
                 kernel_size=kernel_size,
                 padding=(kernel_size - 1) // 2,
                 deformable_groups=4).cuda(device=device) # put DeformConv module on GPU Device
aligned_feature = dcn(cls_feat, offset)
print(f'dcn output: {aligned_feature.shape}')
inputs = (cls_feat, offset)
########### Gradcheck
test = gradcheck(func=dcn,
                 inputs=inputs,
                 eps=1e-5,
                 atol=1e-3)
print(test)

The error is as follows:

dcn output: torch.Size([2, 256, 80, 80])
/home/***/anaconda3/envs/apex/lib/python3.7/site-packages/torch/autograd/gradcheck.py:302: UserWarning: The {}th input requires gradient and is not a double precision floating point or complex. This check will likely fail if all the inputs are not of double precision floating point or complex. 
  'The {}th input requires gradient and '
Traceback (most recent call last):
  File "/home/***/Templates/op/Deform_conv_cuda/gradcheck.py", line 38, in <module>
    atol=1e-3)
  File "/home/***/anaconda3/envs/apex/lib/python3.7/site-packages/torch/autograd/gradcheck.py", line 345, in gradcheck
    nondet_tol=nondet_tol)
  File "/home/***/anaconda3/envs/apex/lib/python3.7/site-packages/torch/autograd/gradcheck.py", line 173, in get_analytical_jacobian
    jacobian = make_jacobian(input, output.numel())
  File "/home/***/anaconda3/envs/apex/lib/python3.7/site-packages/torch/autograd/gradcheck.py", line 29, in make_jacobian
    lambda x: x is not None, (make_jacobian(elem, num_out) for elem in input)))
  File "/home/***/anaconda3/envs/apex/lib/python3.7/site-packages/torch/autograd/gradcheck.py", line 29, in <genexpr>
    lambda x: x is not None, (make_jacobian(elem, num_out) for elem in input)))
  File "/home/***/anaconda3/envs/apex/lib/python3.7/site-packages/torch/autograd/gradcheck.py", line 26, in make_jacobian
    return input.new_zeros((input.nelement(), num_out), dtype=input.dtype, layout=torch.strided)
RuntimeError: CUDA out of memory. Tried to allocate 40000.00 GiB (GPU 3; 23.65 GiB total capacity; 43.27 MiB already allocated; 22.64 GiB free; 48.00 MiB reserved in total by PyTorch)

Process finished with exit code 1

HsLOL avatar Oct 23 '21 13:10 HsLOL

I suggest reducing the sizes of offset and feat. Reference: https://github.com/CharlesShang/DCNv2/blob/803ff20f52ea655f2fb903e8a786139c1726b104/test.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L67
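The 40000.00 GiB figure can be reproduced from the shapes in the script. The last frame of the traceback shows `make_jacobian` allocating a dense tensor of shape `(input.nelement(), num_out)` in the input's dtype, so the cost grows with the product of the input and output element counts. The sketch below is plain arithmetic, not part of PyTorch or this repository. The helper name `jacobian_gib` and the small `(1, 4, 5, 5)` shape are illustrative assumptions. The 8-byte case assumes double-precision inputs, which the UserWarning in the log recommends.

```python
from math import prod  # Python 3.8+


def jacobian_gib(input_shape, output_shape, bytes_per_elem=4):
    """Size in GiB of a dense (input.numel() x output.numel()) Jacobian.

    Hypothetical helper for illustration; bytes_per_elem is 4 for float32
    and 8 for float64.
    """
    return prod(input_shape) * prod(output_shape) * bytes_per_elem / 2**30


# Shapes taken from the script in the question (float32 tensors).
out_shape = (2, 256, 80, 80)
print(jacobian_gib((2, 256, 80, 80), out_shape))  # cls_feat -> 40000.0
print(jacobian_gib((2, 72, 80, 80), out_shape))   # offset   -> 11250.0

# An assumed, much smaller test shape in float64 stays far below 1 MiB.
small = jacobian_gib((1, 4, 5, 5), (1, 4, 5, 5), bytes_per_elem=8)
print(small < 1e-3)  # True
```

The first value matches the allocation in the error message, which suggests the failure comes from the test tensor sizes rather than from the deformable convolution kernel itself.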

csuhan avatar Oct 25 '21 17:10 csuhan