mmgeneration: Errors encountered when fine-tuning models (RuntimeError: No such operator aten::cudnn_convolution_backward_weight)

Checklist

I have searched related issues but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug

I encountered errors while fine-tuning a StyleGAN2 model.

Reproduction

What command or script did you run?

I have tried both dist_train.sh and train.py, and neither works. (stylegan2_c2_ffhq_256_43.py is a config I modified, and data/43 is where I put my training images.)

dist_train.sh:

bash tools/dist_train.sh configs/styleganv2/stylegan2_c2_ffhq_256_43.py 1 --work-dir work_dirs/stylegan2_c2_ffhq_256_43

train.py:

python tools/train.py configs/styleganv2/stylegan2_c2_ffhq_256_43.py --work-dir work_dirs/stylegan2_c2_ffhq_256_43

Did you make any modifications on the code or config? Did you understand what you have modified?

I based my config on "configs/styleganv2/stylegan2_c2_ffhq_256_b4x8_800k.py" and changed the following:

train_pipeline and val_pipeline: added dict(type='Resize', keys=['real_img'], scale=(256, 256)) to resize the images
imgs_root of train and val: set to './data/43', where I put my training images
total_iters: set to 3000
num_images in metrics and evaluation: set to 300, the number of my training images
load_from: set to 'checkpoints/stylegan2_c2_ffhq_256_b4x8_20210407_160709-7890ae1f.pth', where the checkpoint is located
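For readers who want to picture the changes listed above, here is a rough sketch of how they might look as config entries. Only the values come from the post. The nesting of `data`, the `fid50k` metric entry, and the `resize_step` name are assumptions for illustration and may not match the real base config.

```python
# Sketch of the overrides described in the post. Values are from the post;
# the surrounding structure is an assumption.

# Extra step added to both train_pipeline and val_pipeline.
# `resize_step` is a hypothetical name used only in this sketch.
resize_step = dict(type='Resize', keys=['real_img'], scale=(256, 256))

# Dataset location for train and val (nesting under `data` is assumed).
data = dict(
    train=dict(dataset=dict(imgs_root='./data/43')),
    val=dict(imgs_root='./data/43'),
)

# Shortened training schedule.
total_iters = 3000

# 300 = number of training images. The metric key and type are placeholders.
metrics = dict(fid50k=dict(type='FID', num_images=300))

# Pretrained checkpoint to fine-tune from.
load_from = 'checkpoints/stylegan2_c2_ffhq_256_b4x8_20210407_160709-7890ae1f.pth'
```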

What dataset did you use?

My own training images, which I put in 'data/43'.

Environment

Please run python mmgen/utils/collect_env.py to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.7.13 (default, Apr 24 2022, 01:04:09) [GCC 7.5.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GPU 0: Tesla T4
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.11.0+cu113
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.12.0+cu113
OpenCV: 4.1.2
MMCV: 1.5.3
MMGen: 0.7.1+8bea27b
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.1

You may add additional information that may be helpful for locating the problem.

I have tried all of this both in Colab and on my local machine, and neither works.

Error traceback

If I run dist_train.sh:

Traceback (most recent call last):
  File "tools/train.py", line 228, in <module>
    main()
  File "tools/train.py", line 224, in main
    meta=meta)
  File "/content/mmgeneration/mmgen/apis/train.py", line 207, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/content/mmgeneration/mmgen/core/runners/dynamic_iterbased_runner.py", line 285, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/content/mmgeneration/mmgen/core/runners/dynamic_iterbased_runner.py", line 215, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/parallel/distributed.py", line 59, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/content/mmgeneration/mmgen/models/gans/static_unconditional_gan.py", line 262, in train_step
    loss_gen, log_vars_g = self._get_gen_loss(data_dict_)
  File "/content/mmgeneration/mmgen/models/gans/base_gan.py", line 85, in _get_gen_loss
    loss_ = loss_module(outputs_dict)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/mmgeneration/mmgen/models/losses/gen_auxiliary_loss.py", line 261, in forward
    **kwargs)
  File "/content/mmgeneration/mmgen/models/losses/gen_auxiliary_loss.py", line 104, in gen_path_regularizer
    only_inputs=True)[0]
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 277, in grad
    allow_unused, accumulate_grad=False)  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/content/mmgeneration/mmgen/ops/conv2d_gradfix.py", line 207, in backward
    grad_weight = Conv2dGradWeight.apply(grad_output, input)
  File "/content/mmgeneration/mmgen/ops/conv2d_gradfix.py", line 250, in forward
    return torch._C._jit_get_operation(name)(weight_shape, grad_output,
RuntimeError: No such operator aten::cudnn_convolution_backward_weight
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1822) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/train.py FAILED

If I run train.py:

Traceback (most recent call last):
  File "tools/train.py", line 228, in <module>
    main()
  File "tools/train.py", line 224, in main
    meta=meta)
  File "/content/mmgeneration/mmgen/apis/train.py", line 207, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/content/mmgeneration/mmgen/core/runners/dynamic_iterbased_runner.py", line 285, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/content/mmgeneration/mmgen/core/runners/dynamic_iterbased_runner.py", line 207, in train
    kwargs.update(dict(ddp_reducer=self.model.reducer))
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1186, in __getattr__
    type(self).__name__, name))
AttributeError: 'MMDataParallel' object has no attribute 'reducer'

Jun 21 '22 06:06 oschan77

Anyone got a solution for this? Thanks.

Jun 27 '22 06:06 oschan77

Hi @oschan77, I'll reply after reproducing your problem; this issue may be helpful. Also, you are not supposed to run train.py directly, since DP (DataParallel) is not supported in MMGen.

Jun 27 '22 06:06 plyfager

Thank you @plyfager, but running dist_train.sh also does not work for me.

Jun 27 '22 06:06 oschan77

Hi @plyfager, I am still suffering from this issue. Any idea how to solve it? Thanks!

Jul 08 '22 02:07 oschan77

I'm also suffering from this problem. Any ideas?

Aug 16 '22 11:08 asafberreby

@oschan77 It seems that your system CUDA version (11.1) and your PyTorch CUDA version (11.3) do not match. @asafberreby Please run python mmgen/utils/collect_env.py to collect the necessary environment information.
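The comparison described here can be read straight off a collect_env.py report. The sketch below is a hypothetical helper, not part of MMGen. It assumes the report contains a `PyTorch: <version>+cuXYZ` line and an `NVCC: Build cuda_X.Y...` line, as in the reports pasted in this thread, and returns None for a field it cannot find (for example, a PyTorch version with no `+cu` suffix).

```python
import re


def cuda_versions(env_text):
    """Return (pytorch_cuda, system_cuda) as strings such as '11.3'.

    Either value is None when the matching line is missing from the report.
    """
    torch_match = re.search(r"^PyTorch: \S+\+cu(\d+)$", env_text, re.MULTILINE)
    nvcc_match = re.search(r"^NVCC: .*cuda_(\d+\.\d+)", env_text, re.MULTILINE)

    torch_cuda = None
    if torch_match:
        digits = torch_match.group(1)               # e.g. '113'
        torch_cuda = f"{digits[:-1]}.{digits[-1]}"  # e.g. '11.3'

    system_cuda = nvcc_match.group(1) if nvcc_match else None
    return torch_cuda, system_cuda


# Two lines taken from the first environment report in this thread.
report = (
    "NVCC: Build cuda_11.1.TC455_06.29190527_0\n"
    "PyTorch: 1.11.0+cu113\n"
)
torch_cuda, system_cuda = cuda_versions(report)
print(torch_cuda, system_cuda, torch_cuda == system_cuda)  # 11.3 11.1 False
```

Matching versions only rule out this one mismatch; they do not show that the rest of the environment is compatible.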

Aug 16 '22 11:08 LeoXing1996

This is what I get when running mmgen/utils/collect_env.py:

sys.platform: linux
Python: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda-11.5
NVCC: Build cuda_11.5.r11.5/compiler.30411180_0
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA GeForce GT 1030
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.11.0+cu115
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.5
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.3
    - Built with CuDNN 8.3.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.5, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.12.0+cu115
OpenCV: 4.5.4-dev
MMCV: 1.5.0
MMGen: 0.7.1+3542102
MMCV Compiler: GCC 9.4
MMCV CUDA Compiler: 11.5

Aug 16 '22 12:08 asafberreby

Update: I downgraded some packages, and now it works. New env:

PyTorch: 1.10.2+cu111
TorchVision: 0.11.3+cu111
OpenCV: 4.6.0
MMCV: 1.5.3
MMGen: 0.7.1+3542102
MMCV Compiler: GCC 9.4
MMCV CUDA Compiler: 11.1
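The environments posted in this thread show a pattern: every failing report uses PyTorch 1.11.0, and the one working setup uses 1.10.2. That fits the `aten::cudnn_convolution_backward_weight` operator being unavailable from 1.11 onward, though nobody in the thread confirms it. The sketch below is a hypothetical pre-flight check built on that observation. The function names and the 1.11.0 threshold are assumptions inferred from these reports, not an official compatibility rule.

```python
def parse_release(version):
    """Turn '1.11.0+cu113' into (1, 11, 0), ignoring the local build suffix.

    Only plain numeric releases are handled; pre-release tags are not.
    """
    release = version.split("+", 1)[0]
    return tuple(int(part) for part in release.split(".")[:3])


def reported_as_failing(torch_version):
    """True if the version is at or above the first failing one in this thread."""
    # Threshold inferred from the thread: 1.11.0 failed, 1.10.2 worked.
    return parse_release(torch_version) >= (1, 11, 0)


for version in ("1.11.0+cu113", "1.11.0+cu115", "1.11.0", "1.10.2+cu111"):
    print(version, reported_as_failing(version))
```

With the versions reported above, only `1.10.2+cu111` comes back as False.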

I have the same issue.

...
  File "/mmgeneration/mmgen/ops/conv2d_gradfix.py", line 207, in backward
    grad_weight = Conv2dGradWeight.apply(grad_output, input)
  File "/mmgeneration/mmgen/ops/conv2d_gradfix.py", line 250, in forward
    return torch._C._jit_get_operation(name)(weight_shape, grad_output,
RuntimeError: No such operator aten::cudnn_convolution_backward_weight
...

My env:

sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.6.r11.6/compiler.31057947_0
GPU 0: NVIDIA GeForce RTX 3090
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.12.0
OpenCV: 4.6.0
MMCV: 1.5.3
MMGen: 0.7.1+3542102
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3

Aug 18 '22 13:08 user41pp
