mmgeneration
Errors encountered when fine-tuning models (RuntimeError: No such operator aten::cudnn_convolution_backward_weight)
Checklist
- I have searched related issues but cannot get the expected help.
- I have read the FAQ documentation but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
I encountered errors while fine-tuning a StyleGAN2 model.
Reproduction
- What command or script did you run?
I have tried both dist_train.sh and train.py, and neither works. (stylegan2_c2_ffhq_256_43.py is a config I modified, and data/43 is where I put my training images.)
dist_train.sh:
bash tools/dist_train.sh configs/styleganv2/stylegan2_c2_ffhq_256_43.py 1 --work-dir work_dirs/stylegan2_c2_ffhq_256_43
train.py:
python tools/train.py configs/styleganv2/stylegan2_c2_ffhq_256_43.py --work-dir work_dirs/stylegan2_c2_ffhq_256_43
- Did you make any modifications on the code or config? Did you understand what you have modified?
I modified my config based on "configs/styleganv2/stylegan2_c2_ffhq_256_b4x8_800k.py". I changed the following things:
- train_pipeline and val_pipeline: I added dict(type='Resize', keys=['real_img'], scale=(256, 256)) to resize the images
- imgs_root of train and val: set to './data/43', where I put my training images
- total_iters: set to 3000
- num_images in metrics and evaluation: set to 300, the number of my training images
- load_from: set to 'checkpoints/stylegan2_c2_ffhq_256_b4x8_20210407_160709-7890ae1f.pth', where the checkpoint is located
- What dataset did you use? My own training images, which I put in 'data/43'.
Environment
- Please run python mmgen/utils/collect_env.py to collect necessary environment information and paste it here.
sys.platform: linux
Python: 3.7.13 (default, Apr 24 2022, 01:04:09) [GCC 7.5.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GPU 0: Tesla T4
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.11.0+cu113
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.12.0+cu113
OpenCV: 4.1.2
MMCV: 1.5.3
MMGen: 0.7.1+8bea27b
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.1
- You may add additional information that may be helpful for locating the problem.
I have tried all of this both in Colab and on my local machine; neither works.
Error traceback
If I run dist_train.sh:
Traceback (most recent call last):
File "tools/train.py", line 228, in <module>
main()
File "tools/train.py", line 224, in main
meta=meta)
File "/content/mmgeneration/mmgen/apis/train.py", line 207, in train_model
runner.run(data_loaders, cfg.workflow, cfg.total_iters)
File "/content/mmgeneration/mmgen/core/runners/dynamic_iterbased_runner.py", line 285, in run
iter_runner(iter_loaders[i], **kwargs)
File "/content/mmgeneration/mmgen/core/runners/dynamic_iterbased_runner.py", line 215, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/parallel/distributed.py", line 59, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/content/mmgeneration/mmgen/models/gans/static_unconditional_gan.py", line 262, in train_step
loss_gen, log_vars_g = self._get_gen_loss(data_dict_)
File "/content/mmgeneration/mmgen/models/gans/base_gan.py", line 85, in _get_gen_loss
loss_ = loss_module(outputs_dict)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/content/mmgeneration/mmgen/models/losses/gen_auxiliary_loss.py", line 261, in forward
**kwargs)
File "/content/mmgeneration/mmgen/models/losses/gen_auxiliary_loss.py", line 104, in gen_path_regularizer
only_inputs=True)[0]
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 277, in grad
allow_unused, accumulate_grad=False) # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/content/mmgeneration/mmgen/ops/conv2d_gradfix.py", line 207, in backward
grad_weight = Conv2dGradWeight.apply(grad_output, input)
File "/content/mmgeneration/mmgen/ops/conv2d_gradfix.py", line 250, in forward
return torch._C._jit_get_operation(name)(weight_shape, grad_output,
RuntimeError: No such operator aten::cudnn_convolution_backward_weight
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1822) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
If I run train.py:
Traceback (most recent call last):
File "tools/train.py", line 228, in <module>
main()
File "tools/train.py", line 224, in main
meta=meta)
File "/content/mmgeneration/mmgen/apis/train.py", line 207, in train_model
runner.run(data_loaders, cfg.workflow, cfg.total_iters)
File "/content/mmgeneration/mmgen/core/runners/dynamic_iterbased_runner.py", line 285, in run
iter_runner(iter_loaders[i], **kwargs)
File "/content/mmgeneration/mmgen/core/runners/dynamic_iterbased_runner.py", line 207, in train
kwargs.update(dict(ddp_reducer=self.model.reducer))
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1186, in __getattr__
type(self).__name__, name))
AttributeError: 'MMDataParallel' object has no attribute 'reducer'
Anyone got a solution for this? Thanks.
Hi @oschan77, I'll reply to you after reproducing your problem; this issue may be helpful. Also, you are not supposed to run train.py directly, since DP (DataParallel) is not supported in MMGen.
Thank you @plyfager, but running dist_train.sh also does not work for me.
Hi @plyfager, I am still suffering from this issue. Any idea how to solve this? Thanks!
I'm also suffering from this problem. Any ideas?
@oschan77 It seems that your system CUDA version (11.1) and PyTorch CUDA version (11.3) do not match.
@asafberreby Please run python mmgen/utils/collect_env.py to collect necessary environment information.
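For anyone who wants to check the CUDA versions mentioned above without reading the whole dump by eye, here is a minimal sketch that pulls them out of pasted collect_env.py output. The helper names (cuda_versions, cuda_mismatch) and the SAMPLE string are hypothetical and only for illustration; the regular expressions assume the line formats shown in this thread ("PyTorch: ...+cuNNN", "NVCC: Build cuda_X.Y...", "MMCV CUDA Compiler: X.Y").

```python
import re

# Three lines copied from the first environment dump in this thread.
SAMPLE = """PyTorch: 1.11.0+cu113
NVCC: Build cuda_11.1.TC455_06.29190527_0
MMCV CUDA Compiler: 11.1"""


def cuda_versions(env_text):
    """Collect the CUDA versions reported in collect_env.py output.

    Returns a dict with any of the keys 'pytorch', 'nvcc', 'mmcv'.
    A key is simply absent when its line is missing or has no CUDA tag.
    """
    found = {}
    m = re.search(r"^PyTorch:\s*\S+\+cu(\d+)", env_text, re.MULTILINE)
    if m:
        digits = m.group(1)  # e.g. "113" means CUDA 11.3
        found["pytorch"] = f"{digits[:-1]}.{digits[-1]}"
    m = re.search(r"^NVCC:.*?cuda_(\d+\.\d+)", env_text, re.MULTILINE)
    if m:
        found["nvcc"] = m.group(1)
    m = re.search(r"^MMCV CUDA Compiler:\s*(\d+\.\d+)", env_text, re.MULTILINE)
    if m:
        found["mmcv"] = m.group(1)
    return found


def cuda_mismatch(env_text):
    """True when the reported CUDA versions are not all the same."""
    return len(set(cuda_versions(env_text).values())) > 1


if __name__ == "__main__":
    print(cuda_versions(SAMPLE))
    print("mismatch:", cuda_mismatch(SAMPLE))
```

A mismatch flagged this way is only a hint, not a diagnosis: an environment can report one consistent CUDA version everywhere and still hit the same RuntimeError.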
This is what I'm getting when running mmgen/utils/collect_env.py:
sys.platform: linux
Python: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda-11.5
NVCC: Build cuda_11.5.r11.5/compiler.30411180_0
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA GeForce GT 1030
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.11.0+cu115
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.5
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.3.3
- Built with CuDNN 8.3.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.5, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.12.0+cu115
OpenCV: 4.5.4-dev
MMCV: 1.5.0
MMGen: 0.7.1+3542102
MMCV Compiler: GCC 9.4
MMCV CUDA Compiler: 11.5
Update: I downgraded some packages, and now it works. New env:
PyTorch: 1.10.2+cu111
TorchVision: 0.11.3+cu111
OpenCV: 4.6.0
MMCV: 1.5.3
MMGen: 0.7.1+3542102
MMCV Compiler: GCC 9.4
MMCV CUDA Compiler: 11.1
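The pattern in this thread is that every failing environment reports PyTorch 1.11.0, while the working one reports 1.10.2. A minimal sketch of a check based on that observation follows. The function names (torch_release, likely_affected) are hypothetical, and the 1.11 threshold is an assumption inferred from the reports here for MMGen 0.7.1, not something confirmed by the maintainers.

```python
import re


def torch_release(version):
    """Return (major, minor) from a version string such as '1.11.0+cu113'."""
    m = re.match(r"(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(m.group(1)), int(m.group(2))


def likely_affected(version):
    """Guess whether this PyTorch version hits the missing-operator error.

    Assumption drawn from this thread only: 1.11.0 fails, 1.10.2 works.
    """
    return torch_release(version) >= (1, 11)


if __name__ == "__main__":
    for v in ("1.11.0+cu113", "1.10.2+cu111"):
        print(v, "->", "likely affected" if likely_affected(v) else "reported working")
```

In a real environment the string would come from torch.__version__, e.g. likely_affected(torch.__version__) after import torch.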
I have the same issue.
...
File "/mmgeneration/mmgen/ops/conv2d_gradfix.py", line 207, in backward
grad_weight = Conv2dGradWeight.apply(grad_output, input)
File "/mmgeneration/mmgen/ops/conv2d_gradfix.py", line 250, in forward
return torch._C._jit_get_operation(name)(weight_shape, grad_output,
RuntimeError: No such operator aten::cudnn_convolution_backward_weight
...
My env:
sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.6.r11.6/compiler.31057947_0
GPU 0: NVIDIA GeForce RTX 3090
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.12.0
OpenCV: 4.6.0
MMCV: 1.5.3
MMGen: 0.7.1+3542102
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3