
FP16 training error: RuntimeError: expected scalar type Half but found Float

Open zehuichen123 opened this issue 3 years ago • 6 comments

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

I am running SECOND with FP16 on the KITTI dataset and get the error in the title.

Reproduction

  1. What command or script did you run?
./tools/dist_train.sh mmdetection3d/configs/fp16/hv_second_secfpn_fp16_6x8_80e_kitti-3d-3class.py 1
  2. Did you make any modifications on the code or config? Did you understand what you have modified?

Currently NO.

  3. What dataset did you use?

KITTI dataset.

Environment

  1. Please run python mmdet3d/utils/collect_env.py to collect necessary environment information and paste it here.
$ python mmdet3d/utils/collect_env.py
sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
PyTorch: 1.6.0+cu101
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  - CuDNN 7.6.3
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.7.0+cu101
OpenCV: 4.5.4-dev
MMCV: 1.3.8
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMDetection: 2.14.0
MMSegmentation: 0.14.1
MMDetection3D: 0.16.0+85547d
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

If applicable, paste the error traceback here.

  File "mmdet3d/ops/spconv/functional.py", line 65, in forward
    indice_pair_num, num_activate_out, False, True)
  File "mmdet3d/ops/spconv/ops.py", line 119, in indice_conv
    int(subm))
RuntimeError: expected scalar type Half but found Float

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

Currently NO.

zehuichen123 avatar Nov 17 '21 11:11 zehuichen123

[UPDATE] This error only occurs with torch>=1.6, where torch.cuda.amp is used for fp16 training. I have verified that training runs successfully with torch==1.5.0, where mmcv-style mixed-precision training is used instead.
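The version dependence described in the update can be sketched as a small check. This is only an illustration of the 1.6 threshold reported in this thread; the function names and the returned labels are hypothetical and are not part of mmcv or PyTorch.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '1.6.0+cu101'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def fp16_backend(torch_version):
    """Return the fp16 path this thread associates with a PyTorch version.

    Hypothetical helper: torch>=1.6 is reported to use torch.cuda.amp,
    older versions the mmcv-style mixed-precision implementation.
    """
    if parse_major_minor(torch_version) >= (1, 6):
        return "torch.cuda.amp"
    return "mmcv-style fp16"
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.10" would sort before "1.6" and be misclassified.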

zehuichen123 avatar Nov 17 '21 14:11 zehuichen123

Another question, about dynamic voxelization with FP16. I am trying to reimplement dynamic voxelization with SECOND using mixed-precision training. However, I found that the scatter op (perhaps its C++ implementation) only supports float inputs, while the VFE layer only accepts half inputs. Therefore, I had to hack the code in mmdet3d/models/voxel_encoder/voxel_encoder.py at line 268 as follows:

for i, vfe in enumerate(self.vfe_layers):
    features = features.half()  # convert to half for vfe inputs
    point_feats = vfe(features)
    if (i == len(self.vfe_layers) - 1 and self.fusion_layer is not None
            and img_feats is not None):
        point_feats = self.fusion_layer(img_feats, points, point_feats,
                                        img_metas)
    point_feats = point_feats.float()  # convert to float for vfe_scatter
    voxel_feats, voxel_coors = self.vfe_scatter(point_feats, coors)
    if i != len(self.vfe_layers) - 1:
        feat_per_point = self.map_voxel_center_to_point(
            coors, voxel_feats, voxel_coors)
        features = torch.cat([point_feats, feat_per_point], dim=1)

Although this code runs fine with 1 or 4 GPUs in distributed training, it fails with 8 GPUs. The error is as follows:

Traceback (most recent call last):
  File "./tools/train.py", line 224, in <module>
    main()
  File "./tools/train.py", line 220, in main
    meta=meta)
  File "mmdet3d/apis/train.py", line 35, in train_model
    meta=meta)
  File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 259, in after_train_iter
    scaled_loss.backward()
  File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:69, unhandled cuda error, NCCL version 2.4.8

This error does not occur with torch 1.6, perhaps because AMP converts automatically between fp16 and fp32 (I am not sure), whereas mmcv-fp16 does not. The most direct solution is to implement vfe_scatter with half support, but is there a quick workaround?

zehuichen123 avatar Nov 17 '21 15:11 zehuichen123

@zehuichen123 Hi, I hit the same error. Have you solved the problem?

Cc-Hy avatar Aug 30 '22 05:08 Cc-Hy

I hit the same error and fixed it with "pip install spconv-cu113". My PyTorch version is 1.21 and my CUDA version is 11.3. I found that the spconv implementation in mmcv is not compatible with fp16, so I installed spconv-cu113 with pip, and that fixed it.
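To confirm which pip-installed spconv distribution, if any, is present in an environment, a check along the following lines may help. It is a minimal sketch using only the standard library. The function name and the candidate list are assumptions: "spconv-cu113" is the package named in this comment, and "spconv" is assumed as a fallback name. importlib.metadata requires Python 3.8 or later; on Python 3.7 the importlib_metadata backport offers the same API.

```python
from importlib import metadata  # Python 3.8+


def find_installed_spconv(candidates=("spconv-cu113", "spconv")):
    """Return (distribution_name, version) for the first installed candidate.

    Returns None when none of the candidate distributions is installed.
    The candidate names are assumptions; adjust them to your CUDA version.
    """
    for name in candidates:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None
```

A result of None suggests that only the spconv ops bundled with mmcv or mmdet3d are available in that environment.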

brucejiangsaic avatar Sep 01 '22 12:09 brucejiangsaic

> @zehuichen123 Hi, I hit the same error. Have you solved the problem?

Take a look at my reply above; it may be worth trying.

brucejiangsaic avatar Sep 01 '22 12:09 brucejiangsaic

> > @zehuichen123 Hi, I hit the same error. Have you solved the problem?
>
> Take a look at my reply above; it may be worth trying.

Will try it.

Cc-Hy avatar Sep 14 '22 04:09 Cc-Hy

Using spconv 2.0 instead of mmcv.ops.spconv can solve the problem. spconv 2.0 is also the recommended option because it consumes less memory.

ZCMax avatar Nov 25 '22 09:11 ZCMax

> > @zehuichen123 Hi, I hit the same error. Have you solved the problem?
>
> Take a look at my reply above; it may be worth trying.

Hi, I have the same problem. I tried your solution, but the problem still exists. Could anything else be causing it? If you know, I would appreciate a reply. Thanks.

sevenlhl avatar Feb 26 '24 08:02 sevenlhl