mmdetection3d
FP16 training error: RuntimeError: expected scalar type Half but found Float
Thanks for your error report and we appreciate it a lot.
Checklist
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
I am running SECOND with FP16 on the KITTI dataset and got the error.
Reproduction
- What command or script did you run?
./tools/dist_train.sh mmdetection3d/configs/fp16/hv_second_secfpn_fp16_6x8_80e_kitti-3d-3class.py 1
- Did you make any modifications on the code or config? Did you understand what you have modified?
Currently NO.
- What dataset did you use?
KITTI dataset.
Environment
- Please run
python mmdet3d/utils/collect_env.py
to collect the necessary environment information and paste it here.
$ python mmdet3d/utils/collect_env.py
sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
PyTorch: 1.6.0+cu101
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.7.0+cu101
OpenCV: 4.5.4-dev
MMCV: 1.3.8
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMDetection: 2.14.0
MMSegmentation: 0.14.1
MMDetection3D: 0.16.0+85547d
- You may add additional information that may be helpful for locating the problem, such as
- How you installed PyTorch [e.g., pip, conda, source]
- Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
Error traceback
If applicable, paste the error traceback here.
File "mmdet3d/ops/spconv/functional.py", line 65, in forward
indice_pair_num, num_activate_out, False, True)
File "mmdet3d/ops/spconv/ops.py", line 119, in indice_conv
int(subm))
RuntimeError: expected scalar type Half but found Float
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here, and that would be much appreciated!
Currently NO.
[UPDATE] This error only occurs with torch>=1.6, where torch.cuda.amp is used for fp16 training. I've verified that it runs successfully with torch==1.5.0, where mmcv-style mixed-precision training is adopted instead.
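For reference, here is a minimal sketch (not the actual mmdet3d code) of why the dtype mismatch can appear under torch.cuda.amp: autocast does not cover custom autograd Functions such as the spconv indice_conv, so their inputs may arrive as a mix of Half and Float tensors. One common workaround is to pin such an op to a single dtype with torch.cuda.amp.custom_fwd/custom_bwd. IndiceConvLike below is a hypothetical stand-in, not the real spconv function.

import torch
from torch.cuda.amp import custom_bwd, custom_fwd


class IndiceConvLike(torch.autograd.Function):
    # Hypothetical stand-in for the autograd Function in
    # mmdet3d/ops/spconv/functional.py; the real op calls a CUDA kernel.

    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # run this op in fp32 even under autocast
    def forward(ctx, features, weight):
        ctx.save_for_backward(features, weight)
        return features @ weight  # placeholder for the sparse conv kernel

    @staticmethod
    @custom_bwd
    def backward(ctx, grad_out):
        features, weight = ctx.saved_tensors
        return grad_out @ weight.t(), features.t() @ grad_out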
Another question about dynamic voxelization with FP16.
I am trying to reimplement dynamic voxelization with SECOND based on mixed-precision training. However, I found that the scatter op (perhaps its C++ implementation) only supports float inputs, while the VFE layer only accepts half inputs. Therefore, I have to hack the code in mmdet3d/models/voxel_encoder/voxel_encoder.py
line 268 with:
for i, vfe in enumerate(self.vfe_layers):
    features = features.half()  # convert to half for vfe inputs
    point_feats = vfe(features)
    if (i == len(self.vfe_layers) - 1 and self.fusion_layer is not None
            and img_feats is not None):
        point_feats = self.fusion_layer(img_feats, points, point_feats,
                                        img_metas)
    point_feats = point_feats.float()  # convert to float for vfe_scatter
    voxel_feats, voxel_coors = self.vfe_scatter(point_feats, coors)
    if i != len(self.vfe_layers) - 1:
        feat_per_point = self.map_voxel_center_to_point(
            coors, voxel_feats, voxel_coors)
        features = torch.cat([point_feats, feat_per_point], dim=1)
Although this code runs well with 1- or 4-GPU distributed training, it fails when I run it with 8 GPUs. The error is reported as follows:
Traceback (most recent call last):
File "./tools/train.py", line 224, in <module>
main()
File "./tools/train.py", line 220, in main
meta=meta)
File "mmdet3d/apis/train.py", line 35, in train_model
meta=meta)
File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/mmdet/apis/train.py", line 170, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.call_hook('after_train_iter')
File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 259, in after_train_iter
scaled_loss.backward()
File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/nfs/chenzehui/others/miniconda3/envs/aav2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:69, unhandled cuda error, NCCL version 2.4.8
This error does not exist on torch 1.6, perhaps due to the automatic fp16<-->fp32 conversion done by AMP (I am not sure), whereas mmcv-style fp16 does not do this. The most direct solution is to implement vfe_scatter with half support, but is there any faster workaround?
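As a possibly faster workaround than the manual .half()/.float() casts above, mmcv's force_fp32 decorator can do the half-to-float (and back) casting automatically under mmcv-style fp16 training. This is only a sketch under the assumption that the scatter call is wrapped in a small module; ScatterWrapper is hypothetical, not existing mmdet3d code.

import torch.nn as nn
from mmcv.runner import force_fp32


class ScatterWrapper(nn.Module):
    # Hypothetical wrapper around the existing vfe_scatter so that the
    # float-only scatter op always sees fp32 inputs during fp16 training.

    def __init__(self, scatter):
        super().__init__()
        self.scatter = scatter
        self.fp16_enabled = True  # required for force_fp32 to take effect

    @force_fp32(apply_to=('point_feats',), out_fp16=True)
    def forward(self, point_feats, coors):
        # point_feats arrives as fp32 here even if the caller passed fp16;
        # out_fp16=True casts float outputs back to half for the next vfe layer.
        return self.scatter(point_feats, coors)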
@zehuichen123 Hi, I met the same error. Have you solved the problem?
I met the same error and fixed it with "pip install spconv-cu113"; my PyTorch version is 1.21 and my CUDA version is 11.3. I found that the spconv implementation in mmcv is not compatible with fp16, so installing spconv-cu113 fixed it for me.
@zehuichen123 Hi, I met the same error. Have you solved the problem?
See my reply above, you may try it.
Will try it.
Using spconv 2.0 instead of mmcv.ops.spconv could solve the problem, and the former is also recommended since it consumes less memory.
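A quick sanity check (assuming a standalone spconv 2.x wheel such as spconv-cu113 is installed) to confirm that spconv 2.0 is importable before switching over; the module layout below follows spconv 2.x and may differ between versions.

import spconv
print(spconv.__version__)  # expect a 2.x version for the standalone wheel

import spconv.pytorch as spconv_pt
# spconv 2.x exposes the sparse conv layers used by SECOND under spconv.pytorch
print(spconv_pt.SubMConv3d, spconv_pt.SparseConv3d)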
@zehuichen123 Hi, I met the same error. Have you solved the problem?
See my reply above, you may try it.
Hi, I met the same problem. Although I've tried your solution, the problem still exists. Are there any other possible causes? If you know, I'm looking forward to your reply. Thanks.