[Bug] BEVFusion LiDAR-camera training: torch.distributed.elastic.multiprocessing.errors.ChildFailedError
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] I have read the FAQ documentation but cannot get the expected help.
- [X] The bug has not been fixed in the latest version (dev-1.x) or the latest version (dev-1.0).
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
Ubuntu 22.04, CUDA 11.8, GCC 11.3
Reproduces the problem - code sample
Hello, I am using the BEVFusion project in mmdetection3d. Initially, I was able to train the LiDAR-only model successfully. However, when I attempted to train the LiDAR-camera fusion model, training failed with `torch.distributed.elastic.multiprocessing.errors.ChildFailedError` (full log in the error message section below). I followed a solution posted on CSDN and reduced the batch size to the minimum value of 1, but the error persists. How can I resolve this issue?
Reproduces the problem - command or script
```bash
bash tools/dist_train.sh \
    projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 1 \
    --cfg-options \
        load_from=/home/dl/csl/mmdetection3d/work_dirs/bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d/epoch_20.pth \
        model.img_backbone.init_cfg.checkpoint=/home/dl/csl/mmdetection3d/swint-nuimages-pretrained.pth \
    --amp
```
Reproduces the problem - error message
```
RuntimeError: /tmp/mmcv/mmcv/ops/csrc/pytorch/cuda/sparse_indice.cu 126
cuda execution failed with error 2
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10975) of binary: /home/dl/anaconda3/envs/openmmlab/bin/python
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
```
(CUDA error 2 corresponds to `cudaErrorMemoryAllocation`, i.e. the sparse convolution kernel ran out of GPU memory, which is consistent with the out-of-memory problem described below.)
Additional information
No response
I have resolved the previous issue: I also set the GPU count in dist_train.sh to 1, after which training launched successfully.
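For context, here is a minimal sketch of how the GPU count flows through tools/dist_train.sh (a simplified reconstruction of the standard OpenMMLab launcher, not the verbatim upstream script; details may differ between versions):

```bash
# tools/dist_train.sh (simplified sketch)
CONFIG=$1
GPUS=$2            # setting this to 1 launches single-GPU "distributed" training
PORT=${PORT:-29500}

python -m torch.distributed.launch \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    tools/train.py $CONFIG --launcher pytorch "${@:3}"  # extra args are forwarded
```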
However, a new problem has arisen. When running fusion training, I encounter a CUDA out-of-memory error. My device is an RTX 3090 with 24 GB of VRAM; the original paper used distributed training on 8 RTX 3090 GPUs with a batch size of 4 per GPU. I have already reduced the batch size to 1 during training, and I am using the nuScenes-mini dataset. Despite this, I still get the following error:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.38 GiB (GPU 0; 23.68 GiB total capacity; 12.82 GiB already allocated; 1.87 GiB free; 20.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

I would like to know how to resolve this issue.
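The OOM message itself points at one possible mitigation: capping the allocator's split size to reduce fragmentation. A minimal sketch of that workaround (the `PYTORCH_CUDA_ALLOC_CONF` syntax is documented PyTorch behavior; the value 512 is only an illustrative starting point, not a tuned setting):

```bash
# Reduce caching-allocator fragmentation, as suggested by the OOM message above.
# 512 MB is an illustrative starting value; smaller values trade speed for headroom.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Re-run the same training command with the variable set.
bash tools/dist_train.sh \
    projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 1 --amp
```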
Same environment, same devices, same error
I solved this error by changing batch_size from 4 to 2 in train_dataloader in the bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py config file.
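For reference, the same change can be made without editing the config file, assuming MMEngine's standard `--cfg-options` dotted-key override syntax:

```bash
# Override the per-GPU batch size (4 -> 2) at launch time instead of
# editing the config file; dist_train.sh forwards extra args to tools/train.py.
bash tools/dist_train.sh \
    projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 1 \
    --cfg-options train_dataloader.batch_size=2
```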
I also encountered the same issue.