[Bug] BEVFusion LiDAR-camera training: torch.distributed.elastic.multiprocessing.errors.ChildFailedError
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] I have read the FAQ documentation but cannot get the expected help.
- [X] The bug has not been fixed in the latest version (dev-1.x) or the latest version (dev-1.0).
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
Ubuntu 22.04, CUDA 11.8, GCC 11.3
Reproduces the problem - code sample
Hello, I am using the BEVFusion project in mmdetection3d. Initially, I was able to train the LiDAR-only model successfully. However, when I attempted to train the LiDAR-camera fusion model, training failed with `torch.distributed.elastic.multiprocessing.errors.ChildFailedError` (full log in the error message section below). I followed a solution posted on CSDN and reduced the batch size to the minimum value of 1, but the error persists. How can I resolve this issue?
Reproduces the problem - command or script
```bash
bash tools/dist_train.sh \
    projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 1 \
    --cfg-options \
        load_from=/home/dl/csl/mmdetection3d/work_dirs/bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d/epoch_20.pth \
        model.img_backbone.init_cfg.checkpoint=/home/dl/csl/mmdetection3d/swint-nuimages-pretrained.pth \
    --amp
```
Reproduces the problem - error message
```
RuntimeError: /tmp/mmcv/mmcv/ops/csrc/pytorch/cuda/sparse_indice.cu 126
cuda execution failed with error 2
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10975) of binary: /home/dl/anaconda3/envs/openmmlab/bin/python
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
```
(CUDA error 2 corresponds to `cudaErrorMemoryAllocation`, i.e. the sparse convolution kernel ran out of GPU memory, which is consistent with the out-of-memory problem described below.)
Additional information
No response
I have resolved the previous issue: I also set the GPU count in dist_train.sh to 1, after which training launched successfully.
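For context, here is a minimal sketch of how the GPU count flows through tools/dist_train.sh (a simplified reconstruction of the standard OpenMMLab launcher, not the verbatim upstream script; details may differ between versions):

```bash
# tools/dist_train.sh (simplified sketch)
CONFIG=$1
GPUS=$2            # setting this to 1 launches single-GPU "distributed" training
PORT=${PORT:-29500}

python -m torch.distributed.launch \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    tools/train.py $CONFIG --launcher pytorch "${@:3}"  # extra args are forwarded
```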
However, a new problem has arisen. When running fusion training, I encounter a CUDA out-of-memory error. My device is an RTX 3090 with 24 GB of VRAM; the original paper used distributed training on 8 RTX 3090 GPUs with a batch size of 4 per GPU. I have already reduced the batch size to 1 during training, and I am using the nuScenes-mini dataset. Despite this, I still get the following error:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.38 GiB (GPU 0; 23.68 GiB total capacity; 12.82 GiB already allocated; 1.87 GiB free; 20.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

I would like to know how to resolve this issue.
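The OOM message itself points at one possible mitigation: capping the allocator's split size to reduce fragmentation. A minimal sketch of that workaround (the `PYTORCH_CUDA_ALLOC_CONF` syntax is documented PyTorch behavior; the value 512 is only an illustrative starting point, not a tuned setting):

```bash
# Reduce caching-allocator fragmentation, as suggested by the OOM message above.
# 512 MB is an illustrative starting value; smaller values trade speed for headroom.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Re-run the same training command with the variable set.
bash tools/dist_train.sh \
    projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 1 --amp
```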
Same environment, same devices, same error
I solved this error by changing batch_size from 4 to 2 in train_dataloader in the bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py config file.
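For reference, the same change can be made without editing the config file, assuming MMEngine's standard `--cfg-options` dotted-key override syntax:

```bash
# Override the per-GPU batch size (4 -> 2) at launch time instead of
# editing the config file; dist_train.sh forwards extra args to tools/train.py.
bash tools/dist_train.sh \
    projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 1 \
    --cfg-options train_dataloader.batch_size=2
```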
I also encountered the same issue.