bevfusion icon indicating copy to clipboard operation
bevfusion copied to clipboard

cuda execution failed with error

Open qfwysw opened this issue 3 years ago • 7 comments

The code works fine when I set the gpu <=2. When I set the gpu ==4, the error occurs. What can I do to slove it ? image

qfwysw avatar Aug 12 '22 06:08 qfwysw

Typically this is an error related to OOM. Please check whether there are other processes running on the GPUs you are using. Here is a solution that can reduce the inference memory.

By the way, We will release a new version of the codebase that consumes significantly less memory than the current version in the near future.

kentang-mit avatar Aug 12 '22 07:08 kentang-mit

I use the A100 48G to inference. But I get the same result.

qfwysw avatar Aug 12 '22 07:08 qfwysw

It's better if you can provide more details about your experiments. For example: command used, full error log, the environment (versions of key libraries, type of GPU), etc. I will work on a docker version soon to avoid these problems in the future.

kentang-mit avatar Aug 12 '22 16:08 kentang-mit

python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT tools/test_mmdet3d.py $CONFIG --launcher pytorch ${@:3} image image I just replaced the test.py with the one from mmdet3d. The program works fine when the number of gpus is less than 3. I think it should have nothing to do with the test.py.

qfwysw avatar Aug 13 '22 02:08 qfwysw

I think you probably want to try removing lines 22-25 in this file and rerun python setup.py develop. Our current code is not compiled for P40 by default (should be sm61/compute61).

kentang-mit avatar Aug 13 '22 07:08 kentang-mit

Sorry, It's all my fault. I replaced the test.py with the one from mmdet3d. But it doesn't open the muti tasks. However, adding -H localhost:port with different port to avoid port conflicts didn't work. image

qfwysw avatar Aug 13 '22 08:08 qfwysw

image image I print the tcp. From the results, setting the port does not seem to have any effect

qfwysw avatar Aug 13 '22 10:08 qfwysw

For the TCP problem, I would suggest you to open a separate issue and clearly describe the problem. I remember that in this issue we do have similar discussions and the problem was solved at that time.

kentang-mit avatar Aug 17 '22 21:08 kentang-mit