
RuntimeError: CUDA out of memory.

Autherparadox opened this issue 2 years ago · 12 comments

I'm running the latest version of the code. I can train successfully with the softgroup_s3dis_backbone_fold5.yaml config. But when training with softgroup_s3dis_fold5.yaml, validation after the first epoch fails: the progress bar reaches about 11% and then the run crashes with RuntimeError: CUDA out of memory. I tried reducing the learning rate to 0.001, but that did not help. I only have an RTX 3080 with 10 GB of memory. Is there any way to continue training?

Autherparadox avatar Apr 19 '22 08:04 Autherparadox

The S3DIS scenes are very large with up to millions of points, so evaluation of this dataset requires a lot of memory. You can skip validation by using --skip_validate.

thangvubk avatar Apr 19 '22 08:04 thangvubk

I skipped the validation step and completed the full training on S3DIS, but I still get an error when testing: RuntimeError: CUDA out of memory. Is it possible to run testing with a single graphics card?

Autherparadox avatar Apr 19 '22 16:04 Autherparadox

We run S3DIS on GPUs with large memory. I think 10 GB is not enough to run inference on S3DIS. One workaround is to add a `continue` in tools/test.py to skip scans with a large number of points (e.g., 500k).

thangvubk avatar Apr 19 '22 16:04 thangvubk
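For readers wondering where the `continue` would go (as asked later in this thread), here is a minimal sketch of the idea, assuming the inference loop in tools/test.py iterates over a dataloader and that each batch exposes its point coordinates under a key like 'coords'. The names below are illustrative; the actual loop in the repo may differ.

```python
import torch  # already imported in tools/test.py

MAX_POINTS = 500_000  # threshold from the suggestion above; tune for your GPU memory

# `dataloader` and `model` stand in for the objects built earlier in tools/test.py.
for i, batch in enumerate(dataloader):
    num_points = batch['coords'].shape[0]  # assumed (N, 3) coordinate tensor
    if num_points > MAX_POINTS:
        print(f'Skipping scan {i} with {num_points} points to avoid OOM')
        continue  # skip scans that do not fit in GPU memory
    with torch.no_grad():
        result = model(batch)
```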

In the latest code I have the same problem. My GPUs are 4x NVIDIA GeForce RTX 3090 (GPU 0–3). Do I need larger GPUs? The old code had no such problem.

Atopis avatar Apr 25 '22 07:04 Atopis

Are you having the problem with training + validation?

thangvubk avatar Apr 25 '22 08:04 thangvubk

YES.

Atopis avatar Apr 25 '22 08:04 Atopis

Maybe distributed training requires more GPU memory. You can add --skip_validate to training. Then evaluate after finishing training.

thangvubk avatar Apr 25 '22 09:04 thangvubk

OK, Thank you

Atopis avatar Apr 25 '22 09:04 Atopis

I skipped the validation step and completed the full training on S3DIS, but I still get an error when testing: RuntimeError: CUDA out of memory. Is it possible to run testing with a single graphics card?

Excuse me, have you solved your problem? I have the same problem as you. Where should the 'continue' be added in 'tools/test.py'?

azeaa avatar Jul 05 '22 01:07 azeaa

Maybe distributed training requires more GPU memory. You can add --skip_validate to training. Then evaluate after finishing training.

@thangvubk --skip_validate only works for training; it doesn't solve the problem at inference time. I ran ./tools/dist_test.sh configs/softgroup_s3dis_fold5.yaml work_dirs/softgroup_s3dis_fold5/epoch_20.pth 8 on 8x 24 GB RTX 3090s and it still raised the error:

RuntimeError: CUDA out of memory. Tried to allocate 8.45 GiB (GPU 4; 23.70 GiB total capacity; 10.61 GiB already allocated; 7.96 GiB free; 14.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

xbbkok avatar Aug 22 '22 01:08 xbbkok
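The allocator hint at the end of that traceback can be tried as shown below. Note that it only mitigates fragmentation (the "reserved memory >> allocated memory" case the message itself points at) and will not help if a single scan simply needs more memory than the GPU has. This is a minimal sketch: the variable must be set before CUDA is initialized, and the value 128 is only an example.

```python
import os

# Must happen before CUDA is initialized (i.e. before the first CUDA tensor is
# created); equivalently, export the variable in the shell before launching
# ./tools/dist_test.sh.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'  # 128 is an example value

import torch  # imported after setting the allocator option
```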

The S3DIS scenes are very large with up to millions of points, so evaluation of this dataset requires a lot of memory. You can skip validation by using --skip_validate.

When training on ScanNet, I hit this problem during validation: "RuntimeError: CUDA error: an illegal memory access was encountered" followed by "terminate called after throwing an instance of 'c10::CUDAError'".

How can I deal with it?

YuerGu avatar Nov 21 '22 11:11 YuerGu

When training on ScanNet, I hit this problem during validation: "RuntimeError: CUDA error: an illegal memory access was encountered" followed by "terminate called after throwing an instance of 'c10::CUDAError'".

How can I deal with it?

That problem is mostly due to GPU capacity. I don't know the average number of points per scan in the ScanNet dataset, but I found two options when facing this issue:

· Option 1: Reduce the memory footprint of the validation set (val_dataset)

When I worked with my own custom data on the STPLS3D backbone I had the same problem, and I fixed it by cropping my scenes. In the original work the scenes were cropped into cells of 25 meters; in my case I overcame the problem by cropping the scenes into cells of 10 meters (without losing the geometric meaning).

ScanNet is a different case, and cropping the scenes might help there too. However, another possible solution is to randomly downsample the validation point clouds and then preprocess the validation set again (a minimal downsampling sketch is included after this comment). If the problem persists, keep downsampling the validation point clouds further, but be careful to preserve the geometric meaning: visualize the clouds before testing the model.

· Option 2: Buy and add new GPU(s) :)

LinoComesana avatar Mar 24 '23 12:03 LinoComesana
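A minimal sketch of the random downsampling mentioned in Option 1 above, assuming each scan is stored as an (N, C) NumPy array whose rows carry xyz plus any per-point attributes. Where this hooks into the dataset preprocessing is repo-specific, and the 300k threshold is only an example.

```python
import numpy as np

def random_downsample(points, max_points=300_000, seed=0):
    """Randomly keep at most `max_points` rows of an (N, C) point array.

    Rows are assumed to carry xyz plus any per-point attributes (color,
    label), so all attributes stay aligned after sampling.
    """
    n = points.shape[0]
    if n <= max_points:
        return points
    rng = np.random.default_rng(seed)
    keep = rng.choice(n, size=max_points, replace=False)
    return points[keep]
```

As the comment above advises, visualize the downsampled clouds before testing to make sure the scene geometry is preserved.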

Hello, I think this problem is caused by the pin_memory option. If you set it to True, data loading is faster, but you need more memory (2x 3090 did not work for me). After I set it from True to False, I could train and validate successfully. But when validating, I guess you still need 100 GB+ of RAM.

My configuration: 125.5 GiB RAM, 2x RTX 3090 (24 GB).

wdczz avatar Oct 27 '23 05:10 wdczz
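For reference, pin_memory is a standard torch.utils.data.DataLoader argument; where SoftGroup exposes it (a config field or directly in code) is not shown in this thread, so the following is only a generic illustration of the trade-off described above, with `val_dataset` standing in for the repo's validation dataset object.

```python
from torch.utils.data import DataLoader

# Generic illustration only; `val_dataset` is assumed to be the validation
# dataset built elsewhere by the repo.
val_loader = DataLoader(
    val_dataset,
    batch_size=1,
    num_workers=4,
    # True pins batches in page-locked host RAM for faster host-to-GPU copies;
    # False trades some transfer speed for lower memory pressure.
    pin_memory=False,
)
```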