SoftGroup
RuntimeError: CUDA out of memory.
I'm running the latest version of the program. I can train successfully with the softgroup_s3dis_backbone_fold5.yaml config. When training with softgroup_s3dis_fold5.yaml, validation starts after the first epoch finishes, and the progress bar reports an error at around 11%: RuntimeError: CUDA out of memory. I tried reducing the learning rate to 0.001, but that did not help. I only have an RTX 3080 with 10 GB of memory; is there any way to continue training?
The S3DIS scenes are very large, with up to millions of points, so evaluation of this dataset requires a lot of memory. You can skip validation by using --skip_validate.
I skipped the validation step and finally completed the full S3DIS training, but I still get an error when testing: RuntimeError: CUDA out of memory. Is it possible to run the test on a single graphics card?
We run S3DIS on GPUs with large memory. I think 10 GB is not enough to run inference on S3DIS. One workaround is to add a continue in tools/test.py when meeting scans with a large number of points (e.g. 500k).
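A minimal sketch of that workaround, assuming the evaluation loop in tools/test.py iterates over a dataloader whose batches expose the per-point coordinates (the variable names below are assumptions; adapt them to the actual loop):

```python
import torch

MAX_POINTS = 500_000  # threshold above which a scan is skipped (tune for your GPU)

# ... model and dataloader are built earlier in tools/test.py ...
for i, batch in enumerate(dataloader):
    # assumption: the batch carries the per-point coordinates under 'coords'
    num_points = batch['coords'].shape[0]
    if num_points > MAX_POINTS:
        print(f'Skipping scan {i}: {num_points} points exceeds {MAX_POINTS}')
        continue  # skip this scan instead of running out of GPU memory
    with torch.no_grad():
        result = model(batch)
```

The skipped scans are simply left out of the reported metrics, so the evaluation numbers become an approximation.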
In the latest code I have the same problem. My GPUs are: GPU 0: NVIDIA GeForce RTX 3090, GPU 1: NVIDIA GeForce RTX 3090, GPU 2: NVIDIA GeForce RTX 3090, GPU 3: NVIDIA GeForce RTX 3090. Do I need even larger GPUs? The old code had no problem.
Are you having problems with training + validation?
YES.
Maybe distributed training requires more GPU memory. You can add --skip_validate to training, then evaluate after training finishes.
OK, thank you.
I skipped the validation step and finally completed the full S3DIS training, but I still get an error when testing: RuntimeError: CUDA out of memory. Is it possible to run the test on a single graphics card?
Excuse me, have you solved your problem? I have the same problem as you. Where should the continue be added in tools/test.py?
Maybe distributed training requires more GPU memory. You can add --skip_validate to training, then evaluate after training finishes.
@thangvubk --skip_validate only worked for training; it does not solve the problem at inference time.
I ran ./tools/dist_test.sh configs/softgroup_s3dis_fold5.yaml work_dirs/softgroup_s3dis_fold5/epoch_20.pth 8 on 8x 24 GB RTX 3090s, and it also raised the error:
RuntimeError: CUDA out of memory. Tried to allocate 8.45 GiB (GPU 4; 23.70 GiB total capacity; 10.61 GiB already allocated; 7.96 GiB free; 14.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The S3DIS scenes are very large, with up to millions of points, so evaluation of this dataset requires a lot of memory. You can skip validation by using --skip_validate.
When training on ScanNet, I hit a problem during validation: "RuntimeError: CUDA error: an illegal memory access was encountered. terminate called after throwing an instance of 'c10::CUDAError'". How can I deal with it?
That problem is mostly due to the GPU's capabilities. I don't know what the average number of points per scan in the ScanNet dataset is, but I found two options when facing this issue:
· Option 1: Reduce the memory footprint of val_dataset
When I worked with my own custom data on the STPLS3D backbone, I had the same problem and fixed it by cropping my scenes. In the original work the scenes were cropped into cells of 25 meters; in my case I could overcome the problem by cropping the scenes into cells of 10 meters (without losing the geometrical meaning).
The ScanNet dataset is a different case, and maybe cropping the scenes could help. However, a possible solution is to randomly downsample the validation point clouds and then preprocess the validation set again (see the sketch after this list). If the problem persists, keep downsampling the validation point clouds further (but be careful with the geometrical meaning; visualize the clouds before testing the model).
· Option 2: Buy and add new GPU(s) :)
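A minimal Python sketch of that downsampling idea, assuming a preprocessed validation scan is stored as an (xyz, rgb, label) tuple saved with torch.save; the file name and layout below are hypothetical, so adapt them to however your preprocessing script writes the validation set:

```python
import numpy as np
import torch

def random_downsample(xyz, rgb, label, keep_ratio=0.5, seed=0):
    """Randomly keep a fraction of the points of one validation scan.

    xyz, rgb, label are per-point arrays of equal length; keep_ratio controls
    how aggressively the scan is thinned.
    """
    rng = np.random.default_rng(seed)
    n = xyz.shape[0]
    keep = rng.choice(n, size=int(n * keep_ratio), replace=False)
    return xyz[keep], rgb[keep], label[keep]

# Hypothetical usage on one preprocessed scan:
# xyz, rgb, label = torch.load('val/scene_example.pth')
# xyz, rgb, label = random_downsample(xyz, rgb, label, keep_ratio=0.5)
# torch.save((xyz, rgb, label), 'val_downsampled/scene_example.pth')
```

Visualize a few downsampled scans before evaluating, since too small a keep_ratio can destroy thin structures and skew the metrics.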
Hello, I think this problem is caused by the pin_memory option. If you set it to True, data loading is faster, but you need more GPU memory (2x 3090 did not work for me). So I changed it from True to False, and I can train and validate successfully. But when validating you need 100 GB+ of RAM, I guess.
My configuration: 125.5 GiB RAM, 2x RTX 3090 (24 GB).
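For reference, a minimal sketch of where that option lives, assuming the validation loader is built roughly like a standard PyTorch DataLoader; the variable names and the collate_fn attribute are assumptions, since in the actual code the loader is constructed inside the training script:

```python
from torch.utils.data import DataLoader

# Hypothetical construction of the validation loader; adapt the names to the
# place in the code where the loader is actually built.
val_loader = DataLoader(
    val_dataset,                        # the validation dataset object (assumed to exist)
    batch_size=1,
    shuffle=False,
    num_workers=4,
    collate_fn=val_dataset.collate_fn,  # assumption: the dataset provides its own collate
    pin_memory=False,                   # False trades slower host-to-device copies for less pinned host memory
)
```

Note that pin_memory allocates page-locked host RAM rather than GPU memory, so turning it off mainly relieves pressure on system memory during validation.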