CUDA OOM Error

Open ootts opened this issue 4 years ago • 5 comments

I am trying to train the SA-SSD model, but I encounter a CUDA out-of-memory error. I tried batch size 1, but the OOM error remains. I am using a TITAN V with 12.7G of memory, and my PyTorch version is 1.2.0.

ootts avatar Apr 20 '20 00:04 ootts

By the way, you need to install torch 1.1. I created a Dockerfile that I use to train the model, and it works even on a 2060 with 6 GB at batch size 1. I also found that memory usage grows on a V100, to about 15 GB with default settings. @skyhehe123, have you seen this before? I don't know why memory usage changes across GPUs.

stalkermustang avatar Apr 20 '20 01:04 stalkermustang

I installed PyTorch 1.1.0, but there is still not enough memory.

ootts avatar Apr 20 '20 02:04 ootts

Did you build the Docker image and try training inside it?

stalkermustang avatar Apr 20 '20 10:04 stalkermustang

I used the Docker image as a reference because it is slightly different from my environment. I installed spconv using the commands from the Docker image, and installed pytorch 1.1.0 and torchvision 0.3.0. But the OOM error still remains.
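
For anyone reproducing this setup, a quick way to confirm the environment matches the versions mentioned in this thread is a small check like the one below. This is only a sketch: `check_environment`, `installed_version`, and `EXPECTED` are hypothetical names, not part of SA-SSD or spconv, and the thread does not state a spconv version, so spconv is only checked for presence. It assumes Python 3.8+ for `importlib.metadata`.

```python
# Hypothetical helper, not part of SA-SSD: compares installed package
# versions against the ones mentioned in this thread.
from importlib import metadata  # available in Python 3.8+

# Versions reported in the thread. The spconv version is not stated there,
# so None means "only check that it is installed".
EXPECTED = {"torch": "1.1.0", "torchvision": "0.3.0", "spconv": None}


def installed_version(name):
    """Return the installed version string of a package, or None if missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED, lookup=installed_version):
    """Return a list of human-readable mismatches (empty if everything matches)."""
    problems = []
    for name, wanted in expected.items():
        found = lookup(name)
        if found is None:
            problems.append(f"{name}: not installed")
        elif wanted is not None and not found.startswith(wanted):
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches"]:
        print(line)
```

The `lookup` parameter exists only so the check can be exercised without the packages installed. A matching environment does not rule out the OOM error; it only removes version mismatch as a suspect.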

ootts avatar Apr 21 '20 00:04 ootts

I think I found the problem. It has nothing to do with this repo; it is in spconv, which has bugs on the TITAN V GPU. The TITAN XP works fine.

ootts avatar Apr 23 '20 00:04 ootts