py-R-FCN-multiGPU

GPU memory not released after interrupting the training script

Open LearnerInGithub opened this issue 8 years ago • 3 comments

Hi @bharatsingh430, I ran into a problem: GPU memory is not released after I interrupt the training script. In detail, I was training on 2 GPUs ([0,1]). After I pressed Ctrl+C to stop the training script, I ran nvidia-smi to check GPU usage and found that only GPU 1 had released its memory, while GPU 0 still held its allocation. Even after waiting a long time, the problem persisted. What could cause this, and how can I fix it? PS: I tried killing the Python process, but it did not work. Waiting for your help! Thank you very much!

LearnerInGithub avatar Jun 26 '17 16:06 LearnerInGithub

You can run ps aux | grep caffe to get the PIDs, then kill each one with kill -9 <pid>. That should work.

bharatsingh430 avatar Jun 26 '17 16:06 bharatsingh430
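The suggestion above (list processes, filter by name, send signal 9) can be sketched in Python. This is only an illustration under stated assumptions: the helper names `find_pids` and `kill_matching` are hypothetical, the `ps aux` output is assumed to have the usual 11 columns, and the pattern `caffe` only matches if that string appears in the command line of the training processes, so it may need adjusting (for example to the name of the training script).

```python
import os
import signal
import subprocess


def find_pids(ps_output, pattern):
    """Return PIDs from `ps aux` text whose command line contains `pattern`."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # ps aux prints 11 columns; the last one is the full command line
        fields = line.strip().split(None, 10)
        if len(fields) < 11:
            continue
        pid, command = int(fields[1]), fields[10]
        # never return our own PID, so the caller cannot kill this script
        if pattern in command and pid != os.getpid():
            pids.append(pid)
    return pids


def kill_matching(pattern="caffe"):
    """Send SIGKILL (the signal used by `kill -9`) to every matching process."""
    ps_output = subprocess.check_output(["ps", "aux"]).decode()
    for pid in find_pids(ps_output, pattern):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between listing and killing
```

Calling `kill_matching("caffe")` would then play the role of the manual `kill -9` step. Printing the result of `find_pids` first is a cautious way to check which processes would be affected.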

@bharatsingh430 It doesn't always work. This time the processes were killed by following your instructions, but the GPU memory is still occupied and GPU utilization is at 100%. How can I clear the occupied GPU memory? Thank you!

LearnerInGithub avatar Jun 28 '17 08:06 LearnerInGithub
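When memory stays occupied after the obvious processes are gone, one diagnostic step is to ask nvidia-smi which PIDs it still associates with GPU memory. The following is a minimal sketch, not a fix: the function names are hypothetical, and it assumes an nvidia-smi version that supports the `--query-compute-apps` option with CSV output. It may also return nothing if the driver no longer attributes the memory to any live process.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse `pid, used_memory` rows into a list of (pid, memory string) tuples."""
    apps = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank lines or rows in an unexpected format
        apps.append((int(parts[0]), parts[1]))
    return apps


def gpu_processes():
    """Ask nvidia-smi which PIDs currently hold GPU memory."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-compute-apps=pid,used_memory",
        "--format=csv,noheader",
    ]).decode()
    return parse_compute_apps(out)
```

Any PID returned by `gpu_processes()` that belongs to the interrupted training run is a candidate for `kill -9`. Running `fuser -v /dev/nvidia*` is another commonly used way to see which processes still have the GPU device files open.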

I ran into the same problem. The process becomes a zombie process whose parent process is "init", so I had to reboot my machine.

smuelpeng avatar Aug 24 '17 10:08 smuelpeng
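To check whether leftover processes are in a state that `kill -9` cannot affect, the process state column of `ps` can be inspected. A zombie (state `Z`) has already exited and only disappears once its parent reaps it, and a process in uninterruptible sleep (state `D`) cannot be killed until the kernel call it is waiting on returns. The sketch below only lists such processes; the helper names are hypothetical, it assumes the Linux procps `ps`, and whether either state explains the stuck GPU memory in this thread is not confirmed.

```python
import subprocess


def find_stuck(ps_output):
    """Return (pid, ppid, stat, command) for zombie (Z) or uninterruptible (D) processes."""
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.strip().split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, stat, command = fields
        # the first character of STAT is the main process state
        if stat[0] in ("Z", "D"):
            stuck.append((int(pid), int(ppid), stat, command))
    return stuck


def list_stuck():
    """Run ps and report processes that SIGKILL will not remove right away."""
    out = subprocess.check_output(["ps", "-eo", "pid,ppid,stat,comm"]).decode()
    return find_stuck(out)
```

A parent PID of 1 in the result corresponds to the situation described above, where the leftover process has been re-parented to init.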