
Deepspeed not killing process if script exits due to error?

Open Santosh-Gupta opened this issue 3 years ago • 2 comments

I've been debugging my code to run with DeepSpeed, and I started noticing CUDA out-of-memory errors. When I checked nvidia-smi, most of my GPU memory was still in use. When I checked my Python processes, I saw a bunch of instances of my DeepSpeed scripts, I think one for each attempt at running my script.

https://snipboard.io/R9DVu3.jpg

I had to stop each one directly using kill -9. After that, I saw that the GPU memory was released.
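
For anyone who ends up in the same state, here is a rough sketch of that manual cleanup in Python. It assumes a POSIX system and that you already have the PIDs of the leftover processes (for example from the process table that nvidia-smi prints). `stop_process` is a hypothetical helper name, not part of DeepSpeed. It tries SIGTERM first and only falls back to SIGKILL, the signal behind kill -9, if the process is still around after a grace period.

```python
import os
import signal
import time


def stop_process(pid: int, grace_seconds: float = 5.0) -> bool:
    """Ask a process to exit with SIGTERM, then fall back to SIGKILL.

    Hypothetical helper for illustration only.
    Returns True if SIGKILL had to be sent, False otherwise.
    """
    try:
        os.kill(pid, signal.SIGTERM)  # polite request first
    except ProcessLookupError:
        return False  # already gone

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            # Signal 0 sends nothing; it only checks that the PID exists.
            # Note: an exited child of this process that has not been
            # reaped yet (a zombie) still counts as existing here.
            os.kill(pid, 0)
        except ProcessLookupError:
            return False  # exited within the grace period
        time.sleep(0.1)

    try:
        os.kill(pid, signal.SIGKILL)  # same signal as `kill -9`
    except ProcessLookupError:
        return False
    return True
```

Calling it once per leftover PID, e.g. `stop_process(12345)` with a made-up PID, should be enough. Killing processes owned by another user will raise PermissionError, which this sketch does not handle.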

Santosh-Gupta avatar Apr 29 '21 02:04 Santosh-Gupta

It looks like just running one deepspeed script results in around 40 processes being created.

Santosh-Gupta avatar Apr 29 '21 06:04 Santosh-Gupta

I have the same problem when I use deepspeed.init_inference. Did you solve the problem?

Luowaterbi avatar May 09 '23 16:05 Luowaterbi

I have the same issue here with a larger Hugging Face model (i.e. 30B) using deepspeed.init_inference, but no problem with a smaller model (6.7B).

allanj avatar Jul 03 '23 03:07 allanj

Hi @allanj - do you know what signal is being sent to your processes? We just added support to clean up gracefully for SIGINT and SIGTERM. Could you try with the latest DeepSpeed from source that should have those?
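
To make "clean up gracefully" concrete, here is a minimal sketch of the general pattern: a parent process that forwards SIGINT and SIGTERM to the workers it started, so they do not outlive it. This is not DeepSpeed's launcher code. `launch_workers` and its argument are hypothetical names, and the sketch assumes a POSIX system and that it runs in the main thread (Python only allows installing signal handlers there).

```python
import signal
import subprocess


def launch_workers(commands):
    """Start one child per command and forward SIGINT/SIGTERM to all of them.

    Hypothetical illustration of the pattern, not DeepSpeed's implementation.
    Returns the list of worker exit codes.
    """
    workers = []

    def forward(signum, frame):
        # Pass the signal on to every worker that has not been reaped yet.
        for worker in workers:
            if worker.returncode is None:
                worker.send_signal(signum)

    # Install the handlers before spawning, so a signal that arrives
    # during startup is not lost. Remember the old handlers to restore them.
    previous = {
        sig: signal.signal(sig, forward)
        for sig in (signal.SIGINT, signal.SIGTERM)
    }
    try:
        for cmd in commands:
            workers.append(subprocess.Popen(cmd))
        return [worker.wait() for worker in workers]
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)
        # Last resort: do not leave anything running if we exit abnormally.
        for worker in workers:
            if worker.poll() is None:
                worker.kill()
                worker.wait()
```

Note that a handler like this only helps for signals that can be caught. SIGKILL (kill -9) cannot be handled, so if the parent is killed that way its workers can still be left behind.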

loadams avatar Aug 18 '23 16:08 loadams

The PR that added SIGINT and SIGTERM support has been merged. Since the original issue is old, I'm closing this. If someone with this issue can open a new issue, link this one, and provide repro steps, I'd love to get this fixed.

loadams avatar Aug 21 '23 17:08 loadams