DeepSpeed: Deepspeed not killing process if script exits due to error?

Open Santosh-Gupta opened this issue 3 years ago • 2 comments

I've been debugging my code to run with DeepSpeed, and I started noticing CUDA out-of-memory errors. I checked nvidia-smi and saw that most of my GPU memory was still in use. I then checked my Python processes and saw a bunch of instances of my DeepSpeed script, I think one for each attempt at running it.

https://snipboard.io/R9DVu3.jpg

I had to stop each one directly using kill -9. After that, the GPU memory was released.
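
For anyone hitting the same thing, the manual cleanup described above can be scripted. What follows is only a minimal sketch, assuming a Linux machine with the standard ps command; the function names find_matching_pids and kill_leftovers are hypothetical helpers, not part of DeepSpeed, and "train.py" in the usage note is a placeholder script name.

```python
import os
import signal
import subprocess
from typing import List


def find_matching_pids(ps_output: str, pattern: str) -> List[int]:
    """Return PIDs from `ps -eo pid,args` output whose command line contains pattern."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, args = int(parts[0]), parts[1]
        # Never report the process that is doing the cleanup.
        if pattern in args and pid != os.getpid():
            pids.append(pid)
    return pids


def kill_leftovers(pattern: str, sig: int = signal.SIGKILL) -> List[int]:
    """Send `sig` (default SIGKILL, like kill -9) to every matching process."""
    ps_output = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    ).stdout
    pids = find_matching_pids(ps_output, pattern)
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the process exited between listing and killing
    return pids
```

Calling kill_leftovers("train.py") would then signal every process whose command line contains that string. The match is a plain substring test, so choose a pattern specific enough that it cannot hit unrelated processes.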

Apr 29 '21 02:04 Santosh-Gupta

It looks like just running one deepspeed script results in around 40 processes being created.

Apr 29 '21 06:04 Santosh-Gupta

I have the same problem when I use deepspeed.init_inference. Did you solve the problem?

May 09 '23 16:05 Luowaterbi

I have the same issue with a larger Hugging Face model (30B) using deepspeed.init_inference, but no problem with a smaller model (6.7B).

Jul 03 '23 03:07 allanj

Hi @allanj - do you know what signal is being sent to your processes? We just added support to clean up gracefully for SIGINT and SIGTERM. Could you try with the latest DeepSpeed from source that should have those?
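
To illustrate the general idea of cleaning up on SIGINT and SIGTERM, here is a minimal sketch of a launcher that stops all of its worker processes when it receives either signal, or when any worker exits with an error. This is not DeepSpeed's actual implementation; run_workers and stop_all are hypothetical names, and the sketch assumes it is called from the main thread, since Python only allows signal handlers to be installed there.

```python
import signal
import subprocess
import sys
import time


def stop_all(procs, grace_seconds=5.0):
    """Ask every live worker to exit, then force-kill any that ignore the request."""
    for p in procs:
        if p.poll() is None:
            p.terminate()  # sends SIGTERM on POSIX
    deadline = time.monotonic() + grace_seconds
    for p in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            p.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            p.kill()  # sends SIGKILL on POSIX
            p.wait()


def run_workers(commands, poll_interval=0.1):
    """Run one subprocess per command; return 0 or the first non-zero exit code."""
    procs = [subprocess.Popen(cmd) for cmd in commands]

    def handle_signal(signum, frame):
        # Clean up the workers before the launcher itself exits.
        stop_all(procs)
        sys.exit(128 + signum)

    # Remember the previous handlers so they can be restored afterwards.
    previous = {
        s: signal.signal(s, handle_signal)
        for s in (signal.SIGINT, signal.SIGTERM)
    }
    try:
        while True:
            codes = [p.poll() for p in procs]
            failed = [c for c in codes if c not in (None, 0)]
            if failed:
                # One worker crashed: do not leave the others holding GPU memory.
                stop_all(procs)
                return failed[0]
            if all(c == 0 for c in codes):
                return 0
            time.sleep(poll_interval)
    finally:
        for s, handler in previous.items():
            signal.signal(s, handler)
```

Note that SIGKILL cannot be caught by any handler, so if the launcher itself is killed with kill -9, no cleanup code runs and the workers have to be stopped by hand.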

Aug 18 '23 16:08 loadams

Merged a PR that added SIGINT and SIGTERM support. Since the original issue is old, I'm closing this. If someone with this issue can open a new issue, link this one, and provide repro steps, I'd love to get this fixed.

Aug 21 '23 17:08 loadams
