DeepSpeed
DeepSpeed not killing processes if the script exits due to an error?
I've been debugging my code to run with DeepSpeed, and I started noticing CUDA out-of-memory errors. I checked nvidia-smi and saw that most of my GPU memory was still in use. I then checked my Python processes and saw a bunch of instances of my DeepSpeed script, I think one for each attempt at running it.
https://snipboard.io/R9DVu3.jpg
I had to stop each one directly using kill -9. After that, I saw that the GPU memory was released.
It looks like just running one deepspeed script results in around 40 processes being created.
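The manual cleanup described above (finding the leftover script instances and killing each one) can be sketched in Python. This is only an illustration, not part of DeepSpeed: the helper names matching_pids and kill_leftovers are hypothetical, it assumes a POSIX system where ps -eo pid,args is available, and it sends SIGTERM by default rather than kill -9 so that processes get a chance to exit cleanly.

```python
import os
import signal
import subprocess


def matching_pids(ps_output, needle):
    """Return PIDs from `ps -eo pid,args` output whose command line contains needle."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and needle in parts[1]:
            pids.append(int(parts[0]))
    return pids


def kill_leftovers(needle, sig=signal.SIGTERM):
    """Send sig to every process whose command line contains needle, except this one."""
    out = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    ).stdout
    for pid in matching_pids(out, needle):
        if pid == os.getpid():
            continue  # never signal the cleanup script itself
        try:
            os.kill(pid, sig)
        except (ProcessLookupError, PermissionError):
            pass  # process already exited, or it belongs to another user
```

For example, kill_leftovers("train.py") would signal every process whose command line mentions a script named train.py (a placeholder name); passing sig=signal.SIGKILL corresponds to kill -9 and should be a last resort.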
I have the same problem when I use deepspeed.init_inference. Did you solve the problem?
I have the same issue with a larger Hugging Face model (i.e. 30B) when using deepspeed.init_inference, but no problem with a smaller model (6.7B).
Hi @allanj - do you know what signal is being sent to your processes? We just added support to clean up gracefully for SIGINT and SIGTERM. Could you try with the latest DeepSpeed from source that should have those?
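To illustrate what graceful cleanup on SIGINT and SIGTERM can look like in general, here is a minimal sketch of a parent process that starts workers and tears them all down when it is signalled or when any worker fails. It is not DeepSpeed's launcher code; run_workers and terminate_all are hypothetical names and the polling approach is an assumption made for brevity. Note that SIGKILL (kill -9) cannot be caught, so a parent killed that way gets no chance to clean up its children.

```python
import signal
import subprocess
import sys
import time


def terminate_all(children):
    """Ask every still-running child to exit, then wait so none are left behind."""
    for child in children:
        if child.poll() is None:
            child.terminate()  # sends SIGTERM on POSIX
    for child in children:
        child.wait()


def run_workers(commands, poll_interval=0.1):
    """Run each command as a child; stop all of them on failure or on SIGINT/SIGTERM."""
    children = [subprocess.Popen(cmd) for cmd in commands]

    def on_signal(signum, frame):
        terminate_all(children)
        sys.exit(128 + signum)

    # Must be called from the main thread of the parent process.
    signal.signal(signal.SIGINT, on_signal)
    signal.signal(signal.SIGTERM, on_signal)

    while True:
        codes = [child.poll() for child in children]
        failed = [code for code in codes if code not in (None, 0)]
        if failed:
            # One worker crashed: do not leave the others holding GPU memory.
            terminate_all(children)
            return failed[0]
        if all(code == 0 for code in codes):
            return 0
        time.sleep(poll_interval)
```

With this pattern, a worker that exits with an error causes the remaining workers to be terminated instead of lingering, which is the behaviour the original report was missing.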
The PR that added SIGINT and SIGTERM support has been merged. Since the original issue is old, I'm closing this. If someone with this issue can open a new issue, link this one, and provide repro steps, I'd love to get this fixed.