
Let the launched script handle KeyboardInterrupt

Open Cyberes opened this issue 2 years ago • 6 comments

Is your feature request related to a problem? Please describe. As part of my training script, the model checkpoint is compressed via 7zip. My code handles KeyboardInterrupt so I can skip the compression if necessary (it takes a long time). When I press Ctrl+C, the signal isn't caught by my script; it's caught by Accelerate, which exits.

Describe the solution you'd like Accelerate should not interfere with signals and should pass them to the child process for it to decide a course of action. If the child process exits, Accelerate does too, otherwise it keeps running.
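To make the requested behavior concrete, here is a minimal sketch of a launcher that forwards signals to its child and exits only when the child does. It is a generic illustration, not Accelerate's implementation: `run_and_forward` is a hypothetical name, and the sketch assumes a POSIX system and a single child process.

```python
import signal
import subprocess


def run_and_forward(cmd):
    """Run cmd as a child, forward SIGINT/SIGTERM to it, return its exit code."""
    # Assumption: start_new_session=True puts the child in its own session, so a
    # terminal Ctrl+C reaches only this launcher, which forwards it explicitly.
    child = subprocess.Popen(cmd, start_new_session=True)

    def forward(signum, frame):
        # Pass the signal on and let the child decide what to do with it.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the forwarding handlers, remembering the previous ones.
    previous = {
        sig: signal.signal(sig, forward)
        for sig in (signal.SIGINT, signal.SIGTERM)
    }
    try:
        # Keep running for as long as the child does.
        return child.wait()
    finally:
        # Restore whatever handlers were installed before.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

Called as `sys.exit(run_and_forward([sys.executable, "test.py"]))`, the launcher would stay alive after Ctrl+C if the script's own handler chooses to continue, and would exit with the script's return code once it stops.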

Describe alternatives you've considered I tried using signal to set my own signal handler, which didn't work. Here's what happens:

Run the script via accelerate launch test.py

test.py is as follows:

import signal
import time

def signal_handler(signum, frame):
    # Runs in the child script when SIGINT (Ctrl+C) arrives
    print('got ctrl+c')

signal.signal(signal.SIGINT, signal_handler)

while True:
    print('loop')
    time.sleep(1)

This test is interesting because:

  1. got ctrl+c is printed
  2. Accelerate exits but the child script does not
  3. loop keeps printing in the terminal until you do kill [pid of child]

Additional context None

Cyberes avatar Dec 16 '22 22:12 Cyberes

cc @muellerzr

sgugger avatar Dec 19 '22 08:12 sgugger

Are there any solutions to this problem? I need a similar feature in my model training loop, where Ctrl+C is used as a stop signal for all processes, after which evaluation should run on the current model parameters.

XuHwang avatar Jun 29 '23 04:06 XuHwang

Now that we require a minimum of Python 3.8, this is on the roadmap to look into due to new capabilities in the language.

muellerzr avatar Jun 29 '23 12:06 muellerzr

Noticed this as well. If you use ctrl+c after launching with accelerate launch you basically have to go into htop and look for the children as they are still hogging GPU+RAM. Seems like there should be a better way to exit if you use accelerate launch.

grahamannett avatar Aug 10 '23 02:08 grahamannett

You'll still find this via torchrun too I believe, so I'm not sure how much we can truly do

muellerzr avatar Aug 10 '23 02:08 muellerzr

@muellerzr I'm guessing you know much more than me, but could you track the child process IDs and atexit.register some hook that kills them? I'm not aware of how torchrun works under the hood, so maybe that's not entirely feasible.

grahamannett avatar Aug 10 '23 22:08 grahamannett