accelerate Let the launched script handle KeyboardInterrupt

Is your feature request related to a problem? Please describe. As part of my training script, the model checkpoint is compressed via 7zip. In my code I handle KeyboardInterrupt so I can skip the compression if necessary (it takes a long time). When I press Ctrl+C, the signal isn't caught by my script; it's caught by Accelerate, which exits.

Describe the solution you'd like Accelerate should not interfere with signals and should pass them to the child process for it to decide a course of action. If the child process exits, Accelerate does too, otherwise it keeps running.
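To make the request concrete, here is a minimal sketch of the behaviour described above, assuming a single child process on a POSIX system. It is not how Accelerate's launcher is implemented; `launch_and_forward` is a hypothetical name used only for illustration.

```python
import signal
import subprocess
import sys


def launch_and_forward(cmd):
    """Run cmd as a child and relay SIGINT/SIGTERM to it instead of exiting.

    Hypothetical helper for illustration; not part of Accelerate.
    """
    # POSIX-only assumption: a new session keeps the terminal's Ctrl+C from
    # reaching the child directly, so it only sees the signal relayed below.
    child = subprocess.Popen(cmd, start_new_session=True)

    def forward(signum, frame):
        # Pass the signal on and let the child decide what to do with it.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the forwarding handler, remembering the previous handlers.
    previous = {
        sig: signal.signal(sig, forward)
        for sig in (signal.SIGINT, signal.SIGTERM)
    }
    try:
        # The launcher keeps running until the child itself exits.
        return child.wait()
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

With this shape, something like `sys.exit(launch_and_forward([sys.executable, "test.py"]))` would exit only when the script does, and with the script's exit code. A real launcher manages several worker processes, so this covers only the single-child case.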

Describe alternatives you've considered I tried using signal to set my own signal handler, which didn't work. Here's what happens:

Run the script via accelerate launch test.py

test.py is as follows:

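import signal
import time

def signal_handler(signum, frame):
    # Runs in the child script when SIGINT (Ctrl+C) is delivered.
    print('got ctrl+c')

# Replace the default SIGINT handler, which raises KeyboardInterrupt.
signal.signal(signal.SIGINT, signal_handler)

while True:
    print('loop')
    time.sleep(1)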

This test is interesting because:

got ctrl+c is printed
Accelerate exits but the child script does not
loop keeps printing in the terminal until you do kill [pid of child]

Additional context None

Dec 16 '22 22:12 Cyberes

cc @muellerzr

Dec 19 '22 08:12 sgugger

Is there any solution to this problem? I need a similar feature in my model training loop, where Ctrl+C is used as a stop signal for all processes, after which evaluation should run on the current model parameters.
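A rough sketch of that pattern inside a single process, assuming the signal actually reaches the script: the handler only sets a flag, and the loop checks it between steps so that evaluation runs on a consistent state. `StopRequested`, `train`, `train_step` and `evaluate` are placeholder names, not Accelerate APIs.

```python
import signal


class StopRequested:
    """Signal handler that only records that a stop was requested."""

    def __init__(self):
        self.flag = False

    def __call__(self, signum, frame):
        # Do no real work here; the training loop reacts at a safe point.
        self.flag = True


def train(num_steps, train_step, evaluate, stop):
    """Run up to num_steps steps, stop early if requested, then evaluate."""
    completed = 0
    for step in range(num_steps):
        if stop.flag:
            break  # leave the loop between steps, never mid-step
        train_step(step)
        completed += 1
    # Evaluation runs whether the loop finished or was interrupted.
    return completed, evaluate()
```

The handler would be installed once at startup with `stop = StopRequested()` followed by `signal.signal(signal.SIGINT, stop)`. With several processes, each one sets its own flag and they may notice it at different steps, so the flag would probably need to be synchronized across processes before breaking out of the loop.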

Jun 29 '23 04:06 XuHwang

Now that we require a minimum of Python 3.8, this is on the roadmap to look into, due to new capabilities in the language.

Jun 29 '23 12:06 muellerzr

Noticed this as well. If you use ctrl+c after launching with accelerate launch you basically have to go into htop and look for the children as they are still hogging GPU+RAM. Seems like there should be a better way to exit if you use accelerate launch.

Aug 10 '23 02:08 grahamannett

You'll still find this via torchrun too I believe, so I'm not sure how much we can truly do

Aug 10 '23 02:08 muellerzr

@muellerzr I'm guessing you know much more than me, but could you track the child process IDs and atexit.register some hook that kills them? I'm not aware of how torchrun works under the hood, so maybe that's not entirely feasible.

Aug 10 '23 22:08 grahamannett
