accelerate
Let the launched script handle KeyboardInterrupt
Is your feature request related to a problem? Please describe.
As part of my training script, the model checkpoint is compressed via 7zip. In my code I handle KeyboardInterrupt so I can skip the compression if necessary (it takes a long time). When I press ctrl+c, the signal isn't caught by my script; it's caught by Accelerate, which exits.
Describe the solution you'd like
Accelerate should not interfere with signals and should pass them to the child process so it can decide on a course of action. If the child process exits, Accelerate does too; otherwise it keeps running.
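To make the requested behaviour concrete, here is a minimal sketch of a launcher that relays signals to its child and only returns once the child has exited. It is not how Accelerate is implemented; `run_forwarding_signals` is a hypothetical name, and the sketch assumes a POSIX system and a single child process.

```python
import signal
import subprocess


def run_forwarding_signals(cmd):
    """Run cmd as a child process and relay SIGINT/SIGTERM to it.

    Hypothetical helper for illustration only, not part of Accelerate.
    Returns the child's exit code once the child has exited.
    """
    # Assumption: POSIX. start_new_session=True moves the child out of the
    # terminal's foreground process group, so a ctrl+c reaches only this
    # parent, which then forwards it exactly once.
    child = subprocess.Popen(cmd, start_new_session=True)

    def forward(signum, frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    forwarded = (signal.SIGINT, signal.SIGTERM)
    # signal.signal returns the previous handler, so it can be restored later.
    previous = {sig: signal.signal(sig, forward) for sig in forwarded}
    try:
        # Keep waiting: the parent exits only when the child decides to.
        return child.wait()
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

A wrapper script could call `sys.exit(run_forwarding_signals([sys.executable, "test.py"]))`. Note that `signal.signal` must be called from the main thread, and that a signal arriving between `Popen` and the handler installation is not forwarded in this sketch.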
Describe alternatives you've considered
I tried using signal to set my own signal handler, which didn't work. Here's what happens:
Run the script via accelerate launch test.py
test.py is as follows:
import signal
import time

def signal_handler(signum, frame):
    # Report the interrupt instead of raising KeyboardInterrupt.
    print('got ctrl+c')

signal.signal(signal.SIGINT, signal_handler)
while True:
    print('loop')
    time.sleep(1)
This test is interesting because:
- got ctrl+c is printed
- Accelerate exits but the child script does not
- loop keeps printing in the terminal until you do kill [pid of child]
Additional context None
cc @muellerzr
Is there any solution to this problem? I need a similar feature in my model training loop, where Ctrl+C is used as a stop signal for all processes, after which evaluation should run on the current model parameters.
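The pattern described here (stop training on Ctrl+C, then evaluate) can be sketched for a single process as below. This assumes the signal actually reaches the script, which is the open question in this issue. `StopFlag`, `train`, `train_step` and `evaluate` are hypothetical names, and in a multi-process run the ranks would additionally need to agree on when to stop.

```python
import signal


class StopFlag:
    """Signal handler that records a stop request instead of raising."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        # Only set a flag; the training loop checks it between steps.
        self.requested = True


def train(num_steps, train_step, evaluate, stop):
    """Run up to num_steps steps, stop early on request, then evaluate.

    train_step and evaluate are placeholders for the user's own functions.
    Returns (number of completed steps, result of evaluate()).
    """
    completed = 0
    for _ in range(num_steps):
        if stop.requested:
            break
        train_step()
        completed += 1
    # Evaluation runs on whatever parameters the model has at this point.
    return completed, evaluate()
```

To use it, create `stop = StopFlag()`, register it with `signal.signal(signal.SIGINT, stop)` in the main thread, and pass `stop` to `train`. Because the handler only sets a flag, the current step always finishes before the loop exits.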
Now that we require a minimum of Python 3.8, this is on the roadmap to look into, thanks to new capabilities in the language.
Noticed this as well. If you use ctrl+c after launching with accelerate launch you basically have to go into htop and look for the children as they are still hogging GPU+RAM. Seems like there should be a better way to exit if you use accelerate launch.
You'll still find this via torchrun too, I believe, so I'm not sure how much we can truly do.
@muellerzr I'm guessing you know much more than me, but could you track the child process IDs and atexit.register a hook that kills them? I'm not aware of how torchrun works under the hood, so maybe that's not entirely feasible.