composer
Can a checkpoint be saved automatically on a crash or when Ctrl+C is pressed?
🚀 Feature Request
Can I save the latest checkpoint when training crashes or when I press Ctrl+C?
@BoyuanJiang, hey, I think we do support this, but it's not very well documented. In particular, I think if you press Ctrl+C once, the model should checkpoint. Let me know if that doesn't work.
@bcui19 it seems that when Ctrl+C is pressed, this code is executed, and it just kills the program without saving the latest state.
@BoyuanJiang in that code snippet, the following should keep the processes alive for the timeout length: https://github.com/mosaicml/composer/blob/dev/composer/cli/launcher.py#L406-L416
Do you see all ranks die immediately if you hit Ctrl + C once?
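For readers following along, here is a minimal sketch of the "hold for a timeout, then kill" pattern being described. It is not the Composer launcher code; the function name `terminate_with_grace` and its parameters are hypothetical, and it assumes a POSIX system where `SIGTERM` is delivered to child processes.

```python
import signal
import subprocess
import sys
import time


def terminate_with_grace(procs, timeout_s):
    """Ask every live process to stop, wait up to timeout_s, then kill stragglers.

    Hypothetical helper for illustration only; not part of Composer.
    """
    # First pass: politely request shutdown so each rank can clean up.
    for p in procs:
        if p.poll() is None:
            p.send_signal(signal.SIGTERM)

    # Second pass: all processes share one deadline (the grace period).
    deadline = time.monotonic() + timeout_s
    for p in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            p.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            # Grace period is over: force-kill whatever is still running.
            p.kill()
            p.wait()
    return [p.returncode for p in procs]


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(terminate_with_grace([child], timeout_s=5.0))
```

The grace period only helps if each rank actually uses that window to write its state; otherwise the processes just exit a little later.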
Yes, it holds for 30 seconds and then all ranks are killed. But I am not sure which line of code saves the latest state to a checkpoint during those 30 seconds?
https://github.com/mosaicml/composer/blob/dev/composer/callbacks/checkpoint_saver.py#L307-L313 On close, it should try to flush a checkpoint. This hasn't been extensively tested, though.
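As a rough illustration of the "flush a checkpoint on close" idea, here is a generic sketch that saves state in a `finally` block, so it runs when Ctrl+C raises `KeyboardInterrupt` or when a step crashes. This is not the `checkpoint_saver.py` implementation; `save_checkpoint`, `train`, and `step_fn` are hypothetical names, and the state is a plain dict written as JSON for simplicity. A `SIGTERM` sent by a launcher would not trigger it unless a signal handler converts that signal into an exception.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state atomically so an interrupted save never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the finished temp file into place in one step.
    os.replace(tmp_path, path)


def train(num_steps, checkpoint_path, step_fn):
    """Run step_fn for num_steps and always save the latest state on exit."""
    state = {"step": 0}
    try:
        for step in range(num_steps):
            step_fn(step)
            # Only count a step once it has fully completed.
            state["step"] = step + 1
    finally:
        # Runs on normal completion, on Ctrl+C, and on any exception.
        save_checkpoint(state, checkpoint_path)
    return state
```

The interrupt or crash still propagates to the caller after the save, so the process exits as before; the only difference is that the last completed step is on disk.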