
Can a checkpoint be saved automatically on a crash or when Ctrl+C is pressed?

Open BoyuanJiang opened this issue 2 years ago • 5 comments

🚀 Feature Request

Can I save the latest checkpoint when training crashes or when I press Ctrl+C?

Motivation

[Optional] Implementation

Additional context

BoyuanJiang avatar Aug 22 '23 12:08 BoyuanJiang

@BoyuanJiang, hey, I think we do support this, but it's not very well documented. In particular, I think that if you press Ctrl+C once, the model should checkpoint. Let me know if that doesn't work.

bcui19 avatar Aug 23 '23 18:08 bcui19

@bcui19 it seems that when I press Ctrl+C, this code is executed, and it just kills the program without saving the latest state.

BoyuanJiang avatar Aug 31 '23 15:08 BoyuanJiang

@BoyuanJiang in that code snippet, the following should keep the processes alive for the timeout length: https://github.com/mosaicml/composer/blob/dev/composer/cli/launcher.py#L406-L416

Do you see all ranks die immediately if you hit Ctrl + C once?

mvpatel2000 avatar Aug 31 '23 18:08 mvpatel2000

Yes, it holds for 30 seconds and then all ranks are killed. But I am not sure which line of code saves the latest state to a checkpoint during those 30 seconds.

BoyuanJiang avatar Sep 01 '23 11:09 BoyuanJiang
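
For readers following along, the "hold for a timeout, then kill" behavior described above can be sketched generically as below. This is not the code in `launcher.py`; the function name `stop_with_grace` and the use of `subprocess.Popen` are assumptions made for illustration. It only shows the pattern of asking a process to stop, waiting up to a timeout, and force-killing it if it is still alive.

```python
import subprocess


def stop_with_grace(proc, timeout):
    """Ask a child process to stop, wait up to `timeout` seconds, then force-kill.

    Hypothetical helper for illustration; not part of Composer.
    """
    # Polite request first (SIGTERM on POSIX), so the child has a chance
    # to run its own cleanup, such as writing a checkpoint.
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
        return "exited"
    except subprocess.TimeoutExpired:
        # Grace period is over: force-kill (SIGKILL on POSIX) and reap the child.
        proc.kill()
        proc.wait()
        return "killed"
```

Under this pattern, any checkpointing has to happen inside the child process, in response to the stop request and before the timeout expires.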

https://github.com/mosaicml/composer/blob/dev/composer/callbacks/checkpoint_saver.py#L307-L313 On close, it should try to flush a checkpoint. This hasn't been extensively tested, though.

mvpatel2000 avatar Sep 05 '23 18:09 mvpatel2000
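
As a fallback in user code, the general idea of "flush a checkpoint on close" can be sketched without relying on any Composer internals. Everything below is an assumption for illustration: `train`, `save_checkpoint`, `StopRequested`, and the JSON state format are hypothetical, and a real training script would save model and optimizer state instead. The sketch treats SIGTERM like Ctrl+C and writes the latest state in a `finally` block, so it also runs when the loop crashes.

```python
import json
import os
import signal
import tempfile
import threading


class StopRequested(Exception):
    """Raised inside the training loop when SIGTERM arrives (hypothetical)."""


def _raise_stop(signum, frame):
    raise StopRequested(f"received signal {signum}")


def save_checkpoint(state, path):
    # Write to a temporary file and rename it, so an interrupt during the
    # write cannot leave a half-written checkpoint at `path`.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(num_steps, checkpoint_path, step_fn):
    state = {"step": 0}
    # Signal handlers can only be installed from the main thread.
    in_main = threading.current_thread() is threading.main_thread()
    previous = signal.signal(signal.SIGTERM, _raise_stop) if in_main else None
    try:
        while state["step"] < num_steps:
            step_fn(state)
            state["step"] += 1
    except (KeyboardInterrupt, StopRequested):
        # Ctrl+C or SIGTERM: stop quietly; the checkpoint is written below.
        pass
    finally:
        if in_main and previous is not None:
            signal.signal(signal.SIGTERM, previous)
        # Runs on normal completion, on interrupt, and on a crash
        # (a crash is re-raised after the checkpoint is written).
        save_checkpoint(state, checkpoint_path)
    return state
```

A call such as `train(1000, "latest.json", my_step)` would leave `latest.json` holding the last completed step count whether the run finishes, is interrupted, or raises. The save has to complete within whatever grace period the launcher allows, so a checkpoint that takes longer than the timeout to write would still be lost.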