
feat: save the model and stop training based on `exit-duration-in-mins`

Open SaulLu opened this issue 3 years ago • 2 comments

As discussed a long time ago in a meeting, it would be really great to have a feature that saves the model and stops training after a certain time, since jobs on the JZ cluster are limited to 20 hours.

For example, the architecture and scaling working group added the `exit-duration-in-mins` argument to Megatron-DeepSpeed, the library they use to run training.
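To make the request concrete, here is a minimal sketch of what a time-based exit could look like in a generic training loop. It is not the Megatron-DeepSpeed implementation and not code from this repo: `train_with_time_limit`, `train_step`, `save_checkpoint`, and the injectable `clock` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Optional


def train_with_time_limit(
    train_step: Callable[[int], None],
    save_checkpoint: Callable[[int], None],
    max_steps: int,
    exit_duration_in_mins: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run up to max_steps; save and stop early once the time budget is spent.

    All names here are hypothetical. Returns the number of completed steps.
    """
    start = clock()
    for step in range(max_steps):
        train_step(step)
        if exit_duration_in_mins is not None:
            # Check the wall-clock budget after every completed step.
            elapsed_mins = (clock() - start) / 60.0
            if elapsed_mins >= exit_duration_in_mins:
                # Budget spent: save the model, then stop training.
                save_checkpoint(step + 1)
                return step + 1
    # Training finished within the budget (or no budget was set).
    save_checkpoint(max_steps)
    return max_steps
```

With a 20-hour (1200-minute) job limit, the budget would presumably be set somewhat below 1200 so that the final save has time to complete before the job is killed.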


related: #37 (#42)

SaulLu avatar Nov 23 '21 15:11 SaulLu

Does #42 serve half of the purpose (saving the model)?

shanyas10 avatar Nov 23 '21 15:11 shanyas10

Indeed, your PR #42 is also really useful (it should be merged; I sent you a private message about this).

What I have in mind with this issue is more to trigger the save after a certain time, as the jobs on JZ are limited to 20h. If I'm not mistaken, that is not included in your current PR #42, right?

SaulLu avatar Nov 23 '21 16:11 SaulLu