llm-foundry
fix retry crash loop
When a run is killed and automatically restarted after training has already completed, it gets stuck in a retry loop because Composer raises an error when there is no training left to do. This PR short-circuits to the checkpoint upload step when that error is detected during a retry.
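As a rough illustration of the short-circuit described above, here is a minimal sketch. It is not the code in this PR: `run_with_short_circuit`, `NO_TRAINING_MARKER`, the `is_retry` flag, and the choice of `ValueError` are all hypothetical stand-ins, since the exact Composer error and the retry-detection mechanism are not spelled out here.

```python
from typing import Callable

# Hypothetical marker: the actual Composer error text is not given in this PR description.
NO_TRAINING_MARKER = "no training will happen"


def run_with_short_circuit(
    fit: Callable[[], None],
    upload_checkpoints: Callable[[], None],
    is_retry: bool,
) -> str:
    """Run training, then upload checkpoints.

    Returns "trained" if fit() completed, or "skipped" if the no-training
    error was detected on a retry and training was bypassed.
    """
    try:
        fit()
        outcome = "trained"
    except ValueError as err:  # assumed exception type, for illustration only
        # Swallow the error only on a retry, and only when it matches the marker.
        if not (is_retry and NO_TRAINING_MARKER in str(err).lower()):
            raise
        outcome = "skipped"
    # Reached both after normal training and after the short-circuit.
    upload_checkpoints()
    return outcome
```

The error is re-raised on an initial (non-retry) run and for any unrelated failure, so the short-circuit in this sketch cannot mask a genuine training error.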
Manual tests:
milo-gvMjEA (initial)
milo-vsiPgi (retry)