llm-foundry

fix retry crash loop

Open milocress opened this issue 10 months ago • 0 comments

When a run is killed and automatically restarted after training has already completed, it gets stuck in a retry loop: Composer raises an error because there is no training left to do, so the run fails and is retried again. This PR detects that error during a retry and short-circuits to the checkpoint upload step.
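The control flow described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `NothingToTrainError`, `run_with_short_circuit`, and the `fit`, `upload_checkpoint`, and `is_retry` parameters are all hypothetical names, and the real Composer error type is not shown here.

```python
class NothingToTrainError(RuntimeError):
    """Hypothetical stand-in for the error Composer raises when no training remains."""


def run_with_short_circuit(fit, upload_checkpoint, is_retry):
    """Run training, then upload the checkpoint.

    If this is a retry and training has nothing left to do, skip training
    and go straight to the upload step instead of failing again.
    Returns "trained" if training ran, "skipped" if the short-circuit fired.
    """
    try:
        fit()
        outcome = "trained"
    except NothingToTrainError:
        if not is_retry:
            # On a first attempt the error is unexpected, so surface it.
            raise
        # On a retry, training already finished in the earlier attempt.
        outcome = "skipped"
    upload_checkpoint()
    return outcome
```

Re-raising on a first attempt keeps genuine misconfigurations visible; only a retry treats the error as a sign that training already finished.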

Manual tests:

  • milo-gvMjEA (initial)
  • milo-vsiPgi (retry)

milocress, Dec 09 '24 19:12