Rewrite of the load_checkpoint function

Open Xirider opened this issue 3 years ago • 2 comments

This is a rewrite of how we determine which checkpoint to load when starting or restarting a training run. (Originally there was also a refactor of how our different checkpoint paths are processed, but I separated that out for now.)

Previously the logic for this was quite brittle, with edge cases where metaseq would not load the correct checkpoint. For example: https://github.com/facebookresearch/metaseq/issues/544

The new logic first checks all possible sources for checkpoints (restore-file, finetune-from, local checkpoints, nfs / azure checkpoints) and assigns each one a priority based on its progress in training, preferring local caches.

It then takes the most recent checkpoint and copies it to local disk.
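
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `CheckpointCandidate`, `select_checkpoint`, and all field names are hypothetical, and it assumes that training progress is compared first and that a local copy only breaks ties between checkpoints with equal progress.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CheckpointCandidate:
    # Hypothetical container; not metaseq's actual data model.
    path: str
    source: str        # e.g. "restore-file", "finetune-from", "local", "remote"
    num_updates: int   # training progress recorded for this checkpoint
    is_local: bool     # True if the file is already on local disk


def select_checkpoint(
    candidates: List[CheckpointCandidate],
) -> Optional[CheckpointCandidate]:
    """Pick the candidate with the most training progress.

    Assumption: among candidates with equal progress, a local copy wins,
    since it does not need to be downloaded again.
    """
    if not candidates:
        return None
    # Tuples compare element by element: progress first, then locality
    # (True sorts above False).
    return max(candidates, key=lambda c: (c.num_updates, c.is_local))


candidates = [
    CheckpointCandidate("/pretrained/model.pt", "finetune-from", 0, False),
    CheckpointCandidate("/nfs/run/checkpoint_2000.pt", "remote", 2000, False),
    CheckpointCandidate("/local/run/checkpoint_2000.pt", "local", 2000, True),
    CheckpointCandidate("/local/run/checkpoint_1000.pt", "local", 1000, True),
]
best = select_checkpoint(candidates)
```

In this example the local copy at 2000 updates is chosen over the remote copy at the same step, and the finetune-from checkpoint is only used when nothing further along exists. The copy-to-local-disk step is left out of the sketch.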

To test this you need both the metaseq and metaseq-internal PRs. The metaseq-internal PR is here: https://github.com/fairinternal/metaseq-internal/pull/842

I tested:

  • multi-node and single node local start
  • with and without nfs cloud upload path
  • crashing a run and correctly resuming it
  • with and without finetune-from
  • with and without resume-file
  • evals

What I haven't tested yet is whether starting with an azure blob path works.

Xirider, Feb 14 '23 14:02

Will test this on azure soon.

Xirider, Feb 14 '23 16:02

Heads up: https://github.com/facebookresearch/metaseq/pull/646 will likely go in first since tests are passing there (after loss parity check is added). There will probably be merge conflicts after, but hopefully not too bad.

suchenzang, Feb 15 '23 06:02