metaseq Rewrite of the load_checkpoint function

This is a rewrite of how we determine which checkpoint to load when starting or restarting a training run. (Originally there was also a refactor of how our different checkpoint paths are processed, but I separated that out for now.)

Previously the logic for this was quite brittle, with edge cases where metaseq would not load the correct checkpoint. For an example, see https://github.com/facebookresearch/metaseq/issues/544

The new logic first checks all possible sources for checkpoints (restore-file, finetune-from, local checkpoints, nfs / azure checkpoints), assigns them a priority based on their progress in training, and prefers local caches.

It then takes the most recent checkpoint and copies it to local disk.
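As a rough illustration of the selection step described above, here is a minimal sketch. It is not the code in this PR: the names `Candidate`, `pick_checkpoint`, and `ensure_local` are hypothetical, progress is assumed to be measured in number of updates, and "prefers local caches" is assumed to mean that a local copy wins a tie against a remote copy with the same progress. How restore-file and finetune-from are weighed against the other sources is left out.

```python
import os
import shutil
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one checkpoint found in any source."""
    path: str
    num_updates: int  # assumed measure of training progress
    is_local: bool    # True if the file already sits on local disk


def pick_checkpoint(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the most progress; local copies win ties."""
    if not candidates:
        return None
    # Tuples compare element by element: progress first, then locality
    # (True > False, so a local copy beats a remote one at equal progress).
    return max(candidates, key=lambda c: (c.num_updates, c.is_local))


def ensure_local(candidate: Candidate, local_dir: str) -> str:
    """Copy the chosen checkpoint to local disk unless it is already there.

    Assumes the source is reachable as a normal file path (e.g. an NFS
    mount); a blob-store source would need its own download step.
    """
    if candidate.is_local:
        return candidate.path
    os.makedirs(local_dir, exist_ok=True)
    dest = os.path.join(local_dir, os.path.basename(candidate.path))
    shutil.copy2(candidate.path, dest)
    return dest
```

With this sketch, a remote checkpoint at 200 updates would be chosen over a local one at 100 updates, while two copies at the same update count would resolve to the local one.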

To test this you need both the metaseq and metaseq-internal PRs. The metaseq-internal one is here: https://github.com/fairinternal/metaseq-internal/pull/842

I tested:

multi-node and single-node local start
with and without nfs cloud upload path
crashing a run and correctly resuming it
with and without finetune-from
with and without restore-file
evals

What I haven't tested yet is whether starting from an azure blob path works.

Feb 14 '23 14:02 Xirider

Will test this on azure soon.

Feb 14 '23 16:02 Xirider

Heads up: https://github.com/facebookresearch/metaseq/pull/646 will likely go in first since tests are passing there (after the loss parity check is added). There will probably be merge conflicts afterwards, but hopefully not too bad.

Feb 15 '23 06:02 suchenzang