
[Checkpoint] Fix symlink issue where symlink file uploaded before checkpoint files upload

Open bigning opened this issue 1 year ago • 2 comments

What does this PR do?

Fixes the symlink issue. How? [updated]: in the checkpoint saver, rank 0 (the rank that saves the symlink) all_gathers the remote checkpoint file names and starts a new process that checks whether those remote files have finished uploading by calling object_store.get_object_size. The symlink file is uploaded only once all the remote files have finished uploading. This way:

  1. it won't block training, since the check runs in a separate process
  2. the symlink file is uploaded with almost no delay (the process sleeps 30s if a remote file is not there yet)
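
The polling logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `InMemoryStore`, `wait_then_upload_symlink`, `max_polls`, and the injected `sleep` parameter are hypothetical names introduced here, and the sketch assumes that `get_object_size` raises `FileNotFoundError` for an object that has not been uploaded yet.

```python
import time
from typing import Callable, Dict, Iterable, Optional


class InMemoryStore:
    """Hypothetical stand-in for a remote object store (illustration only)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def get_object_size(self, name: str) -> int:
        # Assumption: a missing object is reported via FileNotFoundError.
        if name not in self.objects:
            raise FileNotFoundError(name)
        return len(self.objects[name])

    def upload(self, name: str, data: bytes) -> None:
        self.objects[name] = data


def wait_then_upload_symlink(
    store: InMemoryStore,
    remote_files: Iterable[str],
    symlink_name: str,
    symlink_body: bytes,
    poll_seconds: float = 30.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Upload the symlink only after every remote file is visible.

    Returns True if the symlink was uploaded, False if max_polls was
    reached first (so a failed upload on one rank cannot block forever).
    """
    pending = set(remote_files)
    polls = 0
    while pending:
        # Drop every file whose size can already be queried.
        for name in list(pending):
            try:
                store.get_object_size(name)
            except FileNotFoundError:
                continue
            pending.discard(name)
        if not pending:
            break
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        # Some file is still missing: wait before checking again.
        sleep(poll_seconds)
    store.upload(symlink_name, symlink_body)
    return True
```

In the design described here, a check like this runs in a separate process on rank 0 so that training is not blocked; the sketch keeps it as a plain function so it can be run and tested directly.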

Unit test

Integration test

2-nodes OCI:

save: test-uploader-0Tkv9O autoresume: test-uploader-yac88U

2-nodes mflow:

save: l38bi-full-sweep-train-bb-1-0e-6-5-VMY2Xo load: l38bi-full-sweep-train-bb-1-0e-6-5-TyRcPi

Daily test:

https://github.com/mosaicml/composer/actions/runs/9700144963

composer regression test:

https://github.com/databricks-mosaic/regression-testing/actions/runs/9700161085

Perf test (100 batches with a save interval of 9 batches). Training time varies because upload speed is unstable; the goal is only to confirm that this change does not regress training time.

64 GPU test: 77b-bs1024-g512-res2-f60-37gSXK time: 26 minutes

64 GPU baseline: 77b-bs1024-g512-res2-f60-2VZPxB time: more than 40 minutes because of an upload delay on rank 29

If one rank's upload fails, the run does not hang:

test-uploader-MXcoEp

bigning, Jun 06 '24 16:06