composer
[Checkpoint] Fix symlink issue where the symlink file is uploaded before the checkpoint files finish uploading
What does this PR do?
Fixes the issue where the symlink file could be uploaded before the checkpoint files it points to.
How?
[updated]: In the checkpoint saver, rank 0 (the rank that saves the symlink) all_gathers the remote checkpoint file names and starts a new process that checks whether those remote files have finished uploading by calling object_store.get_object_size. It uploads the symlink file only once all the remote files have finished uploading. This way:
- it does not block training, since the check runs in a separate process
- it adds almost no delay to uploading the symlink file (it sleeps 30s if a remote file is not there yet)
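
The polling step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `wait_for_remote_files`, `upload_symlink_when_ready`, `upload_symlink`, and the `max_polls` cap are hypothetical names, and `get_object_size` is assumed to raise `FileNotFoundError` while a remote file is not yet present.

```python
import time
from typing import Callable, Iterable


def wait_for_remote_files(
    get_object_size: Callable[[str], int],
    remote_file_names: Iterable[str],
    poll_interval_s: float = 30.0,
    max_polls: int = 120,  # hypothetical cap so the checker cannot loop forever
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once every remote file is visible, False if polling gives up."""
    pending = set(remote_file_names)
    for _ in range(max_polls):
        for name in list(pending):
            try:
                # Assumption: raises FileNotFoundError until the upload finishes.
                get_object_size(name)
            except FileNotFoundError:
                continue
            pending.discard(name)
        if not pending:
            return True
        # Some file is still missing: wait before checking again.
        sleep(poll_interval_s)
    return False


def upload_symlink_when_ready(
    get_object_size: Callable[[str], int],
    upload_symlink: Callable[[], None],  # hypothetical callback that uploads the symlink
    remote_file_names: Iterable[str],
    **poll_kwargs,
) -> bool:
    """Upload the symlink only after all checkpoint files are present remotely."""
    ready = wait_for_remote_files(get_object_size, remote_file_names, **poll_kwargs)
    if ready:
        upload_symlink()
    return ready
```

In this sketch, rank 0 would pass the gathered remote file names to `upload_symlink_when_ready` and run it in a separate process so the training loop is not blocked.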
Unit test
Integration test
2-nodes OCI:
save: test-uploader-0Tkv9O autoresume: test-uploader-yac88U
2-nodes mflow:
save: l38bi-full-sweep-train-bb-1-0e-6-5-VMY2Xo load: l38bi-full-sweep-train-bb-1-0e-6-5-TyRcPi
Daily test:
https://github.com/mosaicml/composer/actions/runs/9700144963
composer regression test:
https://github.com/databricks-mosaic/regression-testing/actions/runs/9700161085
Perf test (100 batches with a 9-batch save interval. Training time varies because of unstable upload speed; the goal is only to confirm that this change does not regress training time)
64 GPU test: 77b-bs1024-g512-res2-f60-37gSXK time: 26 minutes
64 GPU baseline: 77b-bs1024-g512-res2-f60-2VZPxB time: more than 40 minutes because of an upload delay on rank 29
If the upload on one rank fails, the run does not hang:
test-uploader-MXcoEp