composer
Add distributed sync during wait_for_workers to avoid timeout for large checkpoints
What does this PR do?
Adds a distributed sync to the RemoteUploaderDownloader.wait_for_workers call so that the run does not hit an NCCL timeout while uploading a large checkpoint at the end of a run.
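For reviewers who want the idea in miniature, here is a rough sketch of the pattern, not the code in this PR. It assumes the sync is a small collective issued on every poll iteration, so that no rank sits in a single long-pending collective while another rank is still uploading. The names wait_for_workers_with_sync, local_workers_alive, and all_reduce_max are hypothetical; in a real multi-process run the reduction could be torch.distributed.all_reduce with ReduceOp.MAX.

```python
import time
from typing import Callable


def wait_for_workers_with_sync(
    local_workers_alive: Callable[[], bool],
    all_reduce_max: Callable[[int], int],
    poll_interval: float = 0.0,
) -> int:
    """Block until no rank reports live upload workers.

    Hypothetical helper for illustration only. Returns the number of
    sync rounds that were needed.
    """
    rounds = 0
    while True:
        # Each rank contributes 1 if it still has uploads in flight, else 0.
        local_flag = 1 if local_workers_alive() else 0
        # Every rank takes part in this small collective on every iteration,
        # so each collective completes quickly even if the upload is slow.
        any_alive = all_reduce_max(local_flag)
        rounds += 1
        if any_alive == 0:
            return rounds
        time.sleep(poll_interval)


# Single-process stand-in: the "collective" is the identity function and
# the local workers stay alive for two polls before finishing.
states = iter([True, True, False])
rounds = wait_for_workers_with_sync(lambda: next(states), lambda flag: flag)
print(rounds)
```

Because the loop exits only when the reduced flag is 0, all ranks leave it on the same iteration, which keeps later collectives aligned across ranks.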
Manual test:
composer train.py yamls/finetune/mpt-7b_dolly_sft.yaml max_duration=2ba global_train_batch_size=4 device_train_microbatch_size=1 save_folder=oci://mosaicml-internal-checkpoints/daniel/checkpoints/{run_name} eval_subset_num_batches=2 eval_first=False optimizer.name=decoupled_lionw model.pretrained=False dist_timeout=20 loggers.wandb={} model.config_overrides.d_model=256 model.config_overrides.n_layers=4 run_name=test-run
Before this PR, the run hits an NCCL timeout during FIT_END. After this PR, it finishes uploading the checkpoint, runs the final eval, and exits cleanly.
What issue(s) does this change relate to?
Closes CO-2176
Before submitting
- [x] Have you read the contributor guidelines?
- [x] Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
- [x] Did you update any related docs and document your change?
- [x] Did you update any related tests and add any new tests related to your change? (see testing)
- [x] Did you run the tests locally to make sure they pass?
- [x] Did you run pre-commit on your change? (see the pre-commit section of prerequisites)