Add distributed sync during wait_for_workers to avoid timeout for large checkpoints

Open dakinggg opened this issue 2 years ago • 0 comments

What does this PR do?

Adds a distributed sync to the RemoteUploaderDownloader.wait_for_workers call so that the run does not hit an NCCL timeout while uploading a large checkpoint at the end of a run.
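To illustrate the idea, here is a minimal sketch, not the code in this PR. It assumes the sync works by having every rank repeatedly join a short collective while uploads are still in flight, so that no rank sits in one long-running collective past the timeout. The names `wait_for_workers_with_sync`, `local_upload_done`, and `all_reduce_min` are hypothetical; in a real run `all_reduce_min` could wrap `torch.distributed.all_reduce` with `ReduceOp.MIN`.

```python
import time
from typing import Callable


def wait_for_workers_with_sync(
    local_upload_done: Callable[[], bool],
    all_reduce_min: Callable[[int], int],
    poll_interval_s: float = 1.0,
) -> int:
    """Block until every rank reports that its uploads have finished.

    Hypothetical helper, for illustration only.
    local_upload_done: returns True once this rank has nothing left to upload.
    all_reduce_min: collective that returns the minimum flag across all ranks
        (assumed to be backed by something like torch.distributed.all_reduce).
    Returns the number of sync rounds that were needed.
    """
    rounds = 0
    while True:
        rounds += 1
        flag = 1 if local_upload_done() else 0
        # Every rank enters this short collective on every round, so idle
        # ranks are never stuck in a single collective for the whole upload.
        if all_reduce_min(flag) == 1:
            return rounds
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    # Single-process stand-in: the "collective" just returns the local flag,
    # and the upload is reported as finished on the third poll.
    polls = iter([False, False, True])
    n = wait_for_workers_with_sync(lambda: next(polls), lambda x: x, 0.0)
    print(f"finished after {n} sync rounds")
```

Under this assumption, the polling interval only needs to be comfortably shorter than the configured distributed timeout (`dist_timeout=20` in the manual test).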

Manual test:

composer train.py yamls/finetune/mpt-7b_dolly_sft.yaml max_duration=2ba global_train_batch_size=4 device_train_microbatch_size=1 save_folder=oci://mosaicml-internal-checkpoints/daniel/checkpoints/{run_name} eval_subset_num_batches=2 eval_first=False optimizer.name=decoupled_lionw model.pretrained=False dist_timeout=20 loggers.wandb={} model.config_overrides.d_model=256 model.config_overrides.n_layers=4 run_name=test-run

Before this PR, the run hits an NCCL timeout during FIT_END. After this PR, it finishes uploading the checkpoint, runs the final eval, and exits cleanly.

What issue(s) does this change relate to?

Closes CO-2176

Before submitting

  • [x] Have you read the contributor guidelines?
  • [x] Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • [x] Did you update any related docs and document your change?
  • [x] Did you update any related tests and add any new tests related to your change? (see testing)
  • [x] Did you run the tests locally to make sure they pass?
  • [x] Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

dakinggg, Jul 15 '23 08:07