Support torch async dist checkpoint

Open novahow opened this issue 7 months ago • 0 comments

Tracking issue

Closes flyteorg/flyte#5488

Why are the changes needed?

Currently, I believe we call torch.save and then upload the result to S3. As models get larger, saving synchronously isn't time-efficient, because training is blocked until the save and upload finish.

What changes were proposed in this pull request?

We use futures to run the save and upload in another thread so that the user can continue training. If the user saves again, we wait for the previous save and upload to finish before submitting the next save and upload request.
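To make the intended behavior concrete, here is a minimal sketch of the pattern described above. It is not the code in this PR: `AsyncCheckpointer` and the `save_and_upload` callable are hypothetical names, and the callable stands in for whatever actually serializes the checkpoint and uploads it.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional


class AsyncCheckpointer:
    """Run at most one save+upload at a time on a background thread.

    Hypothetical helper for illustration; not part of flytekit.
    """

    def __init__(self, save_and_upload: Callable[[Any], None]) -> None:
        # save_and_upload is an assumed callable that serializes the state
        # and uploads it (for example torch.save followed by an S3 upload).
        self._save_and_upload = save_and_upload
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def save(self, state: Any) -> Future:
        # Wait for the previous save+upload before submitting the next one.
        self.wait()
        self._pending = self._executor.submit(self._save_and_upload, state)
        return self._pending

    def wait(self) -> None:
        # Clear the pending slot first so a failed save is reported only once.
        pending, self._pending = self._pending, None
        if pending is not None:
            pending.result()  # re-raises any exception from the worker thread

    def close(self) -> None:
        # Flush the last checkpoint and stop the worker thread.
        self.wait()
        self._executor.shutdown(wait=True)
```

In a training loop, `save()` returns right away on the first call and blocks on later calls only while the previous checkpoint is still in flight; `close()` should be called at the end so the last checkpoint is not lost. One caveat with this sketch: the state handed to `save()` should be a snapshot (for example a CPU copy of the state dict), since training keeps mutating the live tensors while the background thread is writing.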

How was this patch tested?

N/A. I tried to run it on my local computer, but the machine was too low-end and crashed.

Setup process

Screenshots

Check all the applicable boxes

  • [ ] I updated the documentation accordingly.
  • [ ] All new and existing tests passed.
  • [ ] All commits are signed-off.

Related PRs

Docs link

novahow, Jul 25 '24 22:07