flytekit
Support torch async dist checkpoint
Tracking issue
Closes flyteorg/flyte#5488
Why are the changes needed?
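Currently, I think we save checkpoints with torch.save and then upload them to S3. As models get larger, saving synchronously becomes time-inefficient because training is blocked until the save and upload complete.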
What changes were proposed in this pull request?
We use futures to run the save and upload in another thread so that the user can continue training. If the user saves again, we wait for the previous save and upload to finish, then submit the next save and upload request.
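
The behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `AsyncCheckpointer`, `save_fn`, and `upload_fn` are hypothetical names, with `save_fn` standing in for something like `torch.save` and `upload_fn` for the S3 upload. It only uses `concurrent.futures` from the standard library.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional


class AsyncCheckpointer:
    """Hypothetical helper: runs save + upload on a background thread."""

    def __init__(
        self,
        save_fn: Callable[[Any, str], None],  # assumed: writes state to a local path
        upload_fn: Callable[[str], None],     # assumed: uploads the local path remotely
    ) -> None:
        self._save_fn = save_fn
        self._upload_fn = upload_fn
        # A single worker keeps save + upload requests strictly ordered.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def _save_and_upload(self, state: Any, path: str) -> None:
        self._save_fn(state, path)
        self._upload_fn(path)

    def save(self, state: Any, path: str) -> Future:
        # Wait for the previous save + upload before submitting the next one.
        # result() also re-raises any exception from the previous request.
        if self._pending is not None:
            self._pending.result()
        self._pending = self._executor.submit(self._save_and_upload, state, path)
        return self._pending

    def wait(self) -> None:
        # Block until the most recent request (if any) has finished.
        if self._pending is not None:
            self._pending.result()

    def close(self) -> None:
        self.wait()
        self._executor.shutdown(wait=True)
```

In a training loop, `save` returns immediately after submitting the request, and `close` (or `wait`) should be called before the task exits so the last checkpoint is not lost. One caveat under these assumptions: the state handed to `save` should be a snapshot (for example, a copy taken before training resumes), since the background thread may otherwise read values that training is still mutating.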
How was this patch tested?
n/a. I tried to run it on my local computer, but the machine was too low-end and crashed.
Setup process
Screenshots
Check all the applicable boxes
- [ ] I updated the documentation accordingly.
- [ ] All new and existing tests passed.
- [ ] All commits are signed-off.