
Flytekit checkpoint improvement - PyTorch

Open kumare3 opened this issue 1 year ago • 1 comments

Motivation: Why do you think this is important?

When using elastic, we can greatly improve checkpointing performance with the approach described in https://pytorch.org/blog/reducing-checkpointing-times/

Goal: What should the final outcome look like, ideally?

Checkpoints are faster

Describe alternatives you've considered

N/A

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

kumare3 · Jun 19 '24 04:06

To improve checkpointing performance in Flytekit for PyTorch, asynchronous checkpointing as described in the PyTorch blog is a viable approach. It reduces the training downtime caused by checkpointing by moving the final write off the critical path onto CPU threads, so GPU training can continue while the checkpoint is persisted.
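For anyone who wants to see the idea concretely, here is a minimal sketch of the pattern using only the Python standard library. It is not Flytekit's or PyTorch's implementation: `async_checkpoint`, `_write`, and the pickle file layout are hypothetical names chosen for illustration, and a plain dict stands in for a model state dict. The only blocking step is the in-memory copy; the disk write runs on a background thread. In a real PyTorch job the equivalent would more likely be `torch.distributed.checkpoint.async_save` (available in recent PyTorch releases), so treat this purely as an illustration of the control flow.

```python
import copy
import os
import pickle
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor

# One worker thread, so checkpoint writes are serialized and never overlap.
_writer = ThreadPoolExecutor(max_workers=1)


def _write(snapshot: dict, path: str) -> str:
    """Persist the snapshot; runs on the background thread."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(snapshot, f)
    # Rename only after the write finished, so readers never see a partial file.
    os.replace(tmp, path)
    return path


def async_checkpoint(state: dict, path: str) -> Future:
    """Hypothetical helper: copy the state now, write it to disk later."""
    # Blocking part: an in-memory copy, so later training steps
    # cannot mutate what is being written.
    snapshot = copy.deepcopy(state)
    # Non-blocking part: the slow disk write happens off the training loop.
    return _writer.submit(_write, snapshot, path)


if __name__ == "__main__":
    ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    state = {"step": 0, "weights": [0.0, 0.0]}  # stand-in for a model state dict
    pending = None
    for step in range(1, 4):
        state["step"] = step
        state["weights"] = [w + 0.1 for w in state["weights"]]  # "training"
        if pending is not None:
            pending.result()  # wait for the previous write before starting another
        pending = async_checkpoint(state, ckpt_path)
    pending.result()  # make sure the last checkpoint is on disk before exiting
    print("last checkpoint written to", ckpt_path)
```

Waiting on the previous future before submitting the next one keeps at most one extra copy of the state in memory, which matters when the state is large.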

Would you like more details on how to integrate this into Flytekit?

This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @runllm to ask follow-up questions. See runllm.com for more!

runllm[bot] · Jun 19 '24 04:06

Hello 👋, this issue has been inactive for over 90 days. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot] · May 18 '25 00:05

Hello 👋, this issue has been inactive for over 90 days and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot] · May 26 '25 00:05