Export Fine-Tuned LLM after Trainer is Complete

Open andreyvelich opened this issue 1 year ago • 10 comments

We discussed here: https://github.com/kubeflow/website/pull/3718#issuecomment-2096619898 that our LLM Trainer doesn't export the fine-tuned model, so users can't reuse that model for inference or other purposes.

We should discuss how users can get the fine-tuned artifact after the LLM Trainer is complete. /cc @kubeflow/wg-training-leads @deepanker13

Would be nice to see integration with Kubeflow Model Registry as well. cc @kubeflow/wg-data-leads

andreyvelich avatar May 06 '24 22:05 andreyvelich

Would be nice to see integration with Kubeflow Model Registry as well. cc @kubeflow/wg-data-leads

If there is a tutorial for the part of this project that exhibits the metadata we want to capture in Model Registry, I would be very happy to complement that example by indexing that metadata in MR! 🚀👍

tarilabs avatar May 07 '24 01:05 tarilabs

@andreyvelich I may have misunderstood the initial context of this API because I was under the impression that you could serve the model once fine-tuned. Can you elaborate on this?

So user can't re-use that model for inference or other purposes.

StefanoFioravanzo avatar May 07 '24 06:05 StefanoFioravanzo

@andreyvelich I may have misunderstood the initial context of this API because I was under the impression that you could serve the model once fine-tuned. Can you elaborate on this?

So user can't re-use that model for inference or other purposes.

I think right now the only way is to use output_dir for model checkpoints. In that case, the user can get the model from the PVC that we attach to the PyTorchJob, as in this example: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/language-modeling/train_api_hf_dataset.ipynb Right, @johnugeorge @deepanker13?

andreyvelich avatar May 07 '24 11:05 andreyvelich
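
A minimal sketch of the output_dir approach described above, assuming the PVC has already been mounted somewhere readable and that checkpoints follow the Hugging Face Trainer naming convention (checkpoint-<step> directories inside output_dir). The helper name latest_checkpoint and the mount path in the usage note are hypothetical, not part of the training-operator API.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoints are written as "checkpoint-<step>" directories,
# which is the Hugging Face Trainer convention for output_dir.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    best: Optional[Path] = None
    best_step = -1
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            # Compare numerically so checkpoint-1000 beats checkpoint-90.
            if step > best_step:
                best, best_step = child, step
    return best
```

For example, latest_checkpoint("/mnt/pvc/output") (a hypothetical mount path) would return the newest checkpoint directory, which could then be copied elsewhere or loaded for inference.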

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 05 '24 15:08 github-actions[bot]

/remove-lifecycle stale

andreyvelich avatar Aug 05 '24 16:08 andreyvelich

Per https://github.com/kubeflow/training-operator/issues/2101#issuecomment-2097204327, is there a tutorial or demo about this, please?

I would be very happy to integrate a demo/blueprint into the documentation; I just need a "seed" to get started with the training operator :) Thanks!

tarilabs avatar Aug 05 '24 19:08 tarilabs

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 03 '24 20:11 github-actions[bot]

/remove-lifecycle stale

andreyvelich avatar Nov 04 '24 15:11 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Feb 02 '25 20:02 github-actions[bot]

/remove-lifecycle stale

andreyvelich avatar Feb 03 '25 11:02 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar May 04 '25 15:05 github-actions[bot]

/remove-lifecycle stale

We've been discussing different strategies to address this community need:

  • https://github.com/kubeflow/model-registry/issues/891

/lifecycle frozen

tarilabs avatar May 04 '25 15:05 tarilabs

Currently, users can get the fine-tuned model from the PVC: https://www.kubeflow.org/docs/components/trainer/user-guides/builtin-trainer/torchtune/#get-the-fine-tuned-model.

/close

andreyvelich avatar Aug 10 '25 22:08 andreyvelich
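
One generic way to read files from a PVC, sketched here under assumptions and not taken from the linked guide, is to start a short-lived helper Pod that mounts the claim read-only and then copy the files out with kubectl cp. The function name pvc_reader_pod, the busybox image, the mount path, and the PVC name in the usage note are all hypothetical; only standard Kubernetes Pod fields are used.

```python
import json
from typing import Any, Dict


def pvc_reader_pod(
    pvc_name: str,
    namespace: str = "default",
    mount_path: str = "/workspace",
) -> Dict[str, Any]:
    """Build a Pod manifest that mounts an existing PVC read-only."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"{pvc_name}-reader", "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "reader",
                    # Assumption: any small image with sleep and tar works.
                    "image": "busybox",
                    "command": ["sleep", "3600"],
                    "volumeMounts": [
                        {
                            "name": "model",
                            "mountPath": mount_path,
                            "readOnly": True,
                        }
                    ],
                }
            ],
            "volumes": [
                {
                    "name": "model",
                    "persistentVolumeClaim": {
                        "claimName": pvc_name,
                        "readOnly": True,
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped to
    # "kubectl apply -f -". "my-finetune-pvc" is a placeholder name.
    print(json.dumps(pvc_reader_pod("my-finetune-pvc"), indent=2))
```

Once such a Pod is running, kubectl cp can copy the mounted directory to a local path. Whether the helper Pod can attach to the volume depends on the PVC's access mode and on whether the training Pods still hold it.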

@andreyvelich: Closing this issue.

In response to this:

Currently, users can get the fine-tuned model from the PVC: https://www.kubeflow.org/docs/components/trainer/user-guides/builtin-trainer/torchtune/#get-the-fine-tuned-model. /close

google-oss-prow[bot] avatar Aug 10 '25 22:08 google-oss-prow[bot]