training-operator Export Fine-Tuned LLM after Trainer is Complete

We discussed in https://github.com/kubeflow/website/pull/3718#issuecomment-2096619898 that our LLM Trainer doesn't export the fine-tuned model. So users can't re-use that model for inference or other purposes.

We should discuss how users can get the fine-tuned artifact after the LLM Trainer is complete. /cc @kubeflow/wg-training-leads @deepanker13

Would be nice to see integration with Kubeflow Model Registry as well. cc @kubeflow/wg-data-leads

May 06 '24 22:05 andreyvelich

Would be nice to see integration with Kubeflow Model Registry as well. cc @kubeflow/wg-data-leads

If there is a tutorial for the part of this project that exhibits the metadata we want to capture in Model Registry, I would be very happy to complement that example by indexing that metadata in MR! 🚀👍

May 07 '24 01:05 tarilabs

@andreyvelich I may have misunderstood the initial context of this API because I was under the impression that you could serve the model once fine-tuned. Can you elaborate on this?

So users can't re-use that model for inference or other purposes.

May 07 '24 06:05 StefanoFioravanzo

@andreyvelich I may have misunderstood the initial context of this API because I was under the impression that you could serve the model once fine-tuned. Can you elaborate on this?

So users can't re-use that model for inference or other purposes.

I think right now the only way is to use output_dir for model checkpoints. In that case, the user can get the model from the PVC that we attach to the PyTorchJob, like in this example: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/language-modeling/train_api_hf_dataset.ipynb Right, @johnugeorge @deepanker13?
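To make the output_dir approach concrete, here is a minimal sketch of picking the newest checkpoint once the PVC is mounted somewhere readable. It assumes the Hugging Face Trainer convention of writing `checkpoint-<step>` directories under output_dir; the function name and any mount path you pass in are hypothetical and not part of the training-operator API.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoints follow the Hugging Face Trainer naming scheme,
# i.e. one directory per save named "checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        # Skip files and unrelated directories such as logs.
        if match and child.is_dir():
            candidates.append((int(match.group(1)), child))

    if not candidates:
        return None
    # Compare by numeric step so that checkpoint-100 beats checkpoint-20.
    return max(candidates, key=lambda item: item[0])[1]
```

The returned path could then be handed to a loader such as `AutoModelForCausalLM.from_pretrained`, provided the checkpoint holds full model weights rather than only adapter weights.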

May 07 '24 11:05 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Aug 05 '24 15:08 github-actions[bot]

/remove-lifecycle stale

Aug 05 '24 16:08 andreyvelich

Per https://github.com/kubeflow/training-operator/issues/2101#issuecomment-2097204327, is there a tutorial/demo about this, please?

I would be very happy to integrate a demo/blueprint for the documentation, I just need a "seed" to get started on the training operator :) thanks!

Aug 05 '24 19:08 tarilabs

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Nov 03 '24 20:11 github-actions[bot]

/remove-lifecycle stale

Nov 04 '24 15:11 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Feb 02 '25 20:02 github-actions[bot]

/remove-lifecycle stale

Feb 03 '25 11:02 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

May 04 '25 15:05 github-actions[bot]

/remove-lifecycle stale

We've been discussing different strategies to address this community need:

https://github.com/kubeflow/model-registry/issues/891

/lifecycle frozen

May 04 '25 15:05 tarilabs

Currently, users can get the fine-tuned model from the PVC: https://www.kubeflow.org/docs/components/trainer/user-guides/builtin-trainer/torchtune/#get-the-fine-tuned-model

/close
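For readers who want to pull the files out of the PVC by hand, one possible approach is a short-lived helper Pod that mounts the claim read-only. The sketch below only builds the manifest; the PVC name, Pod name, image, and mount path are all hypothetical placeholders, and the linked guide remains the reference for what the Trainer actually creates.

```python
import json


def pvc_reader_pod(pvc_name: str, mount_path: str = "/workspace") -> dict:
    """Build a Pod manifest that mounts an existing PVC read-only."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "model-reader"},  # hypothetical Pod name
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "reader",
                    "image": "busybox",
                    # Keep the Pod alive long enough to copy files out.
                    "command": ["sleep", "3600"],
                    "volumeMounts": [
                        {"name": "model", "mountPath": mount_path, "readOnly": True}
                    ],
                }
            ],
            "volumes": [
                {
                    "name": "model",
                    # claimName must match the PVC that holds the model.
                    "persistentVolumeClaim": {"claimName": pvc_name, "readOnly": True},
                }
            ],
        },
    }


if __name__ == "__main__":
    # "my-finetune-pvc" is a placeholder; substitute the real claim name.
    print(json.dumps(pvc_reader_pod("my-finetune-pvc"), indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, after which `kubectl cp model-reader:/workspace ./fine-tuned-model` copies the contents locally. If the claim uses the ReadWriteOnce access mode and is still mounted by a training Pod, the helper Pod may need to wait or be scheduled on the same node.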

Aug 10 '25 22:08 andreyvelich

@andreyvelich: Closing this issue.

In response to this:

Currently, users can get the fine-tuned model from the PVC: https://www.kubeflow.org/docs/components/trainer/user-guides/builtin-trainer/torchtune/#get-the-fine-tuned-model. /close


Aug 10 '25 22:08 google-oss-prow[bot]