[Discussion] Future of Kubeflow LLM Trainer V2
What would you like to be added?
We'll use this issue to track our discussion about the future of Kubeflow LLM Trainer V2.
Background
Since Nov. 2024, we've been discussing how to implement Kubeflow LLM Trainer V2 to streamline and simplify the LLM fine-tuning lifecycle on Kubernetes, while offering a simple yet efficient Python API.
The initial discussion concluded in Mar. 2025. After two months of review, we adopted the TorchTune-based plan, converted it into KEP-2401, and started the implementation (tracking issue: https://github.com/kubeflow/trainer/issues/2401).
In July 2025, we successfully made the first release of Kubeflow Trainer V2, with Kubeflow LLM Trainer V2 as one of its highlights. It supports full fine-tuning of several popular open-source LLMs on Kubernetes; while still an early version, it functions as a complete module.
Also, I presented a talk at Kubeflow Virtual Symposium 2025 about Kubeflow LLM Trainer V2:
What happened
In mid-July, the TorchTune team announced that they would no longer add new features to TorchTune (https://github.com/pytorch/torchtune/issues/2883), which means the end of maintenance is around the corner.
We need to find a new way to support our LLM Trainer, providing users with a simple and flexible LLM fine-tuning experience on Kubernetes. We believe it's an important feature for data scientists who are not familiar with complex Kubernetes configurations.
Plan 1: Continue KEP-2401
Main Ideas: Continue the implementation of KEP-2401 without major modifications
Pros: Simple and easy
Cons: TorchTune faces deprecation and will lose support for new models and fine-tuning strategies
Plan 2: Deprecate TorchTune LLM Trainer
Main Ideas: Deprecate the TorchTune LLM Trainer in Kubeflow Trainer V2 and look for a new low-level runtime for LLM fine-tuning
Pros: Switch to a new framework quickly, avoiding the impact of TorchTune's deprecation
Cons: Needs several months to investigate existing frameworks, draft proposals, implement code, test functionality, and write user documentation. This means users would have to wait several months to a year for a new LLM Trainer
Plan 3: Implement Dynamic LLM Trainer Framework
Main Ideas: Do not deprecate the TorchTune LLM Trainer immediately; instead, start implementing a dynamic LLM Trainer framework that can support multiple backends, such as TorchTune, trl, unsloth, and llama-factory, while continuing to add new features and popular model support to the TorchTune LLM Trainer.
Pros: Maintains the usability of the LLM Trainer while remaining open to new LLM fine-tuning backends
Cons: Needs careful design and a redesign of the current code base (which is still small)
Why is this needed?
Please let me know which plan you prefer. Thanks a lot!
Slack thread: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1753353947170879
/cc @kubeflow/wg-training-leads @astefanutti @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @kramaranya @eoinfennessy @szaher @rudeigerc @mahdikhashan
Love this feature?
Give it a 👍 We prioritize the features with the most 👍
/remove-label lifecycle/needs-triage /area runtimes /area llm
Just my two cents here but I'd go by which frameworks are the most popular. 🤷
From a quick investigation, that looks like trl.
https://pypistats.org/packages/trl https://pypistats.org/packages/unsloth
It's an imperfect set of data of course because torchtune ends up showing up as smaller than trl.
https://pypistats.org/packages/torchtune
@franciscojavierarceo Thanks for the valuable feedback! It's a good point, and we'll definitely consider trl as one of our supported frameworks.
Copying my previous thread from Slack here to get feedback from folks on GitHub.
Since the sunsetting of torchtune, I've been exploring the integration of other fine-tuning frameworks with Trainer, including llamafactory, unsloth, and Hugging Face's trl. I think they work well with torchrun scripts as expected (although I haven't fully experimented with different scenarios).
We chose llamafactory to integrate into our internal LLM platform, mainly because it provides cutting-edge support for additional models that some of our customers may need. It has done quite well at bridging the gap between professional and non-professional users.
I also think trl is a good choice, since it is officially maintained by Hugging Face and is therefore connected more tightly to their ecosystem. It also provides more extensibility for customizing the training process seamlessly. Besides, we could also consider integrating RL fine-tuning, since it may be on our roadmap as well.
Some references that may be helpful:
- LLM fine-tuning | LLM Inference Handbook by BentoML
- Open Source RL Libraries for LLMs by Anyscale
+1 for Plan 3.
Also, beyond the selected libraries we want to provide in-tree, the Trainer Python SDK should ideally be made extensible so external libraries / frameworks can be used, symmetrically to what the TrainingRuntime API offers on the control plane side.
+1 for Plan 3
Someone asked about the next post-training library in torchtitan: https://github.com/pytorch/torchtitan/issues/1771
Successor of this discussion: https://github.com/kubeflow/trainer/issues/2839
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
Hi, I'm interested in contributing to LLaMA-Factory support. I tested locally on 8GB RAM (CPU-only):
- SmolLM-135M LoRA: worked, 6.1GB peak
- Qwen2-0.5B LoRA: OOM

I saw that @rudeigerc already uses llamafactory internally. Happy to help with the implementation. Questions:
- Should I look at #2839 instead of this issue?
- Where should I start: a docs PR or a prototype? I have some PRs in Notebooks and Pipelines already.
Thank you for your interest. Yes, let's collaborate together in #2839 /close
@andreyvelich: Closing this issue.
In response to this:
Thank you for your interest. Yes, let's collaborate together in #2839 /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.