
[Discussion] Future of Kubeflow LLM Trainer V2

Open Electronic-Waste opened this issue 8 months ago • 10 comments

What you would like to be added?

We'll use this issue to track our discussion about the future of Kubeflow LLM Trainer V2.

Background

Since November 2024, we've been discussing how to implement Kubeflow LLM Trainer V2 to streamline and simplify the LLM fine-tuning lifecycle on Kubernetes, while offering a simple yet efficient Python API.

The initial discussion ended in March 2025. We eventually adopted the TorchTune-based plan, converted it into KEP-2401 after two months of review, and have been implementing it since then (tracking issue: https://github.com/kubeflow/trainer/issues/2401).

In July 2025, we successfully made the first release of Kubeflow Trainer V2, with Kubeflow LLM Trainer V2 as one of its highlights. It supports full fine-tuning of several popular open-source LLMs on Kubernetes; though still very early, it functions as a complete module.

Also, I presented a talk at the Kubeflow Virtual Symposium 2025 about Kubeflow LLM Trainer V2:

What happened

In mid-July 2025, the TorchTune team announced that they would no longer add new features to TorchTune (https://github.com/pytorch/torchtune/issues/2883), which means the end of maintenance is around the corner.

We need to find a new way to support our LLM Trainer, providing users with a simple and flexible LLM fine-tuning experience on Kubernetes. We believe this is an important feature for data scientists who are not familiar with complex Kubernetes configurations.

Plan 1: Continue KEP-2401

Main Ideas: Continue the implementation of KEP-2401, with modifications as needed

Pros: Simple and easy

Cons: TorchTune faces deprecation and will lose support for new models and fine-tuning strategies

Plan 2: Deprecate TorchTune LLM Trainer

Main Ideas: Deprecate the TorchTune LLM Trainer from Kubeflow Trainer V2 and seek a new low-level runtime for LLM fine-tuning

Pros: Switch to a new framework quickly, avoiding the impact of TorchTune's deprecation

Cons: We would need several months to investigate existing frameworks, draft proposals, implement code, test functionality, and write user documentation. Users would have to wait several months to a year to use the new LLM Trainer

Plan 3: Implement Dynamic LLM Trainer Framework

Main Ideas: We do not deprecate the TorchTune LLM Trainer immediately, but start implementing a dynamic LLM Trainer framework that can support multiple backends (such as TorchTune, trl, unsloth, and llama-factory), while adding new features and popular model support to the TorchTune LLM Trainer.

Pros: Maintains the usability of the LLM Trainer while staying open to new LLM fine-tuning backends

Cons: Needs careful design and a re-design of the current code base (which is still small)
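To make Plan 3 concrete, here is a minimal sketch of what a backend-agnostic dispatch layer could look like. All names here (`FineTuneJob`, `Backend`, `build_command`, the `sft.py` entrypoint) are hypothetical and do not come from the Kubeflow Trainer code base; the TorchTune recipe name is illustrative only.

```python
# Hypothetical sketch: one small adapter per low-level fine-tuning framework,
# all satisfying the same structural interface, so the Trainer can pick a
# backend at runtime without the user touching Kubernetes configuration.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class FineTuneJob:
    model: str        # e.g. a Hugging Face model ID
    dataset: str      # dataset reference
    num_nodes: int = 1


class Backend(Protocol):
    name: str

    def build_command(self, job: FineTuneJob) -> list[str]:
        """Return the command the training pods should run."""
        ...


class TorchTuneBackend:
    name = "torchtune"

    def build_command(self, job: FineTuneJob) -> list[str]:
        # torchtune ships a `tune run` CLI; the recipe and flags here are illustrative.
        return ["tune", "run", "full_finetune_distributed", "--config", job.model]


class TRLBackend:
    name = "trl"

    def build_command(self, job: FineTuneJob) -> list[str]:
        # trl training scripts are plain torchrun entrypoints; `sft.py` is a placeholder.
        return [
            "torchrun", f"--nnodes={job.num_nodes}", "sft.py",
            "--model_name_or_path", job.model, "--dataset_name", job.dataset,
        ]


BACKENDS: dict[str, Backend] = {b.name: b for b in (TorchTuneBackend(), TRLBackend())}


def build_command(backend: str, job: FineTuneJob) -> list[str]:
    return BACKENDS[backend].build_command(job)
```

Adding a new backend then means writing one adapter class rather than reworking the Trainer itself, which is the main appeal of this plan.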

Why is this needed?

Please let me know which plan you prefer. Thanks a lot!

Slack thread: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1753353947170879

/cc @kubeflow/wg-training-leads @astefanutti @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @kramaranya @eoinfennessy @szaher @rudeigerc @mahdikhashan

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

Electronic-Waste avatar Jul 24 '25 10:07 Electronic-Waste

/remove-label lifecycle/needs-triage /area runtimes /area llm

Electronic-Waste avatar Jul 24 '25 10:07 Electronic-Waste

Just my two cents here but I'd go by which frameworks are the most popular. 🤷

From a quick investigation, that looks like trl.

https://pypistats.org/packages/trl https://pypistats.org/packages/unsloth

It's an imperfect set of data, of course, but torchtune shows up as smaller than trl.

https://pypistats.org/packages/torchtune

franciscojavierarceo avatar Jul 24 '25 12:07 franciscojavierarceo

@franciscojavierarceo Thanks for the valuable feedback! That's a good point, and we'll definitely consider trl as one of our supported frameworks.

Electronic-Waste avatar Jul 24 '25 13:07 Electronic-Waste

Providing a copy of my previous thread in Slack to get feedback from folks on GitHub.

Since the sunsetting of torchtune, I've been exploring the integration of other fine-tuning frameworks with Trainer, including llamafactory, unsloth, and Hugging Face's trl. I think they work well with torchrun scripts as expected (although I haven't fully experimented with them in different scenarios).

We chose llamafactory to integrate into our internal LLM platform, mainly because it provides cutting-edge support for some models that our customers may need. It does quite well at bridging the gap between professional and non-professional users.

I also think trl is a good choice, since it is officially maintained by Hugging Face and is therefore more tightly connected to their ecosystem. It also provides more extensibility for seamlessly customizing the training process. Besides, we could consider integrating RL fine-tuning, since that may also be on our roadmap.

Some references that may be helpful:

rudeigerc avatar Jul 24 '25 14:07 rudeigerc

+1 for Plan 3.

Also, beyond the selected libraries we want to provide in-tree, the Trainer Python SDK should ideally be made extensible so external libraries / frameworks can be used, symmetrically to what the TrainingRuntime API offers on the control plane side.
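One common way to get that kind of SDK extensibility is a plain registry that out-of-tree packages populate at import time. The sketch below is illustrative only; `register_backend` and `MyFrameworkBackend` are hypothetical names, not part of the Trainer SDK, and a real implementation would more likely discover backends via Python entry points.

```python
# Hypothetical sketch: a registry that lets third-party packages plug their
# own fine-tuning backends into the SDK without changes to the in-tree code.
from typing import Callable, Dict

_BACKENDS: Dict[str, type] = {}


def register_backend(name: str) -> Callable[[type], type]:
    """Class decorator: register a backend implementation under `name`."""
    def decorator(cls: type) -> type:
        _BACKENDS[name] = cls
        return cls
    return decorator


@register_backend("my-framework")
class MyFrameworkBackend:
    """An out-of-tree backend shipped by a third-party package."""

    def build_command(self, model: str) -> list[str]:
        # The CLI name is a placeholder for whatever the external framework ships.
        return ["my-framework-cli", "finetune", "--model", model]


def get_backend(name: str):
    """Look up and instantiate a registered backend by name."""
    return _BACKENDS[name]()
```

This mirrors, on the SDK side, the symmetry astefanutti describes with the TrainingRuntime API on the control plane: users of an external framework only need to install its package and reference the backend by name.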

astefanutti avatar Jul 24 '25 14:07 astefanutti

+1 for Plan 3

Leoauro avatar Jul 24 '25 14:07 Leoauro

Someone asked about the next post-training library in torchtitan: https://github.com/pytorch/torchtitan/issues/1771

Electronic-Waste avatar Oct 02 '25 13:10 Electronic-Waste

Successor of this discussion: https://github.com/kubeflow/trainer/issues/2839

Electronic-Waste avatar Oct 02 '25 14:10 Electronic-Waste

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Dec 31 '25 15:12 github-actions[bot]

/remove-lifecycle stale

andreyvelich avatar Jan 03 '26 20:01 andreyvelich

Hi, I'm interested in contributing to LLaMA-Factory support. I tested locally on 8GB RAM (CPU-only):

  • SmolLM-135M LoRA: worked, 6.1GB peak
  • Qwen2-0.5B LoRA: OOM

I saw that @rudeigerc already uses llamafactory internally. Happy to help with the implementation. Questions:

  1. Should I look at #2839 instead of this issue?
  2. Where should I start: a docs PR or a prototype? I have some PRs in Notebooks and Pipelines already.

Sapthagiri777 avatar Jan 22 '26 11:01 Sapthagiri777

Thank you for your interest. Yes, let's collaborate together in #2839 /close

andreyvelich avatar Jan 22 '26 13:01 andreyvelich

@andreyvelich: Closing this issue.

In response to this:

Thank you for your interest. Yes, let's collaborate together in #2839 /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Jan 22 '26 13:01 google-oss-prow[bot]