KEP-2170: Design Trainer for the LLM Runtimes

Open · andreyvelich opened this issue 1 year ago · 6 comments

As part of the Kubeflow Training V2 work, we should design and implement a custom Trainer to fine-tune the LLMs that we plan to support via TrainingRuntimes in Kubeflow upstream.

We should discuss whether to use native PyTorch APIs or HuggingFace Transformers in the LLM Trainer implementation.

The Trainer should allow users to configure LoRA, QLoRA, FSDP, and other important configurations.
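For illustration, here is a minimal sketch of the configuration surface such a Trainer would need to expose, assuming a HuggingFace Transformers + PEFT backend. The model name and every hyperparameter value here are illustrative, not a committed Kubeflow API:

```python
# Minimal sketch of the LoRA/QLoRA/FSDP knobs the Trainer would expose,
# assuming a HuggingFace Transformers + PEFT backend. Model name and
# hyperparameters are illustrative, not a committed Kubeflow API.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-1B"  # example model from the recipes above

# QLoRA: load the frozen base model in 4-bit to cut GPU memory.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config
)

# LoRA: train small low-rank adapters instead of the full weight matrices.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# FSDP: shard parameters across workers. These TrainingArguments flags map
# onto torch.distributed FSDP options when launched with torchrun/accelerate.
# Note that 4-bit quantization and FSDP don't combine freely, so the Trainer
# would have to validate which combinations users may request.
training_args = TrainingArguments(
    output_dir="/workspace/output",
    per_device_train_batch_size=1,
    bf16=True,
    fsdp="full_shard auto_wrap",
)
```

Whichever backend is chosen, the Trainer's main job is translating user-facing fields like these into the backend's own config objects.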

Useful resources:

  • LLM Trainer implementation in the Kubeflow Training V1
  • Recipes to fine-tune Llama models
  • Updated Llama recipes that we use to fine-tune Llama 3.2 - 1B: https://github.com/andreyvelich/llama-recipes/tree/kubeflow-llama

Part of: https://github.com/kubeflow/training-operator/issues/2170

Design Doc

Initial design doc from @Electronic-Waste where we can brainstorm ideas: https://docs.google.com/document/d/1a4xWGVWZo43QKv8tIomoK_XHzBMC_byXBnDb0104htQ/edit?tab=t.0

cc @saileshd1402 @deepanker13 @kubeflow/wg-training-leads

Love this feature?

Give it a 👍. We prioritize the features with the most 👍

andreyvelich · Nov 05 '24

/assign @saileshd1402

We are experimenting with some PyTorch-native and Transformers APIs to design this Trainer.
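For context, a minimal sketch of what the PyTorch-native path looks like, assuming a torchrun launch so the usual rendezvous environment variables are already set; the model here is a stand-in, not the Trainer's actual code:

```python
# Minimal PyTorch-native sketch: wrap a model in FSDP directly, assuming
# torchrun set RANK/LOCAL_RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT for us.
# The model and sizes are illustrative stand-ins for the LLM being tuned.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Transformer(d_model=512, nhead=8).cuda()
model = FSDP(model)  # shards params, grads, and optimizer state across ranks

# Build the optimizer after wrapping so it sees the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# ... training loop elided: forward, loss, backward, optimizer.step() ...
```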

andreyvelich · Nov 08 '24

@andreyvelich: GitHub didn't allow me to assign the following users: saileshd1402.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

> /assign @saileshd1402
>
> We are experimenting with some PyTorch-native and Transformers APIs to design this Trainer.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] · Nov 08 '24

/assign

saileshd1402 · Nov 08 '24

What concerns us most 👀:

  • Whether we need to support multiple launchers for LLM fine-tuning, such as torchrun and accelerate (see the sketch after this list).

  • Which frameworks we should support in the torchrun launcher:

    • Just adopt the torchtune API by reusing its recipes and configs.
    • Or support multiple backends, e.g., HuggingFace Transformers, the torchtune API, NVIDIA NeMo.
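To make the launcher question concrete, here is a hypothetical sketch of how a Trainer could render the same job for either launcher. torchrun and `accelerate launch` are real CLIs, but these helper functions and the exact flag sets are illustrative assumptions:

```python
# Hypothetical helpers (not a Kubeflow API) showing how one training job
# could be rendered for either launcher. The $(...) placeholders stand for
# environment variables the runtime would inject into each worker Pod.
from typing import List

def torchrun_cmd(nnodes: int, nproc_per_node: int, script: str) -> List[str]:
    # torchrun does its own rendezvous via the --rdzv_* flags.
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        "--rdzv_backend=c10d",
        "--rdzv_endpoint=$(MASTER_ADDR):$(MASTER_PORT)",
        script,
    ]

def accelerate_cmd(num_machines: int, num_processes: int, script: str) -> List[str]:
    # `accelerate launch` wraps torch.distributed and can also drive
    # FSDP/DeepSpeed options from its own config file.
    return [
        "accelerate", "launch",
        f"--num_machines={num_machines}",
        f"--num_processes={num_processes}",
        "--machine_rank=$(RANK)",
        "--main_process_ip=$(MASTER_ADDR)",
        "--main_process_port=$(MASTER_PORT)",
        script,
    ]
```

Supporting both launchers would mean maintaining two such mappings, plus their env-var contracts, which is exactly the trade-off in the first bullet above.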

Please refer to the Kubeflow Training V2 LLM Trainer Design Doc for design details. :)

/cc @kubeflow/wg-training-leads @doris-xm @astefanutti @helenxie-bit @tariq-hasan @akshaychitneni @varshaprasad96 @tarekabouzeid @tarat44 @Syulin7 @sandipanpanda @mszadkow @akhilsaivenkata @tico88612 @danielsuh05 @kannon92 @gavrissh @saileshd1402 @ckyuto @Veer0x1 @astefanutti @oksanabaza @YosiElias @sophie0730 @seanlaii @Bobbins228 @droctothorpe @lowang-bh @mimowo @hkiiita @ChristopheBrown @harshithbelagur @marcmaliar @deepanker13

Electronic-Waste · Jan 09 '25

> Useful resources:

This link is broken; the whole Python directory hierarchy is missing from the given repo.

eero-t · Feb 14 '25

Good catch @eero-t, I've updated the description.

andreyvelich · Feb 14 '25

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] · May 15 '25

This is tracked here: https://github.com/kubeflow/trainer/issues/2401

andreyvelich · May 15 '25