[Experimental Feature] Huggingface model training
Hi, as discussed in https://github.com/pytorch/torchtitan/issues/903, this PR adds support for training a llama model loaded directly from HF via `AutoModelForCausalLM`, and for loading safetensors (HF weights) in an online-sharding manner.
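At a high level, the loading path looks roughly like the sketch below (illustrative only, not the exact code in this PR; the model directory is a placeholder, and the per-rank tensor slicing that makes it "online sharding" is omitted):

```python
# Illustrative sketch: build the model on the meta device with
# AutoModelForCausalLM, then stream HF safetensors shards into it.
# `model_dir` is a placeholder; the per-rank slicing ("online sharding")
# used in this PR is omitted for brevity.
import glob

import torch
from safetensors import safe_open
from transformers import AutoConfig, AutoModelForCausalLM


def build_and_load(model_dir: str) -> torch.nn.Module:
    config = AutoConfig.from_pretrained(model_dir)
    # Build the module structure without materializing any weights.
    with torch.device("meta"):
        model = AutoModelForCausalLM.from_config(config)
    # Allocate real (uninitialized) storage, then fill it from the shards.
    model.to_empty(device="cpu")
    state_dict = model.state_dict()
    with torch.no_grad():
        for shard in sorted(glob.glob(f"{model_dir}/*.safetensors")):
            with safe_open(shard, framework="pt", device="cpu") as f:
                for name in f.keys():
                    if name in state_dict:
                        state_dict[name].copy_(f.get_tensor(name))
    # Caveat: non-persistent buffers (e.g. RoPE frequencies) are not stored in
    # the checkpoint and would need re-initialization; omitted here.
    return model
```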
- Test loading safetensors: `pytest test_loading_hf_weights.py`. Here are my results:
- Training: `LOG_RANK=7 bash run_train.sh` (FSDP 2 - PP 2 - TP 2). Here are my results:
Hi @junjzhang!
Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
This PR is super interesting to me! Instead of supporting a ton of individual models, a quick way to enable TorchTitan is to use models loaded from Huggingface and apply the various optimization techniques.
Yes! The weight-loading function in this PR is general enough; the only cost of adapting a new model is implementing a function that applies parallelism, and possibly patching some forward functions to support PP.
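For example, a minimal sketch of such a parallelism-applying function for the HF llama layout could look like the following (illustrative only; submodule names are assumed from `LlamaForCausalLM`, and real code must also handle attention-head sharding, embeddings, the LM head, and composition with FSDP/PP):

```python
# Sketch only: apply a tensor-parallel plan to the HF LlamaForCausalLM
# decoder layers using the public torch.distributed.tensor.parallel API.
# Submodule names are assumed from the HF llama implementation.
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


def apply_tp(model, tp_mesh):
    layer_plan = {
        "self_attn.q_proj": ColwiseParallel(),
        "self_attn.k_proj": ColwiseParallel(),
        "self_attn.v_proj": ColwiseParallel(),
        "self_attn.o_proj": RowwiseParallel(),
        "mlp.gate_proj": ColwiseParallel(),
        "mlp.up_proj": ColwiseParallel(),
        "mlp.down_proj": RowwiseParallel(),
    }
    for layer in model.model.layers:
        parallelize_module(layer, tp_mesh, layer_plan)
    return model
```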
I see you are almost creating a copy of the entire torchtitan under this folder. Instead, could you reuse existing files and functions (via import) as much as possible? E.g. I can tell that `hf_weights_utils.py`, the `parallelize`/`pipeline` fns, the toml config, and the test files cannot be directly reused; for the other parts, can we reuse them, including `train.py`? Even for the `parallelize`/`pipeline` fns, we can have standalone files, but we should depend on functions in torchtitan llama, e.g. https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama/parallelize_llama.py, as much as possible.
Reusing code is something I've always kept in mind, and I've tried my best.
- As for `train.py`: the original torchtitan version writes everything in one main function, so it's hard to reuse since I need to change some lines inside that function.
- As for the `parallelize`/`pipeline` fns: it's also hard to reuse the original code for the same reason; I have to change the parallel plan, but it is hard-coded in titan's function.
- As for the Dataset: for the same reason, I have to return extra `position_ids` for HF's llama (see the sketch after this list). Maybe I could reuse more of the dataset code via monkey patching.
- As for the loss: same reason.

I think it would be more graceful if titan could refactor these functions and extract some common patterns; then I could reuse as much code as possible. Anyway, I'll take a look and see whether I can reuse more code here.
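For the `position_ids` point, a hypothetical wrapper (not this PR's code; the class name and the assumed `(input_ids, labels)` sample layout are illustrative) could look like:

```python
# Hypothetical wrapper: attach the position_ids that HF's llama forward
# expects to each sample of an existing dataset. Assumes the base dataset
# yields (input_ids, labels) pairs of shape [seq_len].
import torch
from torch.utils.data import Dataset


class WithPositionIds(Dataset):
    def __init__(self, base: Dataset):
        self.base = base

    def __len__(self) -> int:
        return len(self.base)

    def __getitem__(self, idx):
        input_ids, labels = self.base[idx]
        position_ids = torch.arange(input_ids.shape[-1], dtype=torch.long)
        return {
            "input_ids": input_ids,
            "labels": labels,
            "position_ids": position_ids,
        }
```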
@tianyu-l Hi, as stated before, I've pushed a new commit that tries to reuse Titan's code in `dataset.py` and `parallelize_llama.py`. If further reuse is required, I suppose these Titan functions should be refactored to take more arguments or be decomposed into multiple reusable methods.
I also ran training (FSDP2 PP2 TP2) to verify the correctness of this commit:
@tianyu-l Hi, I've updated the readme in https://github.com/pytorch/torchtitan/pull/919/commits/91838de487f807bacf4a38fb211d1a8133016fca and replied to all your comments. Please have a look.
Can I ask a general question? What is the motivation for this PR? It definitely makes sense to load pretrained llama weights into torchtitan to train with its parallelisms.
However, in addition to loading the huggingface weights, this PR also uses the huggingface model definition. The huggingface model definition requires patching (of forward) and modifications to the torchtitan parallelization code. I have a couple of specific questions:
- is there a reason you prefer to use the hf llama3 model definition code instead of the torchtitan llama3 code? (If we could load hf weights into torchtitan llama3 would that be just as good?)
- since you are using hf model code and hf model weights, why is there a need for customized save/load state dict features?
Thanks!
@wconstab PP issue fixed and tested with https://github.com/pytorch/torchtitan/pull/919/commits/accfa1f31834372323bbfd14104753ead8905a8d#diff-f3ae151a2c757861e79894d650697214e593188396134823440371457ed71ed3.
@tianyu-l CP tested with FSDP2 TP2 CP2.
Is there a plan to deduplicate the code from the main TorchTitan? What's the motivation for duplicating `main.py` or `train()`? Is it because of `state_dict` loading? If so, we can discuss how to make the checkpointer support this feature.
As mentioned before, I need to revise some lines of `main()` to use `AutoModelForCausalLM`, make the model compilable, adapt the PP input, load the state dict, etc. Since main is a very large function, I cannot reuse it. I think a better approach is to decompose the main function into separate common methods, like `build_pp_models` or `run_forward_and_backward` (a rough sketch follows below). Refactoring is needed here to make it general enough to reuse. I'd like to discuss with you how to refactor titan's methods so the HF model training feature can be added with minimal code.
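To make this concrete, here is a rough sketch of the kind of decomposition I mean (hook names like `build_model` and `forward_backward` are illustrative, not torchtitan APIs; the loss and batch layout are assumed):

```python
# Illustrative decomposition only: a trainer whose model construction and
# forward/backward step are separate, overridable hooks, so an HF-model
# variant can swap in AutoModelForCausalLM without copying the whole loop.
import torch
import torch.nn.functional as F


class BaseTrainer:
    def build_model(self) -> torch.nn.Module:
        # torchtitan llama here; an HF subclass would return
        # AutoModelForCausalLM.from_config(...) instead.
        raise NotImplementedError

    def forward_backward(self, model, batch):
        # Assumes the model returns logits for [batch, seq_len] token inputs;
        # an HF subclass would read `output.logits` and pass position_ids.
        inputs, labels = batch
        logits = model(inputs)
        loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
        loss.backward()
        return loss

    def train_loop(self, model, dataloader, optimizer, steps: int):
        for _, batch in zip(range(steps), dataloader):
            optimizer.zero_grad()
            loss = self.forward_backward(model, batch)
            optimizer.step()
```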
@junjzhang Sounds good. I'm refactoring `train.py`. Let's discuss how we can make it more general. I don't expect that HF models or other models can just adopt the original `train.py`, even after the refactor, but I hope we can at least reuse as much as we can. My next step is to land the metrics refactor PR, https://github.com/pytorch/torchtitan/pull/945, and publish the next refactor PR on Mar 10.
Hi! Any updates?
@junjzhang Are you still interested in rebasing the PR and resolving the feedback to get it merged?
We've done some refactoring of `train.py`, and https://github.com/pytorch/torchtitan/pull/1238 does a bit more, including setting up a `forward_backward_step()` function.
will have a look later
Interested in this support, @junjzhang are you still working on this PR?
I encountered several issues in a scaled production environment; I may clean up my code and PR later.
@junjzhang Sorry to bother you. Have you followed this approach in your production setup? I would like to use something similar on my team to iterate quickly over models, but I'm worried about a possible performance penalty.
Any tips or pointers would be welcome, thank you.
@tomiock @aw632 @JaheimLee
FYI we have some ongoing work to support this in general. cc @3outeille to see if you could share progress.
Hello, I have a draft PR here: https://github.com/huggingface/torchtitan/pull/1
For now, I have made sure that llama3 and deepseek_v3 from HF are almost equivalent to their torchtitan counterparts in terms of loss and grad_norm (without parallelism yet). I am now making sure that they converge with parallelism involved.
Once this is done, I think we can have a first PR merged. Then I will make sure to handle more HF models out of the box in a follow-up PR.
- Llama3 (red TT vs green HF)
- Deepseek V3 (red TT vs green HF)
Small update: I now have 4D parallelism working. However, whenever pipeline parallelism is combined with other parallelisms, the loss does not exactly match the torchtitan counterpart (though it still decreases properly over time). I'm looking into why this happens.
some updates here: https://github.com/huggingface/torchtitan/pull/1#issue-3389772013
Referencing: https://github.com/pytorch/torchtitan/pull/2048