[Possible PR discussion] Would a PR for training HF models be welcome?
Hi! We are in the process of developing a novel training framework for Reinforcement Learning (RL), following TorchTitan. Recently, we added a feature that supports training directly from Hugging Face (HF) models, loading safetensors in an online, sharded fashion. This can substantially cut down the cost of adapting a new model: all you have to implement is the function that applies parallelism. Given this, I wonder whether a PR with the relevant code and an example of training Hugging Face's Llama model would be welcome. I think this addition would benefit many in the community. By the way, during my testing I found that the HF Llama model demonstrates TPS competitive with the model implemented in TorchTitan.
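To sketch the "online sharded" loading idea (this is an illustrative toy, not our actual code): HF sharded checkpoints ship a `model.safetensors.index.json` whose `weight_map` maps each parameter name to the shard file containing it, so a loader can open only the shards holding parameters owned by the local rank instead of materializing the full checkpoint.

```python
import json
from collections import defaultdict

def shards_for_params(index_json: str, wanted: set) -> dict:
    """Group the requested parameter names by the shard file holding them.

    Hypothetical helper: given the text of model.safetensors.index.json and
    the set of parameter names this rank owns, return {shard_file: [names]}
    so only those shard files need to be opened.
    """
    weight_map = json.loads(index_json)["weight_map"]
    by_shard = defaultdict(list)
    for name in sorted(wanted):
        by_shard[weight_map[name]].append(name)
    return dict(by_shard)

# Toy index describing a two-shard checkpoint (file names are illustrative).
index = json.dumps({"weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors",
}})

# A rank that only owns the embedding and the LM head touches both shards,
# but never loads layers it does not own.
plan = shards_for_params(index, {"lm_head.weight", "model.embed_tokens.weight"})
```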
Hi @junjzhang - I can only speak my own opinion, but generally anything that helps Titan enable RL-type training would be of significant interest. We are also opening up a new "experimental" folder with the idea of giving more contributions a home ... so that's another angle that may help your PR land. The first PR landing there currently also uses HF aspects for reference (see https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/deepseek_v3/attn_mask_utils.py).
Thus, while I don't think anyone can say an unseen PR will 100% be accepted, I can say it would definitely be of interest, and I think it would be worth the effort to post the PR so it can be reviewed/discussed/considered for inclusion.
Thanks very much for opening up the discussion!
Maybe @tianyu-l can weigh in here as well.
Thanks for replying! I'll clean up my code and make a draft PR to the experiments dir first!
Hey @junjzhang thanks for proposing! We agree this feature is good to have.
As @lessw2020 suggested, let's create a new folder hosting HF training under the experiments folder:
- load HF model weights
- showcase an example of training by "implementing the parallelism applying function", reusing `TrainSpec`
- support converting weights back to HF formats
Relevant discussions:
- Llama models with custom configurations and uploading to Hugging Face: https://github.com/pytorch/torchtitan/issues/420
- Model init with HuggingFace model: https://github.com/pytorch/torchtitan/issues/743
- Mitigation to HuggingFace Trainer: https://github.com/pytorch/torchtitan/issues/824
Maybe we can work on this project with other people who've shown interest and made offline progress. cc: @yzhangcs @neeldani @huyiwen @bkchang
I've finished features 1 and 2, and I think feature 3 can be implemented easily by reusing `PreTrainedModel`'s weight-saving logic (`save_pretrained`). I'll try to clean up the relevant code and open a PR this week. BTW, this feature will introduce extra requirements such as `transformers`. How would you expect this to be handled in the experiments dir?
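For feature 3, the core of the conversion is remapping torchtitan-style parameter names to the HF names before handing the state dict to HF serialization. A minimal sketch (the name pairs below are illustrative examples, not the full real mapping, and values stand in for tensors):

```python
def to_hf_state_dict(state_dict: dict, name_map: dict) -> dict:
    """Rename state-dict keys per name_map; unmapped keys pass through.

    Hypothetical helper: the renamed dict could then be saved in HF format
    (e.g. via PreTrainedModel.save_pretrained, not shown here).
    """
    return {name_map.get(k, k): v for k, v in state_dict.items()}

# Illustrative torchtitan-style names on the left, HF-style names on the right.
titan_sd = {"tok_embeddings.weight": "E", "layers.0.attention.wq.weight": "Q"}
mapping = {
    "tok_embeddings.weight": "model.embed_tokens.weight",
    "layers.0.attention.wq.weight": "model.layers.0.self_attn.q_proj.weight",
}
hf_sd = to_hf_state_dict(titan_sd, mapping)
```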
@lessw2020 @tianyu-l Could you review this PR https://github.com/pytorch/torchtitan/pull/919 ?
Hi @junjzhang - yes, just saw it. Thanks for the PR, will take a look today!
Thanks for the PR. I left some comments.