[Possible PR discussion] Would a PR for training HF models be welcome?
Hi! We are in the process of developing a novel training framework for Reinforcement Learning (RL), following TorchTitan. Recently, we added a feature that supports training directly from Hugging Face (HF) models, loading safetensors in an online, sharded fashion. This can substantially cut down the cost of adapting a new model: all you have to implement is the function that applies parallelism. Given this, I wonder whether a PR with the relevant code and an example of training Hugging Face's Llama model would be welcome. I think this addition would benefit many in the community. By the way, during my testing I found that the HF Llama model demonstrates TPS competitive with the model implemented in TorchTitan.
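To sketch the "online sharded" loading idea (this is an illustrative toy, not our actual code): HF sharded checkpoints ship a `model.safetensors.index.json` whose `weight_map` maps each parameter name to the shard file containing it, so a loader can open only the shards holding parameters owned by the local rank instead of materializing the full checkpoint.

```python
import json
from collections import defaultdict

def shards_for_params(index_json: str, wanted: set) -> dict:
    """Group the requested parameter names by the shard file holding them.

    Hypothetical helper: given the text of model.safetensors.index.json and
    the set of parameter names this rank owns, return {shard_file: [names]}
    so only those shard files need to be opened.
    """
    weight_map = json.loads(index_json)["weight_map"]
    by_shard = defaultdict(list)
    for name in sorted(wanted):
        by_shard[weight_map[name]].append(name)
    return dict(by_shard)

# Toy index describing a two-shard checkpoint (file names are illustrative).
index = json.dumps({"weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors",
}})

# A rank that only owns the embedding and the LM head touches both shards,
# but never loads layers it does not own.
plan = shards_for_params(index, {"lm_head.weight", "model.embed_tokens.weight"})
```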
Hi @junjzhang - I can only speak my own opinion, but generally anything that helps Titan enable RL-type training would be of significant interest. We are also opening up a new "experimental" folder with the idea of giving more contributions a home ... so that's another angle that may help your PR land. The first PR landing there currently also uses HF aspects for reference (see https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/deepseek_v3/attn_mask_utils.py).
Thus, while I don't think anyone can say an unseen PR will 100% be accepted, I can say it would definitely be of interest, and I think it would be worth the effort to post the PR so it can be reviewed/discussed/considered for inclusion.
Thanks very much for opening up the discussion!
Maybe @tianyu-l can weigh in here as well.
Thanks for replying! I'll clean up my code and make a draft PR to the experiments dir first!
Hey @junjzhang thanks for proposing! We agree this feature is good to have.
As @lessw2020 suggested, let's create a new folder hosting HF training under the experiments folder:
- load HF model weights
- showcase an example of training by "implementing the parallelism applying function", reusing `TrainSpec`
- support converting weights back to HF formats
Relevant discussions:
- Llama models with custom configurations and uploading to Hugging Face: https://github.com/pytorch/torchtitan/issues/420
- Model init with HuggingFace model: https://github.com/pytorch/torchtitan/issues/743
- Mitigation to HuggingFace Trainer: https://github.com/pytorch/torchtitan/issues/824
Maybe we can work on this project with other people who've shown interest and made offline progress. cc: @yzhangcs @neeldani @huyiwen @bkchang
I've finished features 1 and 2, and I think feature 3 can be implemented easily by reusing `PreTrainedModel`'s weight-saving logic (`save_pretrained`). I'll try to clean up the relevant code and open a PR this week. BTW, this feature will introduce extra requirements such as `transformers`. How would you expect this to be handled in the experiments dir?
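For feature 3, the core of the conversion is remapping torchtitan-style parameter names to the HF names before handing the state dict to HF serialization. A minimal sketch (the name pairs below are illustrative examples, not the full real mapping, and values stand in for tensors):

```python
def to_hf_state_dict(state_dict: dict, name_map: dict) -> dict:
    """Rename state-dict keys per name_map; unmapped keys pass through.

    Hypothetical helper: the renamed dict could then be saved in HF format
    (e.g. via PreTrainedModel.save_pretrained, not shown here).
    """
    return {name_map.get(k, k): v for k, v in state_dict.items()}

# Illustrative torchtitan-style names on the left, HF-style names on the right.
titan_sd = {"tok_embeddings.weight": "E", "layers.0.attention.wq.weight": "Q"}
mapping = {
    "tok_embeddings.weight": "model.embed_tokens.weight",
    "layers.0.attention.wq.weight": "model.layers.0.self_attn.q_proj.weight",
}
hf_sd = to_hf_state_dict(titan_sd, mapping)
```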
@lessw2020 @tianyu-l Could you review this PR https://github.com/pytorch/torchtitan/pull/919 ?
Hi @junjzhang - yes, just saw it. Thanks for the PR, will take a look today!
Thanks for the PR. I left some comments.