[Experimental Feature] Huggingface model training
Hi, as discussed in https://github.com/pytorch/torchtitan/issues/903, this PR adds support for training a llama model loaded directly from HF via `AutoModelForCausalLM`, and for loading safetensors (HF weights) in an online-sharding manner.
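At a high level, the loading path looks roughly like the sketch below (illustrative only, not the exact code in this PR; the model directory is a placeholder, and the per-rank tensor slicing that makes it "online sharding" is omitted):

```python
# Illustrative sketch: build the model on the meta device with
# AutoModelForCausalLM, then stream HF safetensors shards into it.
# `model_dir` is a placeholder; the per-rank slicing ("online sharding")
# used in this PR is omitted for brevity.
import glob

import torch
from safetensors import safe_open
from transformers import AutoConfig, AutoModelForCausalLM


def build_and_load(model_dir: str) -> torch.nn.Module:
    config = AutoConfig.from_pretrained(model_dir)
    # Build the module structure without materializing any weights.
    with torch.device("meta"):
        model = AutoModelForCausalLM.from_config(config)
    # Allocate real (uninitialized) storage, then fill it from the shards.
    model.to_empty(device="cpu")
    state_dict = model.state_dict()
    with torch.no_grad():
        for shard in sorted(glob.glob(f"{model_dir}/*.safetensors")):
            with safe_open(shard, framework="pt", device="cpu") as f:
                for name in f.keys():
                    if name in state_dict:
                        state_dict[name].copy_(f.get_tensor(name))
    # Caveat: non-persistent buffers (e.g. RoPE frequencies) are not stored in
    # the checkpoint and would need re-initialization; omitted here.
    return model
```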
- Test loading safetensors: `pytest test_loading_hf_weights.py`. Here are my results:
- Training: `LOG_RANK=7 bash run_train.sh` (FSDP 2 - PP 2 - TP 2). Here are my results:
Hi @junjzhang!
Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
This PR is super interesting to me! Instead of supporting a ton of individual models, a quick way to enable TorchTitan is to use models loaded from Huggingface and apply the various optimization techniques.
Yes! The weight-loading function in this PR is general enough; the only cost of adapting a new model is implementing a function that applies parallelism, and possibly patching some forward functions to support PP.
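For example, a minimal sketch of such a parallelism-applying function for the HF llama layout could look like the following (illustrative only; submodule names are assumed from `LlamaForCausalLM`, and real code must also handle attention-head sharding, embeddings, the LM head, and composition with FSDP/PP):

```python
# Sketch only: apply a tensor-parallel plan to the HF LlamaForCausalLM
# decoder layers using the public torch.distributed.tensor.parallel API.
# Submodule names are assumed from the HF llama implementation.
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


def apply_tp(model, tp_mesh):
    layer_plan = {
        "self_attn.q_proj": ColwiseParallel(),
        "self_attn.k_proj": ColwiseParallel(),
        "self_attn.v_proj": ColwiseParallel(),
        "self_attn.o_proj": RowwiseParallel(),
        "mlp.gate_proj": ColwiseParallel(),
        "mlp.up_proj": ColwiseParallel(),
        "mlp.down_proj": RowwiseParallel(),
    }
    for layer in model.model.layers:
        parallelize_module(layer, tp_mesh, layer_plan)
    return model
```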
I see you are almost creating a copy of the entire torchtitan under this folder. Instead, could you reuse existing files and functions (via import) as much as possible? E.g. I can tell that `hf_weights_utils.py`, the `parallelize`/`pipeline` fns, the toml config, and the test files cannot be directly reused; for the other parts, can we reuse them, including `train.py`? Even for the `parallelize`/`pipeline` fns, we can have standalone files, but we should depend on functions in torchtitan llama, e.g. https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama/parallelize_llama.py, as much as possible.
Reusing code is something I've always kept in mind, and I've tried my best.
- As for `train.py`: the original torchtitan version writes everything in one main function, so it's hard to reuse since I need to change some lines inside that function.
- As for the `parallelize`/`pipeline` fns: it's also hard to reuse the original code for the same reason; I have to change the parallel plan, but it is hard-coded in titan's function.
- As for the Dataset: for the same reason, I have to return extra `position_ids` for HF's llama (see the sketch after this list). Maybe I could reuse more of the dataset code via monkey patching.
- As for the loss: same reason.

I think it would be more graceful if titan could refactor these functions and extract some common patterns; then I could reuse as much code as possible. Anyway, I'll take a look and see whether I can reuse more code here.
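For the `position_ids` point, a hypothetical wrapper (not this PR's code; the class name and the assumed `(input_ids, labels)` sample layout are illustrative) could look like:

```python
# Hypothetical wrapper: attach the position_ids that HF's llama forward
# expects to each sample of an existing dataset. Assumes the base dataset
# yields (input_ids, labels) pairs of shape [seq_len].
import torch
from torch.utils.data import Dataset


class WithPositionIds(Dataset):
    def __init__(self, base: Dataset):
        self.base = base

    def __len__(self) -> int:
        return len(self.base)

    def __getitem__(self, idx):
        input_ids, labels = self.base[idx]
        position_ids = torch.arange(input_ids.shape[-1], dtype=torch.long)
        return {
            "input_ids": input_ids,
            "labels": labels,
            "position_ids": position_ids,
        }
```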
@tianyu-l Hi, as stated before, I've pushed a new commit that tries to reuse Titan's code in `dataset.py` and `parallelize_llama.py`. If further reuse is required, I suppose these Titan functions should be refactored to take more arguments or be decomposed into multiple reusable methods.
I also ran training (FSDP2 PP2 TP2) to verify the correctness of this commit:
@tianyu-l Hi, I've updated the readme in https://github.com/pytorch/torchtitan/pull/919/commits/91838de487f807bacf4a38fb211d1a8133016fca and replied to all your comments. Please have a look.
Can I ask a general question? What is the motivation for this PR? It definitely makes sense to load pretrained llama weights into torchtitan to train with its parallelisms.
However, in addition to loading the huggingface weights, this PR also uses the huggingface model definition. The huggingface model definition requires patching (of forward) and modifications to the torchtitan parallelization code. I have a couple of specific questions:
- is there a reason you prefer to use the hf llama3 model definition code instead of the torchtitan llama3 code? (If we could load hf weights into torchtitan llama3 would that be just as good?)
- since you are using hf model code and hf model weights, why is there a need for customized save/load state dict features?
Thanks!
@wconstab PP issue fixed and tested with https://github.com/pytorch/torchtitan/pull/919/commits/accfa1f31834372323bbfd14104753ead8905a8d#diff-f3ae151a2c757861e79894d650697214e593188396134823440371457ed71ed3.
@tianyu-l CP tested with FSDP2 TP2 CP2.
Is there a plan to deduplicate the code from the main TorchTitan? What's the motivation for duplicating `main.py` or `train()`? Is it because of `state_dict` loading? If so, we can discuss how to make the checkpointer support this feature.
As mentioned before, I need to revise some lines of `main()` to use `AutoModelForCausalLM`, make the model compilable, adapt the PP input, load the state dict, etc. Since main is a very large function, I cannot reuse it. I think a better approach is to decompose the main function into separate common methods, like `build_pp_models` or `run_forward_and_backward` (a rough sketch follows below). Refactoring is needed here to make it general enough to reuse. I'd like to discuss with you how to refactor titan's methods so the HF model training feature can be added with minimal code.
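To make this concrete, here is a rough sketch of the kind of decomposition I mean (hook names like `build_model` and `forward_backward` are illustrative, not torchtitan APIs; the loss and batch layout are assumed):

```python
# Illustrative decomposition only: a trainer whose model construction and
# forward/backward step are separate, overridable hooks, so an HF-model
# variant can swap in AutoModelForCausalLM without copying the whole loop.
import torch
import torch.nn.functional as F


class BaseTrainer:
    def build_model(self) -> torch.nn.Module:
        # torchtitan llama here; an HF subclass would return
        # AutoModelForCausalLM.from_config(...) instead.
        raise NotImplementedError

    def forward_backward(self, model, batch):
        # Assumes the model returns logits for [batch, seq_len] token inputs;
        # an HF subclass would read `output.logits` and pass position_ids.
        inputs, labels = batch
        logits = model(inputs)
        loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
        loss.backward()
        return loss

    def train_loop(self, model, dataloader, optimizer, steps: int):
        for _, batch in zip(range(steps), dataloader):
            optimizer.zero_grad()
            loss = self.forward_backward(model, batch)
            optimizer.step()
```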
@junjzhang Sounds good. I'm refactoring `train.py`. Let's discuss how we can make it more general. I don't expect that HF models or other models can just adopt the original `train.py`, even after the refactor, but I hope we can at least reuse as much as we can. My next step is to land the metrics refactor PR, https://github.com/pytorch/torchtitan/pull/945, and publish the next refactor PR on Mar 10.
Hi! Any updates?
@junjzhang Are you still interested in rebasing the PR and resolving the feedback to get it merged?
We've done some refactoring of `train.py`, and https://github.com/pytorch/torchtitan/pull/1238 does a bit more, including setting up a `forward_backward_step()` function.
will have a look later
Interested in this support, @junjzhang are you still working on this PR?
I encountered several issues in a scaled production environment; I may clean up my code and PR later.
@junjzhang Sorry to bother you. Have you followed this approach in your production setup? I would like to use something similar on my team to iterate quickly over models, but I'm worried about a possible performance penalty.
Any tips or pointers would be welcome, thank you.
@tomiock @aw632 @JaheimLee
FYI we have some ongoing work to support this in general. cc @3outeille to see if you could share progress.
Hello, I have a draft PR here: https://github.com/huggingface/torchtitan/pull/1
For now, I have made sure that llama3 and deepseek_v3 from HF are almost equivalent to their torchtitan counterparts in terms of loss and grad_norm (without parallelism yet). I am now making sure that they converge with parallelism involved.
Once this is done, I think we can have a first PR merged. Then I will make sure to handle more HF models out of the box in a follow-up PR.
- Llama3 (red TT vs green HF)
- Deepseek V3 (red TT vs green HF)
Small update: I now have 4D parallelism working. However, whenever pipeline parallelism is combined with other parallelisms, the loss does not exactly match the torchtitan counterpart (though it still decreases properly over time). I'm looking into why this happens.
some updates here: https://github.com/huggingface/torchtitan/pull/1#issue-3389772013
Referencing: https://github.com/pytorch/torchtitan/pull/2048