Fix prompt tuning + DeepSpeed ZeRO-3 + checkpoint saving hang issue
What does this PR do?
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Hi @younesbelkada. When I use DeepSpeed ZeRO-3 and prompt tuning to finetune a large model, I find that training hangs after saving a checkpoint. Prompt tuning runs a prompt_encoder forward inside the model's save_pretrained, and having only rank 0 call that forward is not enough under DeepSpeed ZeRO-3: all ranks must call the prompt_encoder forward.
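For illustration, here is a minimal toy sketch of that failure mode (plain torch.distributed code, not the actual Trainer/PEFT/DeepSpeed internals; it assumes a process group has already been initialized). Under ZeRO-3 the prompt_encoder forward all-gathers the partitioned weights, and an all-gather deadlocks if only rank 0 reaches it:

```python
import torch
import torch.distributed as dist

def gather_and_save(weight_shard: torch.Tensor, path: str = "prompt_encoder.pt"):
    # all_gather is a collective: it blocks until every rank in the group calls it
    gathered = [torch.empty_like(weight_shard) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, weight_shard)
    if dist.get_rank() == 0:
        torch.save(torch.cat(gathered), path)  # only rank 0 touches the disk

# Buggy pattern (what a rank-0-only save effectively does): rank 0 enters the
# collective alone and waits forever for the other ranks, so training hangs.
#   if dist.get_rank() == 0:
#       gather_and_save(local_shard)
#
# Correct pattern: every rank enters the collective, only rank 0 writes.
#   gather_and_save(local_shard)
```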
cc @pacman100 for deepspeed!
Yes, I think should_save should be moved into _save instead of controlling whether _save is called, similar to _save_tpu. WDYT?
Any thoughts on my proposal? I think should_save should be moved into _save instead of controlling whether _save is called, like _save_tpu does. @pacman100
@pacman100, @ArthurZucker, what's your suggestion for moving forward on this issue? Any comments on Yi's proposal?
How about this? @pacman100
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
cc @muellerzr if you can take this review over! 🤗
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the review! The root cause of this issue is that prompt tuning runs a prompt_encoder forward inside the model's save_pretrained. Having only rank 0 call the prompt_encoder forward is not enough under DeepSpeed ZeRO-3: all ranks must call it, otherwise training hangs. So save_pretrained in _save needs to be called on every rank. However, self.args.should_save is only true when local_process_index == 0, so I moved the check into _save.
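A rough sketch of what that change amounts to (simplified pseudo-Trainer code, not the exact PR diff; self.model, self.args, and self.tokenizer are assumed to be set as in the real Trainer): _save is entered on every rank, and should_save only decides which rank writes files, mirroring the is_main_process argument of save_pretrained.

```python
class TrainerSketch:
    """Simplified sketch of the proposed flow (not the exact PR diff)."""

    def save_model(self, output_dir=None):
        # Every rank calls _save so collective ops inside save_pretrained
        # (e.g. the prompt_encoder forward under ZeRO-3) can complete.
        self._save(output_dir)

    def _save(self, output_dir):
        # All ranks enter save_pretrained; is_main_process controls which
        # rank actually serializes the weights to disk.
        self.model.save_pretrained(output_dir, is_main_process=self.args.should_save)
        if self.args.should_save:
            # Rank-0-only artifacts such as the tokenizer.
            if self.tokenizer is not None:
                self.tokenizer.save_pretrained(output_dir)
```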
Again though: this is DeepSpeed specific, based on model sharding/splitting, so we should only modify the code for DeepSpeed specifically unless there is a reason not to (i.e., DeepSpeed and FSDP specific, I imagine).
Am I incorrect in this understanding?
Yes, this is DeepSpeed ZeRO-3 and FSDP specific.
Great, let's adjust the logic for that case specifically then, and leave the already existing base logic intact.
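A hedged sketch of what "adjust the logic for that case specifically" could look like (attribute names such as self.is_deepspeed_enabled and self.is_fsdp_enabled follow recent Trainer versions but are assumptions here, not the merged diff):

```python
class TrainerSketch:
    """Hedged sketch: only the sharded backends force every rank into _save."""

    def save_model(self, output_dir=None):
        # Assumed flags; the existing rank-0-only gating is kept for the plain case.
        all_ranks_must_enter_save = self.is_deepspeed_enabled or self.is_fsdp_enabled
        if self.args.should_save or all_ranks_must_enter_save:
            # Under DeepSpeed ZeRO-3 / FSDP every rank reaches the prompt_encoder
            # forward inside save_pretrained; _save then uses should_save
            # internally to decide which rank writes files.
            self._save(output_dir)
```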
@muellerzr I have updated the PR, could you revisit it?
Thanks for adding this, but needing this fix indicates that there's likely something wrong in how we control our saving logic more generally.
Having lots of should_save checks within the _save method is odd: the _save method should only be called if should_save is true, or, at worst, the _save method should exit right at the start of the call. We shouldn't have to condition the whole method's logic on this flag.
The main objective is that save_pretrained inside _save should be called on every rank in the FSDP and DeepSpeed + prompt tuning case, so the _save method cannot exit right at the start of the call or be gated by should_save.
@amyeroberts I just fixed the comment.
@sywangyi Thanks for the explanation. I understand the intended logic. My previous comment still stands: we shouldn't need to condition so much of the _save logic on the should_save flag. It's OK to pass in is_main_process, but it doesn't really make sense for saving e.g. the tokenizer. Instead, I would look at:
- How we define the main process - we might need a different flag, e.g. self._is_main_process
- Updating should_save to be true for all ranks in the FSDP, DeepSpeed + prompt tuning case

cc @muellerzr Does this seem reasonable?
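To illustrate the second bullet, a sketch of how should_save could be made true on every rank in that case (the real TrainingArguments.should_save property also covers TPU and other branches; the _all_ranks_participate_in_save flag below is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TrainingArgumentsSketch:
    save_on_each_node: bool = False
    local_process_index: int = 0
    process_index: int = 0
    # Hypothetical flag, set when FSDP / DeepSpeed ZeRO-3 + prompt tuning
    # requires every rank to enter the save path.
    _all_ranks_participate_in_save: bool = False

    @property
    def should_save(self) -> bool:
        if self._all_ranks_participate_in_save:
            return True
        if self.save_on_each_node:
            return self.local_process_index == 0
        return self.process_index == 0
```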
_save_tpu has similar logic: it uses the should_save flag internally.
Also, in save_pretrained (https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L2503 or https://github.com/huggingface/peft/blob/main/src/peft/peft_model.py#L294), is_main_process is set to should_save, and should_save is controlled by https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L2322, which is not the same as process_index == 0. The _save_tpu logic is what I would like to imitate: the calling logic is https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3420, and the function logic is like https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3460, where should_save is handled inside the save function itself.
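For reference, an abridged sketch of the _save_tpu pattern being pointed to (written from memory, not a verbatim copy of the linked trainer.py lines, and the non-TPU branches are omitted): the TPU save path is entered on every rank, and should_save / is_main_process only gate what gets written.

```python
from transformers.utils import is_torch_tpu_available  # deprecated in newer releases in favor of is_torch_xla_available
import torch_xla.core.xla_model as xm  # only importable in a TPU/XLA environment

class TrainerSketch:
    """Abridged sketch of the TPU save path (not a verbatim copy of trainer.py)."""

    def save_model(self, output_dir=None):
        if is_torch_tpu_available():
            self._save_tpu(output_dir)      # entered on all ranks
        elif self.args.should_save:
            self._save(output_dir)          # non-TPU path: gated outside today

    def _save_tpu(self, output_dir):
        xm.rendezvous("saving_checkpoint")  # synchronize all ranks before saving
        self.model.save_pretrained(
            output_dir,
            is_main_process=self.args.should_save,  # only the main process writes
            save_function=xm.save,
        )
```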
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@amyeroberts, do you have any further comments on this? Thanks.
@yao-matrix Beyond my comment here and @muellerzr's comment here, I don't have anything else to add, but I'm happy to clarify my point in any way or help iterate on a solution.