
fix prompt tuning + deepspeed zero3 + checkpoint saving hang issue

Open sywangyi opened this issue 1 year ago • 6 comments

What does this PR do?

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

sywangyi avatar Apr 01 '24 10:04 sywangyi

Hi @younesbelkada. When I use DeepSpeed ZeRO-3 and prompt tuning to fine-tune a large model, I find that training hangs after saving a checkpoint. Prompt tuning runs a prompt_encoder forward inside the model's save_pretrained, and under DeepSpeed ZeRO-3 it is not enough for only rank 0 to call that forward: all ranks need to call the prompt_encoder forward.
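For illustration, a minimal toy sketch of the failure mode (this is not the actual Trainer code; is_main_process is the existing save_pretrained argument, the rest is only for the example):

```python
# Toy illustration of the hang: under DeepSpeed ZeRO-3 every forward pass, including
# the prompt_encoder forward that PEFT runs inside save_pretrained, is a collective
# operation that gathers sharded parameters, so all ranks must reach it together.
import torch.distributed as dist


def save_hangs(model, output_dir: str) -> None:
    # Deadlocks: rank 0 enters the prompt_encoder forward and waits for the other
    # ranks to join the parameter-gathering collective, but they never call it.
    if dist.get_rank() == 0:
        model.save_pretrained(output_dir)


def save_ok(model, output_dir: str) -> None:
    # Every rank enters save_pretrained (and therefore the prompt_encoder forward);
    # only the main process actually writes files to disk.
    model.save_pretrained(output_dir, is_main_process=dist.get_rank() == 0)
```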

sywangyi avatar Apr 01 '24 10:04 sywangyi

cc @pacman100 for deepspeed!

ArthurZucker avatar Apr 02 '24 08:04 ArthurZucker

Yes, I think the should_save check should be moved into _save, instead of controlling whether _save is called, similar to what _save_tpu does. WDYT?
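To make the proposal concrete, a rough sketch of the intended control flow, written as free functions for readability (illustrative only, not the actual patch):

```python
# Rough sketch of the proposed flow: _save is entered on every rank so the
# collectives inside save_pretrained can complete, and args.should_save only gates
# the pieces that actually touch the filesystem.
import os

import torch


def save_model(trainer, output_dir: str) -> None:
    trainer._save(output_dir)  # called on all ranks, as _save_tpu already is


def _save(trainer, output_dir: str) -> None:
    os.makedirs(output_dir, exist_ok=True)
    # Every rank runs save_pretrained; only the rank with should_save writes files.
    trainer.model.save_pretrained(output_dir, is_main_process=trainer.args.should_save)
    if trainer.args.should_save:
        torch.save(trainer.args, os.path.join(output_dir, "training_args.bin"))
```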

sywangyi avatar Apr 02 '24 11:04 sywangyi

Any thoughts on my proposal? I think the should_save check should be moved into _save instead of controlling whether _save is called, similar to _save_tpu. @pacman100

sywangyi avatar Apr 15 '24 12:04 sywangyi

@pacman100, @ArthurZucker, what do you suggest to move this forward? Any comments on Yi's proposal?

yao-matrix avatar Apr 18 '24 06:04 yao-matrix

how about this? @pacman100

sywangyi avatar May 07 '24 03:05 sywangyi

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 31 '24 08:05 github-actions[bot]

cc @muellerzr if you can take this review over! 🤗

ArthurZucker avatar Jun 05 '24 13:06 ArthurZucker

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Prompt tuning runs a prompt_encoder forward inside the model's save_pretrained; under DeepSpeed ZeRO-3 it is not enough for only rank 0 to call that forward, all ranks need to call it.

Thanks for the review! The root cause of this issue is that prompt tuning runs a prompt_encoder forward inside the model's save_pretrained. Under DeepSpeed ZeRO-3 it is not enough for only rank 0 to call that forward; all ranks must call it, otherwise training hangs. So save_pretrained in _save needs to be called on every rank. However, self.args.should_save is only true when local_process_index == 0, which is why I moved the check into _save.

sywangyi avatar Jun 06 '24 05:06 sywangyi

Again though: this is deepspeed specific based on model sharding/splitting, so we should only modify the code for deepspeed specifically, unless there is a reason not to. (Aka, deepspeed and fsdp specific I imagine).

Am I incorrect in this understanding?

muellerzr avatar Jun 06 '24 13:06 muellerzr

Again though: this is deepspeed specific based on model sharding/splitting, so we should only modify the code for deepspeed specifically, unless there is a reason not to. (Aka, deepspeed and fsdp specific I imagine).

Am I incorrect in this understanding?

Yes, this is specific to DeepSpeed ZeRO-3 and FSDP.
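If it helps, here is a sketch of how the gating could look (the is_deepspeed_zero3_enabled import and the trainer/model attributes reflect my assumptions about the current APIs, not the exact code in the PR):

```python
# Hypothetical predicate, not the exact code in the PR: only force every rank into
# _save when sharded training (DeepSpeed ZeRO-3 or FSDP) is combined with a PEFT
# model, i.e. the case where save_pretrained runs a prompt_encoder forward.
from transformers.integrations import is_deepspeed_zero3_enabled


def every_rank_must_call_save(trainer) -> bool:
    uses_sharding = is_deepspeed_zero3_enabled() or getattr(trainer, "is_fsdp_enabled", False)
    uses_peft = getattr(trainer.model, "peft_config", None) is not None
    return uses_sharding and uses_peft
```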

sywangyi avatar Jun 07 '24 10:06 sywangyi

Great, let's adjust the logic for that case specifically and leave the existing base logic intact.

muellerzr avatar Jun 07 '24 11:06 muellerzr

@muellerzr I have updated the PR, could you revisit it?

sywangyi avatar Jun 13 '24 14:06 sywangyi

Thanks for adding this, but having this fix indicates that there's likely something wrong in how we control our saving logic more generally.

Having lots of should_save checks inside the _save method is odd: _save should only be called if should_save is true, or, at worst, _save should return early at the start of the call. We shouldn't have to condition the whole method's logic on this flag.

The main objective is that save_pretrained in _save must be called on every rank in the FSDP / DeepSpeed + prompt tuning case, so the _save method cannot be exited early at the start of the call or gated entirely by should_save. @amyeroberts I have addressed the comment.

sywangyi avatar Jun 14 '24 05:06 sywangyi

@sywangyi Thanks for the explanation. I understand the intended logic. My previous comment still stands: we shouldn't need to condition so much of the _save logic on the should_save flag; it's OK to pass in is_main_process, but it doesn't really make sense for e.g. saving the tokenizer. Instead, I would look at:

  • How we define the main process - we might need a different flag e.g. self._is_main_process
  • Updating should_save to be true for all ranks in the FSDP / DeepSpeed + prompt tuning case (a rough sketch of this option follows below)
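A rough sketch of that second option (the class and attribute names here are hypothetical, only meant to show the shape of the change):

```python
class SketchTrainingArgs:
    """Hypothetical stand-in for TrainingArguments; names are illustrative only."""

    def __init__(self, default_should_save: bool, all_ranks_participate_in_save: bool):
        # default_should_save: the existing single-rank rule.
        # all_ranks_participate_in_save: true only for ZeRO-3/FSDP + prompt tuning.
        self._default_should_save = default_should_save
        self._all_ranks_participate_in_save = all_ranks_participate_in_save

    @property
    def should_save(self) -> bool:
        # Widen saving to every rank only for the sharded + prompt-tuning case;
        # otherwise keep the existing single-rank behaviour untouched.
        return self._all_ranks_participate_in_save or self._default_should_save
```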

cc @muellerzr Does this seem reasonable?

_save_tpu has similar logic: it uses the should_save flag internally.

sywangyi avatar Jul 12 '24 06:07 sywangyi

Also, in save_pretrained (https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L2503 or https://github.com/huggingface/peft/blob/main/src/peft/peft_model.py#L294), is_main_process is set to should_save, and should_save is controlled by https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L2322, which is not the same as process_index == 0. The _save_tpu logic is what I would like to imitate: the calling logic is at https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3420 and the function logic is at https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3460, where should_save is handled inside the save function itself.
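For context, the linked should_save property boils down to roughly the following (paraphrased, so treat the exact conditions as an approximation of the linked code):

```python
def should_save(save_on_each_node: bool, process_index: int, local_process_index: int) -> bool:
    # Approximate paraphrase of the linked TrainingArguments.should_save property:
    # which rank writes depends on save_on_each_node, so "should save" is not
    # simply process_index == 0.
    if save_on_each_node:
        return local_process_index == 0
    return process_index == 0
```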

sywangyi avatar Jul 23 '24 06:07 sywangyi

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 25 '24 08:08 github-actions[bot]

@amyeroberts, do you have any further comments on this? Thanks.

yao-matrix avatar Sep 09 '24 00:09 yao-matrix

@yao-matrix Beyond my comment here and @muellerzr's comment here, I don't have anything else to add, but I'm happy to clarify my point or help iterate on a solution.

amyeroberts avatar Sep 09 '24 18:09 amyeroberts