
Allow GradientAccumulationPlugin to be configured from AcceleratorConfig

Open · fabianlim opened this pull request 1 year ago • 11 comments

What does this PR do?

Fixes #29425. Also please refer to the accompanying PR https://github.com/huggingface/accelerate/pull/2531, which implements an extra control, sync_each_batch, for GradientAccumulationPlugin. Before these changes, GradientAccumulationPlugin was configured by Trainer with a fixed set of hardcoded values. This PR allows the user to set sync_each_batch if memory issues are encountered when using FSDP with no_sync.
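For context, here is a minimal sketch of how a user might opt in from the Trainer side once this PR and the Accelerate PR land. The dict form of accelerator_config and the gradient_accumulation_kwargs key are assumptions made for illustration, not a guaranteed final API.

```python
# Hypothetical usage sketch; field names are assumptions, not the merged API.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    fsdp="full_shard auto_wrap",
    accelerator_config={
        # forwarded by Trainer to accelerate's GradientAccumulationPlugin
        "gradient_accumulation_kwargs": {"sync_each_batch": True},
    },
)
```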

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [x] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@muellerzr

fabianlim avatar Mar 11 '24 14:03 fabianlim

Hello @fabianlim, thank you for the Accelerate PR and for this one, which reduces memory usage with FSDP by forcing gradient synchronization at each step. An overall comment: what if we change the default in Accelerate to not enable no_sync when preparing the FSDP model? That way the user would not have to pass an extra argument.

@pacman100 thanks for the message. I considered this, but decided not to, as I did some experiments with smaller models like llama7b and saw that turning the flag on or off made very little difference in terms of peak memory consumption. This is probably because memory was activation-dominated.

Thus I concluded that this is use-case dependent, and not a good idea to set to True by default. It would not be good for the llama7b user to now have to explicitly set sync_each_batch=False to improve throughput (and it's probably more insidious, because the user wouldn't be aware that it's turned on, and that training could actually be faster, with no impact on memory, if it were turned off).
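For reference, a hedged sketch of what the new control looks like on the Accelerate side (per the accompanying PR https://github.com/huggingface/accelerate/pull/2531); the exact signature is defined over there, so treat this as illustrative only.

```python
from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

# sync_each_batch=True trades the no_sync optimization for lower peak memory:
# gradients are reduced on every micro-batch instead of only on the final one.
plugin = GradientAccumulationPlugin(num_steps=8, sync_each_batch=True)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)
```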

fabianlim avatar Mar 12 '24 05:03 fabianlim

> I considered this, but decided not to, as I did some experiments with smaller models like llama7b and saw that turning the flag on or off made very little difference in terms of peak memory consumption. This is probably because memory was activation-dominated.
>
> Thus I concluded that this is use-case dependent, and not a good idea to set to True by default. It would not be good for the llama7b user to now have to explicitly set sync_each_batch=False to improve throughput (and it's probably more insidious, because the user wouldn't be aware that it's turned on, and that training could actually be faster, with no impact on memory, if it were turned off).

Got it, makes sense, and thank you for all the details. Then maybe more documentation in the Accelerate and Transformers docs on when this flag makes a difference would help. I think the info about the negligible peak-memory change for the 7B model should be added to the Accelerate docs.

pacman100 avatar Mar 12 '24 05:03 pacman100

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Thanks! We're getting closer. Also, no need to force-push your commits, we squash the commit history (and force-push makes it harder for us to track things)

Got it. Yes, I mostly have a force-push habit to keep from bloating my commits. However, I did it once here to rebase, because I noticed the tests_torch jobs were failing due to daily PR merges.

After my latest changes the tests are failing again, but I shall refrain from rebasing for now.

fabianlim avatar Mar 13 '24 02:03 fabianlim

> Got it, makes sense, and thank you for all the details. Then maybe more documentation in the Accelerate and Transformers docs on when this flag makes a difference would help. I think the info about the negligible peak-memory change for the 7B model should be added to the Accelerate docs.

@pacman100 how about this addition to the Accelerate docs? I decided to make the example about llama13b instead of 7b. The PR has already been merged, so this will have to be another PR.

fabianlim avatar Mar 13 '24 06:03 fabianlim

(You may also need to rebase from main for the test failures)

muellerzr avatar Mar 13 '24 20:03 muellerzr

~> (You may also need to rebase from main for the test failures)~

~@muellerzr should I rebase now, or wait until the end when we've resolved most of the changes? If I rebase I will need to force-push, which makes things harder to track.~ I pulled main's changes in, but probably one more pull is needed after https://github.com/huggingface/transformers/pull/29647 is merged.

I agree that conditionals are not preferred in tests, and there are other ways, like using parameterized, but we would need to rework the self.assert lines to check conditionally (e.g., if grad_accum_kwargs is not specified, etc.). We would also need some conditionals on how to populate the dicts going into the @parameterized decorator. So that approach comes with its own complexities as well.

Yes, conditionals are generally not best practice, but in this case the usage is quite minor and I feel it does not affect readability very much.

I have incorporated your suggestions to follow the FSDP style of require_fsdp. In an attempt to be more consistent, I have also moved GRAD_ACCUM_KWARGS_VERSION_AVAILABLE to the top of the file and added comments. Note there is one more use of require_accelerate in the same test_trainer.py file that I cannot replace, because doing so would change the logic.
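To illustrate the require_fsdp-style gate described above, here is a rough sketch. Only GRAD_ACCUM_KWARGS_VERSION_AVAILABLE comes from the discussion; the decorator name and the version threshold are assumptions made for the example.

```python
import unittest

from packaging import version

import accelerate

# Module-level flag, mirroring the constant moved to the top of test_trainer.py;
# the 0.28.0 threshold is an assumption for illustration.
GRAD_ACCUM_KWARGS_VERSION_AVAILABLE = version.parse(accelerate.__version__) >= version.parse("0.28.0")


def require_accelerate_grad_accum_kwargs(test_case):
    """Skip the test unless accelerate supports sync_each_batch (hypothetical helper)."""
    return unittest.skipUnless(
        GRAD_ACCUM_KWARGS_VERSION_AVAILABLE,
        "test requires an accelerate version with sync_each_batch support",
    )(test_case)
```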

fabianlim avatar Mar 13 '24 23:03 fabianlim

@fabianlim you'll need to run make style; make quality; to fix the style tests

muellerzr avatar Mar 14 '24 11:03 muellerzr

It's been a pleasure @muellerzr! I have just pulled the latest changes from main!

fabianlim avatar Mar 14 '24 12:03 fabianlim

Hi @amyeroberts looking forward to your review! If there is anything I can address please feel free to let me know. FYI: @muellerzr

fabianlim avatar Mar 18 '24 16:03 fabianlim

@amyeroberts I pulled main again and have updated the code to conform to @muellerzr's changes in https://github.com/huggingface/transformers/pull/29779.

fabianlim avatar Mar 22 '24 04:03 fabianlim