
Optimizer state is not synchronized across replicas like model state is

Open timgianitsos opened this pull request 2 years ago • 8 comments

From the DistributedDataParallel docs: "The module... assumes that [gradients] will be modified by the optimizer in all processes in the same way." Note that this is "assumed", not enforced.

From https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html: "each process keeps a dedicated replica of the optimizer. Since DDP has already synchronized gradients in the backward pass, all optimizer replicas will operate on the same parameter and gradient values in every iteration, and this is how DDP keeps model replicas in the same state"
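For concreteness, here is a minimal sketch of the pattern those docs describe (illustrative only; the model, data, backend, and hyperparameters are placeholders, not taken from either document). Every process builds its own model replica and its own optimizer instance; the only cross-process coupling is the gradient all-reduce that DDP performs during backward():

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    torch.manual_seed(0)                 # identical initial weights on every rank
    model = nn.Linear(8, 1)
    ddp = DDP(model)
    # Each process constructs its *own* optimizer replica; DDP never touches it.
    opt = torch.optim.RMSprop(ddp.parameters(), lr=1e-3)

    torch.manual_seed(rank)              # each rank feeds its own data shard
    x, y = torch.randn(4, 8), torch.randn(4, 1)
    loss = nn.functional.mse_loss(ddp(x), y)
    loss.backward()                      # DDP all-reduces (averages) gradients here
    opt.step()                           # same averaged grads + same optimizer state
                                         # => same parameter update on every rank
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

Note that nothing in this pattern is verified or enforced by DDP itself; the optimizer replicas stay in lockstep only as long as the user constructs and steps them identically.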

Checklist

  • [x] No unnecessary issues are included in this pull request.

timgianitsos avatar Jun 06 '23 05:06 timgianitsos

Hi @timgianitsos!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Jun 06 '23 05:06 facebook-github-bot

Deploy Preview for pytorch-tutorials-preview ready!

Latest commit: 9082d4b6f1a427b0338d1ad8a2d6e88cf4bbadb3
Latest deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/647ec92e0fd6490008055080
Deploy Preview: https://deploy-preview-2433--pytorch-tutorials-preview.netlify.app/intermediate/fsdp_tutorial

netlify[bot] avatar Jun 06 '23 05:06 netlify[bot]

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

facebook-github-bot avatar Jun 06 '23 06:06 facebook-github-bot

Can you please reference an issue you are fixing in the PR description?

svekars avatar Jun 06 '23 15:06 svekars

It's not technically an issue fix since it is a wording change.

If you'd like, I can file an issue just so that this PR can fix it. Let me know.


timgianitsos avatar Jun 06 '23 16:06 timgianitsos

Thanks! I read you as drawing a distinction between optimizer state synchronization (a) at initialization versus (b) after each step, arguing "yes" for the former and "no" for the latter. I am arguing that there is no optimizer synchronization in either case.

If the optimizers happen to be initialized identically across processes, as you say (which I concede will often be the case), that is only because of a decision the user made (or the user's oversight, since this is the default behavior) and NOT because DDP has any effect on the synchronization. That is, if you create an optimizer as usual inside the function passed to torch.multiprocessing.spawn, e.g. opt = RMSprop(ddp.parameters(), **opt_kwargs), the replicas will be identical. The user could just as easily make them different with something like if rank == x: <do something different>.

The way the docs were worded before my edit seemed to imply that the optimizers are initialized identically because DDP enforces this. But if so, I don't understand the mechanism: the only dependency between DDP and the optimizers is the passing of ddp.parameters(), a generator that simply yields each parameter as an nn.Parameter. That carries no information the optimizer could use to tell whether the model is being distributed. From this I conclude that the optimizers on different processes are NOT synced with each other, neither at initialization nor after each step.
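To make this concrete, here is a hypothetical continuation of the sketch shown under the PR description above (rank, ddp, and the RMSprop hyperparameters are placeholders carried over from that sketch, not from any tutorial). DDP hands the optimizer nothing but ddp.parameters(), so a rank-conditional edit is accepted without complaint and the replicas drift apart on the next step:

```python
# Hypothetical continuation of the earlier worker() sketch: `rank` and `ddp`
# come from that example. DDP only ever sees ddp.parameters(), so nothing
# stops one process from configuring its optimizer differently.
opt = torch.optim.RMSprop(ddp.parameters(), lr=1e-3)
if rank == 0:
    for group in opt.param_groups:
        group["lr"] = 1e-1   # silently accepted; rank 0 drifts after every opt.step()
```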

timgianitsos avatar Jun 06 '23 18:06 timgianitsos

I agree with you. DDP does not explicitly do anything to enforce synchronization of the optimizers; they end up identical only because the same state is constructed in each process. A user could introduce changes to the optimizer in one process and DDP would silently continue (although I haven't encountered this atypical situation myself).

Approving from my end, and ccing PyTorch distributed wizards @mrshenli @rohan-varma in case they have an opinion on this

subramen avatar Jun 14 '23 17:06 subramen

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2433

Note: Links to docs will display an error until the docs builds have been completed.

:x: 1 New Failure

As of commit 9beb1d1b8171a45bdfda3ef323a06c3969126ea9:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Aug 14 '23 17:08 pytorch-bot[bot]

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions[bot] avatar Sep 26 '24 00:09 github-actions[bot]

This documentation mistake has lingered for a year: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html

Is there anything else that needs to be addressed before accepting my revision?

timgianitsos avatar Sep 26 '24 15:09 timgianitsos