
Migrating to use native Pytorch AMP

Open sjrl opened this issue 2 years ago • 14 comments

Related Issue(s): Issue #1512, Issue #1222

Proposed changes: Migrate to PyTorch's native AMP (https://pytorch.org/docs/stable/notes/amp_examples.html) because it is much easier to use (no additional dependency on apex), requires fewer code changes to support, and is the recommended approach (https://discuss.pytorch.org/t/torch-cuda-amp-vs-nvidia-apex/74994).

Using PyTorch's native AMP support requires using torch.cuda.amp.autocast and torch.cuda.amp.GradScaler together. Both can easily be "turned off", so that no autocasting or scaling occurs, by passing the option enabled=False.

For example, the following code performs a standard training loop because enabled=False is passed to both autocast and GradScaler:

import torch
from torch import optim
from torch.cuda.amp import GradScaler, autocast

use_amp = False

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler(enabled=use_amp)

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast(enabled=use_amp):
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

Similarly, it is easy to turn AMP on by passing enabled=True.

There are breaking changes:

  1. The use_amp parameter of the FARMReader.train method is now of type bool instead of type str (see the example below).
  2. This PR deprecates apex; it does not try to support both PyTorch AMP and Apex AMP.
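
For illustration, a hedged sketch of what a training call could look like after this change; the model name, data paths, and epoch count below are placeholders (not part of this PR), only the bool use_amp flag is the point:

from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

# use_amp is now a plain bool (previously an apex opt-level string such as "O1").
reader.train(
    data_dir="data/squad20",      # placeholder path
    train_filename="train.json",  # placeholder file
    n_epochs=1,
    use_amp=True,
)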

Pre-flight checklist

  • [x] I have read the contributors guidelines
  • [x] If this is a code change, I added tests or updated existing ones
  • [x] If this is a code change, I updated the docstrings

TODO

  • [x] Checked that FARMReader.train works with use_amp=True (and False) on a single GPU
  • [x] Test tutorial 9 DPR training in GPU environment with use_amp turned on (and off) and with grad_acc_steps (see the gradient-accumulation sketch after this list)
  • [x] Add use_amp to trainer state dict so that, when restarting from a checkpoint, AMP is set up as expected (see the checkpoint sketch after this list)
  • [ ] Test multi-gpu training with AMP
    • For torch.nn.DistributedDataParallel it looks like the existing code should work (Link to docs)
    • For torch.nn.DataParallel it looks like we would need to "apply autocast as part of your model’s forward method to ensure it’s enabled in side threads." (Link to docs)
  • [ ] Open PR in https://github.com/deepset-ai/haystack-website editing this file explaining AMP
    • I could imagine a section on that page where you briefly describe what AMP is, when to use it and how to use it. Perhaps also give readers an idea of how big a trade off in accuracy / speed it is.
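
As referenced in the grad_acc_steps item above, here is a minimal sketch of how gradient accumulation typically combines with GradScaler, following the pattern from the PyTorch AMP examples; accumulation_steps stands in for grad_acc_steps, and model, optimizer, data, loss_fn, and use_amp are reused from the snippet in the description. This is not haystack's trainer code.

accumulation_steps = 4  # placeholder for grad_acc_steps
scaler = GradScaler(enabled=use_amp)

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast(enabled=use_amp):
            output = model(input)
            # Normalize the loss so the accumulated gradients match a full-batch update.
            loss = loss_fn(output, target) / accumulation_steps

        # Gradients accumulate in scaled form across iterations.
        scaler.scale(loss).backward()

        if (i + 1) % accumulation_steps == 0:
            # step() unscales first and skips optimizer.step() if infs/NaNs are found.
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

And for the checkpoint item, the PyTorch docs recommend saving and restoring the scaler state alongside the model and optimizer; a sketch (the file name is a placeholder):

# Save: include the scaler state so the loss scale survives a restart.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scaler": scaler.state_dict(),
}
torch.save(checkpoint, "checkpoint.pt")

# Resume: restore the scaler only if AMP is enabled again.
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
if use_amp:
    scaler.load_state_dict(checkpoint["scaler"])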

sjrl avatar Jul 15 '22 12:07 sjrl

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Jul 15 '22 12:07 CLAassistant

First of all, I support removing nvidia apex and adding pytorch amp. 👍 Doing quick research regarding this decision, it seems to be the preferred, future-proof way: https://discuss.pytorch.org/t/torch-cuda-amp-vs-nvidia-apex/74994/2 What we should do once this PR is ready to be merged is a small benchmark. And we would also need to ensure that the documentation explains when to use this feature, how to use it, and what to expect from it.

julian-risch avatar Jul 19 '22 07:07 julian-risch

And we would also need to ensure that the documentation explains when to use this feature, how to use this feature and what to expect from it.

Where should this documentation be added?

sjrl avatar Jul 19 '22 08:07 sjrl

And we would also need to ensure that the documentation explains when to use this feature, how to use this feature and what to expect from it.

Where should this documentation be added?

@brandenchan Could you please advise here? Thank you! I would say here: https://haystack.deepset.ai/guides/optimization So that would require a separate PR in https://github.com/deepset-ai/haystack-website but maybe @brandenchan can help with that.

julian-risch avatar Jul 21 '22 12:07 julian-risch

@julian-risch Do we happen to have any GPU resources in our CI testing so I could add a test to make sure the changes to use_amp work as expected?

sjrl avatar Jul 26 '22 12:07 sjrl

@julian-risch Do we happen to have any GPU resources in our CI testing so I could add a test to make sure the changes to use_amp work as expected?

Unfortunately the CI is CPU only, no. I would recommend an EC2 instance where you could checkout your branch and run the tests while using the GPU.

julian-risch avatar Jul 26 '22 12:07 julian-risch

And we would also need to ensure that the documentation explains when to use this feature, how to use this feature and what to expect from it.

Where should this documentation be added?

@brandenchan Could you please advise here? Thank you! I would say here: https://haystack.deepset.ai/guides/optimization So that would require a separate PR in https://github.com/deepset-ai/haystack-website but maybe @brandenchan can help with that.

Hi yes sure! To make changes to that specific optimisation page, you will want to open a PR that edits this file. I could imagine a section on that page where you briefly describe what AMP is, when to use it and how to use it. Perhaps also give readers an idea of how big a trade off in accuracy / speed it is.

Happy to review it once you've opened the PR! Just let me know if you want any more pointers regarding this process. The readme of the haystack-website repo describes how you can set up a preview environment on your own machine. There is also an action in the PR which will deploy your changes to a staging environment though this takes a bit longer than previewing on your own machine.

brandenchan avatar Jul 26 '22 12:07 brandenchan

Hi @sjrl have you already created a PR regarding documentation? This PR here looks good to me at first sight. It's a bit unfortunate that there are so many unrelated docstring updates and whitespace changes in there, but it's not a big problem. Before I go through the code in detail, I would like to try it out a bit, and in that context a draft PR regarding documentation would be great. Did you test it out yourself already?

julian-risch avatar Aug 04 '22 11:08 julian-risch

Hi @julian-risch, sorry about the additional doc updates. I was just making those as I was exploring the code. I'll put them into a separate PR next time.

I have had a chance to test this code out for training a FARMReader on a single GPU. I haven't tried the different combinations (e.g. multi-gpu, gradient_clipping, etc.) to make sure all features work as expected.

Also I have not had time yet to make a draft PR for documentation.

sjrl avatar Aug 04 '22 12:08 sjrl

Alright, thanks for the quick response. No worries about the doc string updates. It's better to have them here than not having them at all! 😄

julian-risch avatar Aug 04 '22 12:08 julian-risch

I just found some additional docs on using AMP with multiple GPUs.

  • For torch.nn.DistributedDataParallel it looks like the existing code should work (Link to docs)
  • For torch.nn.DataParallel it looks like we would need to "apply autocast as part of your model’s forward method to ensure it’s enabled in side threads." (Link to docs) A minimal sketch follows below.
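
A minimal sketch of that docs recommendation, assuming a toy model (MyModel, its layer, and the tensor shapes are placeholders, not haystack code): autocast is applied inside forward so that it is also active in the side threads that torch.nn.DataParallel spawns.

import torch
from torch import nn
from torch.cuda.amp import autocast

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)  # placeholder layer

    @autocast()  # equivalently: wrap the body in "with autocast():"
    def forward(self, x):
        # Ops in here run under autocast even when DataParallel calls forward in side threads.
        return self.linear(x)

dp_model = nn.DataParallel(MyModel().cuda())
with autocast():
    output = dp_model(torch.randn(4, 8, device="cuda"))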

sjrl avatar Aug 08 '22 07:08 sjrl

Sorry for the late review, I just realized I forgot to click "finish your review" after I reviewed it.

agnieszka-m avatar Aug 08 '22 08:08 agnieszka-m

Hi @sjrl what's the status of this PR? Would you like to get another round of reviews from @agnieszka-m and me?

julian-risch avatar Aug 25 '22 08:08 julian-risch

Hi, @julian-risch thanks for checking in. I don't think there are any new changes since the last review other than the merging of main. I hope to be able to finish the task of checking this on multiple GPUs next week.

sjrl avatar Aug 25 '22 08:08 sjrl

Hi @sjrl we are planning a release in the next two weeks. Could this PR maybe make it into the new release? Did you have the chance to test multi-gpu training with AMP? In the todo list there is another open item "For torch.nn.DataParallel it looks like we would need to "apply autocast as part of your model’s forward method to ensure it’s enabled in side threads.""

julian-risch avatar Dec 08 '22 08:12 julian-risch

FYI: we might upgrade to torch 1.13.1 once it's released.

julian-risch avatar Dec 08 '22 08:12 julian-risch

Hi @sjrl we are planning a release in the next two weeks. Could this PR maybe make it into the new release? Did you have the chance to test multi-gpu training with AMP? In the todo list there is another open item "For torch.nn.DataParallel it looks like we would need to "apply autocast as part of your model’s forward method to ensure it’s enabled in side threads.""

Hey @julian-risch. Sorry there was a small miscommunication. I have verified that amp works with torch.nn.DataParallel, but I have not verified that it works with torch.nn.DistributedDataParallel. I have updated the checklist in the top message to reflect this.

I haven't had time to test the multi-gpu training with torch.nn.DistributedDataParallel. I can try to get this done before the next release, but I am fairly busy at the moment so I'd also be happy to receive help here if you have the time.

sjrl avatar Dec 08 '22 19:12 sjrl

Hey @julian-risch I first tried to get DistributedDataParallel to work today by turning on distributed training (without amp). Right now this option is effectively hardcoded to off because we rely on the default value in the call to initialize_optimizer: https://github.com/deepset-ai/haystack/blob/77cea8b1408699f69dc09420f9f9af3cf5e6f47a/haystack/modeling/model/optimization.py#L71-L79

And even trying to set this to true when using initialize_optimizer in the training loop https://github.com/deepset-ai/haystack/blob/77cea8b1408699f69dc09420f9f9af3cf5e6f47a/haystack/nodes/reader/farm.py#L266

causes a multiprocessing error. So it looks like we would need to first fix the distributed training feature before confirming that amp works with it as well.

sjrl avatar Dec 08 '22 21:12 sjrl

So what's your opinion on the best way forward? I'd say we merge the changes that we have up to now and support just torch.nn.DataParallel but not torch.nn.DistributedDataParallel. Investigating the multiprocessing error and supporting torch.nn.DistributedDataParallel should then become the topic of a separate issue that we can add to the backlog.

julian-risch avatar Jan 04 '23 12:01 julian-risch

@julian-risch Yes, I completely agree.

sjrl avatar Jan 04 '23 12:01 sjrl