Migrating to use native PyTorch AMP
Related Issue(s): Issue #1512, Issue #1222
Proposed changes: Migrate to PyTorch's native AMP (https://pytorch.org/docs/stable/notes/amp_examples.html) because it is much easier to use (no additional dependency on apex), needs fewer code changes to support, and is the recommended approach (https://discuss.pytorch.org/t/torch-cuda-amp-vs-nvidia-apex/74994).
Using the native AMP support in PyTorch requires using `torch.cuda.amp.autocast` and `torch.cuda.amp.GradScaler` together. Both can easily be "turned off" (so that no autocasting or scaling occurs) by passing the option `enabled=False`.

For example, the following code performs a standard training loop because we have passed the option `enabled=False` to both `autocast` and `GradScaler`:
```python
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast

# Net, epochs, data and loss_fn are placeholders, as in the PyTorch docs example.
use_amp = False

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler(enabled=use_amp)

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast(enabled=use_amp):
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
```
And similarly it is easy to turn on AMP by passing `enabled=True`.
There are breaking changes:
- The input type for `use_amp` in the `FARMReader.train` method is now `bool` instead of `str`.
- Currently this PR deprecates apex and does not try to support both PyTorch AMP and apex AMP.
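To illustrate the new interface, a `FARMReader` training call would now pass a boolean flag. The paths and other training parameters below are placeholders for illustration, not taken from this PR:

```python
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

# use_amp is now a plain bool; previously it was a str (an apex opt level such as "O1")
reader.train(
    data_dir="data/squad",        # placeholder directory
    train_filename="train.json",  # placeholder file name
    n_epochs=2,
    use_amp=True,
)
```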
Pre-flight checklist
- [x] I have read the contributors guidelines
- [x] If this is a code change, I added tests or updated existing ones
- [x] If this is a code change, I updated the docstrings
TODO
- [x] Checked that `FARMReader.train` works with `use_amp=True` (and `False`) on a single GPU
- [x] Test tutorial 9 DPR training in a GPU environment with `use_amp` turned on (and off) and with `grad_acc_steps`
- [x] Add `use_amp` to the trainer state dict so that when restarting from a checkpoint AMP is set up as expected (see the checkpointing sketch after this list)
- [ ] Test multi-gpu training with AMP
- [ ] Open a PR in https://github.com/deepset-ai/haystack-website editing this file explaining AMP
  - I could imagine a section on that page where you briefly describe what AMP is, when to use it and how to use it. Perhaps also give readers an idea of how big a trade off in accuracy / speed it is.
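As a rough sketch of what restoring AMP state from a checkpoint involves (the checkpoint keys and the `use_amp` entry below are illustrative assumptions, not the actual Haystack trainer code), PyTorch recommends saving and restoring the `GradScaler` state dict alongside the model and optimizer state:

```python
import torch
from torch import nn, optim
from torch.cuda.amp import GradScaler

# Minimal stand-ins for the real model and optimizer used during training
model = nn.Linear(8, 2).cuda()
optimizer = optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler(enabled=True)

# Saving a checkpoint: include the scaler state next to model and optimizer state.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scaler": scaler.state_dict(),  # AMP loss-scale state
    "use_amp": True,                # hypothetical flag mirroring the trainer state dict entry
}
torch.save(checkpoint, "checkpoint.pt")

# Resuming: restore the scaler state so loss scaling continues where it left off.
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scaler = GradScaler(enabled=checkpoint["use_amp"])
scaler.load_state_dict(checkpoint["scaler"])
```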
First of all, I support removing nvidia apex and adding PyTorch AMP. 👍 Doing quick research regarding this decision, it seems to be the preferred, future-proof way: https://discuss.pytorch.org/t/torch-cuda-amp-vs-nvidia-apex/74994/2 What we should do once this PR is ready to be merged is a small benchmark. And we would also need to ensure that the documentation explains when to use this feature, how to use this feature and what to expect from it.
> And we would also need to ensure that the documentation explains when to use this feature, how to use this feature and what to expect from it.

Where should this documentation be added?
> > And we would also need to ensure that the documentation explains when to use this feature, how to use this feature and what to expect from it.
>
> Where should this documentation be added?

@brandenchan Could you please advise here? Thank you! I would say here: https://haystack.deepset.ai/guides/optimization. So that would require a separate PR in https://github.com/deepset-ai/haystack-website, but maybe @brandenchan can help with that.
@julian-risch Do we happen to have any GPU resources in our CI testing so I could add a test to make sure the changes to `use_amp` work as expected?
> @julian-risch Do we happen to have any GPU resources in our CI testing so I could add a test to make sure the changes to `use_amp` work as expected?
Unfortunately the CI is CPU only, no. I would recommend an EC2 instance where you could check out your branch and run the tests while using the GPU.
> > And we would also need to ensure that the documentation explains when to use this feature, how to use this feature and what to expect from it.
>
> Where should this documentation be added?
>
> @brandenchan Could you please advise here? Thank you! I would say here: https://haystack.deepset.ai/guides/optimization. So that would require a separate PR in https://github.com/deepset-ai/haystack-website, but maybe @brandenchan can help with that.
Hi yes sure! To make changes to that specific optimisation page, you will want to open a PR that edits this file. I could imagine a section on that page where you briefly describe what AMP is, when to use it and how to use it. Perhaps also give readers an idea of how big a trade off in accuracy / speed it is.
Happy to review it once you've opened the PR! Just let me know if you want any more pointers regarding this process. The readme of the haystack-website repo describes how you can set up a preview environment on your own machine. There is also an action in the PR which will deploy your changes to a staging environment though this takes a bit longer than previewing on your own machine.
Hi @sjrl have you already created a PR regarding documentation? This PR here looks good to me at first sight. It's a bit unfortunate that there are so many unrelated doc string updates or whitespace changes in there, but it's not a big problem. Before I go through the code in detail, I would like to try it out a bit, and in that context a draft PR regarding documentation would be great. Did you test it out yourself already?
Hi @julian-risch, sorry about the additional doc updates. I was just making those as I was exploring the code. I'll put them into a separate PR next time.
I have had a chance to test this code out for training a FARMReader on a single GPU. I haven't tried the different combinations (e.g. multi-gpu, gradient_clipping, etc.) to make sure all features work as expected.
Also I have not had time yet to make a draft PR for documentation.
Alright, thanks for the quick response. No worries about the doc string updates. It's better to have them here than not having them at all! 😄
I just found some additional docs on using AMP with multiple GPUs.
Sorry for the late review, just realized I forgot to click "finish your review" after I reviewed it.
Hi @sjrl what's the status of this PR? Would you like to get another round of reviews from @agnieszka-m and me?
Hi, @julian-risch thanks for checking in. I don't think there are any new changes since the last review other than the merging of main. I hope to be able to finish the task of checking this on multiple GPUs next week.
Hi @sjrl we are planning a release in the next two weeks. Could this PR maybe make it into the new release? Did you have the chance to test multi-gpu training with AMP? In the todo list there is another open item: "For `torch.nn.DataParallel` it looks like we would need to "apply autocast as part of your model's forward method to ensure it's enabled in side threads.""
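For reference, the pattern the PyTorch AMP docs describe for `torch.nn.DataParallel` (a single process that spawns side threads) is to apply `autocast` inside the model's `forward` method so that each replica runs its forward pass under autocast. A minimal sketch, with `MyModel` as a stand-in for the actual network:

```python
import torch
from torch import nn
from torch.cuda.amp import autocast

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 2)

    @autocast()  # alternatively, wrap the body in "with autocast():"
    def forward(self, x):
        # Applying autocast here (rather than only around the top-level call)
        # ensures it is enabled in the side threads that DataParallel spawns.
        return self.linear(x)

model = nn.DataParallel(MyModel().cuda())
output = model(torch.randn(4, 8).cuda())  # forward runs under autocast on every replica
```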
FYI: we might upgrade to torch 1.13.1 once it's released.
> Hi @sjrl we are planning a release in the next two weeks. Could this PR maybe make it into the new release? Did you have the chance to test multi-gpu training with AMP? In the todo list there is another open item: "For `torch.nn.DataParallel` it looks like we would need to "apply autocast as part of your model's forward method to ensure it's enabled in side threads.""
Hey @julian-risch. Sorry, there was a small miscommunication. I have verified that AMP works with `torch.nn.DataParallel`, but I have not verified that it works with `torch.nn.DistributedDataParallel`. I have updated the checklist in the top message to reflect this.

I haven't had time to test the multi-gpu training with `torch.nn.DistributedDataParallel`. I can try to get this done before the next release, but I am fairly busy at the moment, so I'd also be happy to receive help here if you have the time.
Hey @julian-risch I first tried to get `DistributedDataParallel` to work today by turning on `distributed` training (without AMP). Right now this option is hardcoded to be off, since we are using the default value in the call to `initialize_optimizer`:
https://github.com/deepset-ai/haystack/blob/77cea8b1408699f69dc09420f9f9af3cf5e6f47a/haystack/modeling/model/optimization.py#L71-L79

And even trying to set this to true when using `initialize_optimizer` in the training loop
https://github.com/deepset-ai/haystack/blob/77cea8b1408699f69dc09420f9f9af3cf5e6f47a/haystack/nodes/reader/farm.py#L266
causes a multiprocessing error. So it looks like we would need to first fix the distributed training feature before confirming that AMP works with it as well.
So what's your opinion on the best way forward? I'd say we merge the changes that we have up to now and support just `torch.nn.DataParallel` but not `torch.nn.DistributedDataParallel`. Investigating the multiprocessing error and supporting `torch.nn.DistributedDataParallel` should then become the topic of a separate issue that we can add to the backlog.
@julian-risch Yes, I completely agree.