A bug when performing gradient accumulation?
System Info
- `Accelerate` version: 0.16.0
- Platform: Linux-4.14.105-1-tlinux3-0013-x86_64-with-glibc2.2.5
- Python version: 3.8.12
- Numpy version: 1.21.4
- PyTorch version (GPU?): 1.10.0 (True)
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate, or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- [X] My own task or dataset (give details below)
Reproduction
step = 0
while True:
    for batch in training_dataloader:
        step += 1
        with accelerator.accumulate(model):
            inputs, targets = batch
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        if step % 100 == 0:
            for batch in eval_dataloader:
                inputs, targets = batch
                outputs = model(inputs)
                loss = loss_function(outputs, targets)
Expected behavior
As shown, we alternate between training and evaluating (this is common, as we monitor performance on the dev set while training the model), but the scheduler's step becomes wrong: it seems the forward passes during evaluation are also counted toward gradient accumulation.
I think I found the reason: the loop over eval_dataloader sets self.gradient_state.end_of_dataloader to True. Then self.gradient_state.sync_gradients is always set to True, according to
https://github.com/huggingface/accelerate/blob/978dfc38ea91aae645d780d393d5bc0e33ac8306/src/accelerate/accelerator.py#L627
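To make the effect concrete, here is a minimal, paraphrased sketch of that sync decision (not the exact library source): once end_of_dataloader is stuck at True, every step is treated as a sync step, so optimizer.step() and scheduler.step() fire on every iteration instead of once per accumulation window.

# Paraphrased sketch of the sync decision, not the exact Accelerate source.
def should_sync_gradients(step, accumulation_steps, end_of_dataloader):
    if end_of_dataloader:
        # Reaching the end of a dataloader always forces a real optimizer step.
        return True
    return step % accumulation_steps == 0

# With end_of_dataloader stuck at True after the eval loop, every step syncs:
for step in range(1, 5):
    print(step, should_sync_gradients(step, accumulation_steps=4, end_of_dataloader=True))
# -> True on every step, instead of only when step % 4 == 0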
Do you feel this is a bug?
cc @muellerzr
I would like to share my current solution to this problem.
step = 0
while True:
    for batch in training_dataloader:
        step += 1
        with accelerator.accumulate(model):
            inputs, targets = batch
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        if step % 100 == 0:
            ######## we save the state of the outer dataloader
            end_of_train_dataloader = accelerator.gradient_state.end_of_dataloader
            ########
            for batch in eval_dataloader:
                inputs, targets = batch
                outputs = model(inputs)
                loss = loss_function(outputs, targets)
            ######## we restore the state of the outer dataloader
            accelerator.gradient_state._set_end_of_dataloader(end_of_train_dataloader)
            ########
Only two lines of added code: end_of_train_dataloader = accelerator.gradient_state.end_of_dataloader and accelerator.gradient_state._set_end_of_dataloader(end_of_train_dataloader).
Let me know if you like this solution. I am happy to make a PR (adding the two lines into the __iter__ function of Accelerate's dataloader).
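For readability, the same two lines could be wrapped in a small helper, sketched below. preserve_end_of_dataloader is a hypothetical utility (not part of the Accelerate API); it relies only on the two gradient_state attributes used above.

from contextlib import contextmanager

@contextmanager
def preserve_end_of_dataloader(accelerator):
    # Hypothetical helper: snapshot the outer dataloader's end_of_dataloader
    # flag before the eval loop and restore it afterwards.
    saved = accelerator.gradient_state.end_of_dataloader
    try:
        yield
    finally:
        accelerator.gradient_state._set_end_of_dataloader(saved)

# Usage inside the training loop:
# if step % 100 == 0:
#     with preserve_end_of_dataloader(accelerator):
#         for batch in eval_dataloader:
#             ...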
@jcyk correct me if I'm wrong here, but isn't there also a bug in the code: model.eval() and/or torch.no_grad() needs to be used so gradients don't get calculated on the eval dataloader?
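For reference, a minimal sketch of that eval loop with gradient tracking disabled, assuming the same model, eval_dataloader, and loss_function as in the reproduction above:

import torch

model.eval()
with torch.no_grad():
    # No gradients are computed for the evaluation forward passes.
    for batch in eval_dataloader:
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
model.train()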
Otherwise I think a decent solution may be something like:
with accelerator.disable_gradient_accumulation():
    ...
which should only be used in situations like this specifically. Once entered, it maintains the original state of end_of_dataloader, and once exited it will reset it.
This is also similar to https://github.com/huggingface/accelerate/issues/960 I believe, so the proposed solution can be done if we agree on it @sgugger and @jcyk :)
@muellerzr
I think the problem is that when we finish the for batch in eval_dataloader loop, the state of end_of_dataloader is set to True, but note that we are still in the for batch in training_dataloader loop. This makes with accelerator.accumulate(model) not work as we expected.
I don't think with accelerator.disable_gradient_accumulation() is good enough, because, as you can see, the eval code is not inside the with accelerator.accumulate(model) block.
@jcyk exactly, hence the other solution noted there about pause
and resume
, which discusses your exact issue.
I.e.:
dl1, dl2 = accelerator.prepare(dl1, dl2)
for i, batch in enumerate(dl1):
    ...
    if i == 5:
        accelerator.gradient_state.pause()
        for batch in dl2:
            ....
        accelerator.gradient_state.resume()
    ...
Hi @jcyk, apologies for the wait on this. Could you try again by installing accelerate with pip install git+https://github.com/huggingface/accelerate@dataloader-multistate?
You shouldn't need the if/else there, just run the code as normal without any modification, and it should work as you expect. Thanks!
@muellerzr I have just come across the same issue.
I ran pip install git+https://github.com/huggingface/accelerate.git --upgrade but the issue remains the same. I had to run:
end_of_train_dataloader = accelerator.gradient_state.end_of_dataloader
# evaluate...
accelerator.gradient_state._set_end_of_dataloader(end_of_train_dataloader)
as @jcyk suggested, which fixed the issue for me.
I love accelerate, it is awesome, but I do have one suggestion: perhaps updating the examples more frequently would allow such bugs to be discovered earlier. For example, with accelerator.accumulate(model): is absent from the examples I saw (e.g. complete_nlp_example).
Thanks!
@yuvalkirstain can you provide a reproducer for me? When I tested this earlier, it worked out of the box with gradient accumulation automatically. Also make sure you are running version 0.18.0dev, or do pip install accelerate -U to ensure you're running 0.17.1.
https://github.com/huggingface/accelerate/pull/1162 doesn't seem to be enough to solve the problem. When the second (inner) dataloader finishes, it is removed and the state is set to True, which still affects the outer dataloader's state.
@Ethan-yt can you provide a reproducer for me to test with? Thanks!
Sure. Here is a simple code for this bug:
from torch.utils.data import DataLoader
from accelerate.accelerator import Accelerator
from accelerate.test_utils import RegressionDataset

accelerator = Accelerator()
train_dataset = RegressionDataset(length=80)
train_dataloader = DataLoader(train_dataset, batch_size=16)
eval_dataset = RegressionDataset(length=80)
eval_dataloader = DataLoader(eval_dataset, batch_size=16)
train_dataloader, eval_dataloader = accelerator.prepare(train_dataloader, eval_dataloader)

for train_step, _ in enumerate(train_dataloader):
    print(f'train step: {train_step}')
    print(f'is end: {accelerator.gradient_state.end_of_dataloader}')
    # training...
    if train_step % 2 == 1:
        # eval...
        for eval_step, _ in enumerate(eval_dataloader):
            print(f'eval step: {eval_step}')
            print(f'is end: {accelerator.gradient_state.end_of_dataloader}')
the output is
train step: 0
is end: False
train step: 1
is end: False
eval step: 0
is end: False
eval step: 1
is end: False
eval step: 2
is end: False
eval step: 3
is end: False
eval step: 4
is end: True
train step: 2
is end: True <----- bug from here
train step: 3
is end: True
eval step: 0
is end: False
eval step: 1
is end: False
eval step: 2
is end: False
eval step: 3
is end: False
eval step: 4
is end: True
train step: 4
is end: True
Just wanted to propose my original solution again: keep a copy of the current gradient_state before entering a new dataloader, and restore it on exit. That way users do not need to handle the issue explicitly.
Hope it helps!
In my opinion, making GradientState a singleton is a bad design. Each dataloader has a GradientState, yet they all share a global state, which sounds pretty weird. Maybe it should be changed so that a GradientState belongs to each dataloader, and there is a method in the accelerator to get the current GradientState in each loop.
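A rough, hypothetical sketch of that idea (the names below are made up, this is not the Accelerate API): each prepared dataloader owns its own state object, and the accelerator consults whichever loader is currently innermost, so an inner eval loop cannot clobber the outer training loop's state.

from dataclasses import dataclass
from typing import List

@dataclass
class LoaderState:
    # Hypothetical per-dataloader state instead of a global singleton.
    end_of_dataloader: bool = False

class ActiveLoaderStack:
    # The accelerator would read the state of the most recently entered loader.
    def __init__(self):
        self._stack: List[LoaderState] = []

    def enter(self, state: LoaderState) -> None:
        self._stack.append(state)

    def exit(self) -> None:
        self._stack.pop()

    @property
    def current(self) -> LoaderState:
        return self._stack[-1]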
@jcyk https://github.com/huggingface/accelerate/pull/1483 I created a pull request to fix this bug. Could you please check if it fixes the problem you mentioned?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, sorry to bring this old issue up again. @muellerzr @Ethan-yt I have experienced the same problem. My code is a little different from theirs: instead of exhausting the entire validation dataset inside the training loop, I only evaluate for several batches and then return to training. The structure of my code is like this:
train_loader, val_loader = accelerator.prepare(train_loader, val_loader)
val_iter = iter(val_loader)
for step, batch in enumerate(train_loader):
    # train...
    with accelerator.accumulate(model):
        ...
        if accelerator.sync_gradients:
            ...
    if step % 100 == 0:
        for _ in range(10):
            batch = next(val_iter, None)
            if batch is None:
                # here, when recreating the iterator for val_loader, it will override accelerator.gradient_state.active_dataloader,
                # thus the next time it is exhausted, end_of_dataloader will be set to True.
                # So sync_gradients will always be True and gradient accumulation will fail to work until the next validation step.
                val_iter = iter(val_loader)
                batch = next(val_iter)
            # validation...
            ...
In this particular case, train_loader and val_loader are used alternately, so the tracking mechanism of the active dataloader won't work. In fact, I don't understand why we need to track the validation dataloader at all... it has nothing to do with gradient accumulation. Maybe we could track only the first dataloader instance passed to accelerator.prepare? Or provide a way to exclude a dataloader from being tracked and tied to gradient accumulation?
Hi all, any updates here?
Hi, for now I have figured out a way to bypass this tracking mechanism while still having distributed validation. Here is my (admittedly ugly) hack:
def patch_val_loader(loader):
    # Neutralize the gradient-state tracking hooks of this prepared dataloader:
    # begin() immediately undoes its own registration, and end() becomes a no-op.
    ori_begin_method = loader.begin
    ori_end_method = loader.end
    loader.begin = lambda: (ori_begin_method(), ori_end_method())[0]
    loader.end = lambda: None

patch_val_loader(val_loader)
After patching it, you are good to go. No matter how you use your validation data loader inside the training loop, it won't affect the gradient state.
I encountered the same problem, and I found that accelerate==0.25.0 works fine.
I seem to be experiencing this again even though I'm running with accelerate 0.31.0