A bug when performing gradient accumulation?
System Info
- `Accelerate` version: 0.16.0
- Platform: Linux-4.14.105-1-tlinux3-0013-x86_64-with-glibc2.2.5
- Python version: 3.8.12
- Numpy version: 1.21.4
- PyTorch version (GPU?): 1.10.0 (True)
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate, or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- [X] My own task or dataset (give details below)
Reproduction
step = 0
while True:
    for batch in training_dataloader:
        step += 1
        with accelerator.accumulate(model):
            inputs, targets = batch
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        if step % 100 == 0:
            for batch in eval_dataloader:
                inputs, targets = batch
                outputs = model(inputs)
                loss = loss_function(outputs, targets)
Expected behavior
As shown, we alternate between training and evaluating (this is common, as we monitor performance on the dev set while training the model), but the scheduler's step becomes wrong: it seems the forward passes during evaluation are also counted toward gradient accumulation.
I think I found the reason: the loop over eval_dataloader sets self.gradient_state.end_of_dataloader to True. Then self.gradient_state.sync_gradients is always set to True, according to
https://github.com/huggingface/accelerate/blob/978dfc38ea91aae645d780d393d5bc0e33ac8306/src/accelerate/accelerator.py#L627
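To make the effect concrete, here is a minimal, paraphrased sketch of that sync decision (not the exact library source): once end_of_dataloader is stuck at True, every step is treated as a sync step, so optimizer.step() and scheduler.step() fire on every iteration instead of once per accumulation window.

# Paraphrased sketch of the sync decision, not the exact Accelerate source.
def should_sync_gradients(step, accumulation_steps, end_of_dataloader):
    if end_of_dataloader:
        # Reaching the end of a dataloader always forces a real optimizer step.
        return True
    return step % accumulation_steps == 0

# With end_of_dataloader stuck at True after the eval loop, every step syncs:
for step in range(1, 5):
    print(step, should_sync_gradients(step, accumulation_steps=4, end_of_dataloader=True))
# -> True on every step, instead of only when step % 4 == 0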
Do you feel this is a bug?
cc @muellerzr
I would like to share my current solution to this problem.
step = 0
while True:
    for batch in training_dataloader:
        step += 1
        with accelerator.accumulate(model):
            inputs, targets = batch
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        if step % 100 == 0:
            ######## we save the state of the outer dataloader
            end_of_train_dataloader = accelerator.gradient_state.end_of_dataloader
            ########
            for batch in eval_dataloader:
                inputs, targets = batch
                outputs = model(inputs)
                loss = loss_function(outputs, targets)
            ######## we restore the state of the outer dataloader
            accelerator.gradient_state._set_end_of_dataloader(end_of_train_dataloader)
            ########
Only two lines of added code: end_of_train_dataloader = accelerator.gradient_state.end_of_dataloader and accelerator.gradient_state._set_end_of_dataloader(end_of_train_dataloader).
Let me know if you like this solution. I am happy to make a PR (adding the two lines into the __iter__ function of Accelerate's dataloader).
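For readability, the same two lines could be wrapped in a small helper, sketched below. preserve_end_of_dataloader is a hypothetical utility (not part of the Accelerate API); it relies only on the two gradient_state attributes used above.

from contextlib import contextmanager

@contextmanager
def preserve_end_of_dataloader(accelerator):
    # Hypothetical helper: snapshot the outer dataloader's end_of_dataloader
    # flag before the eval loop and restore it afterwards.
    saved = accelerator.gradient_state.end_of_dataloader
    try:
        yield
    finally:
        accelerator.gradient_state._set_end_of_dataloader(saved)

# Usage inside the training loop:
# if step % 100 == 0:
#     with preserve_end_of_dataloader(accelerator):
#         for batch in eval_dataloader:
#             ...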
@jcyk correct me if I'm wrong here, but isn't there also a bug in the code: model.eval() and/or torch.no_grad() needs to be used so gradients don't get calculated on the eval dataloader?
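For reference, a minimal sketch of that eval loop with gradient tracking disabled, assuming the same model, eval_dataloader, and loss_function as in the reproduction above:

import torch

model.eval()
with torch.no_grad():
    # No gradients are computed for the evaluation forward passes.
    for batch in eval_dataloader:
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
model.train()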
Otherwise I think a decent solution may be something like:
with accelerator.disable_gradient_accumulation():
    ...
which should only be used in situations like this specifically. Once entered, it maintains the original state of end_of_dataloader, and once exited it will reset it.
This is also similar to https://github.com/huggingface/accelerate/issues/960 I believe, so the proposed solution can be done if we agree on it @sgugger and @jcyk :)
@muellerzr
I think the problem is that when we finish the for batch in eval_dataloader loop, the state of end_of_dataloader is set to True, but note that we are still in the for batch in training_dataloader loop. This makes with accelerator.accumulate(model) not work as we expected.
I don't think with accelerator.disable_gradient_accumulation() is good enough, because, as you can see, the eval code is not inside the with accelerator.accumulate(model) block.
@jcyk exactly, hence the other solution noted there about pause
and resume
, which discusses your exact issue.
I.e.:
dl1, dl2 = accelerator.prepare(dl1, dl2)
for i, batch in enumerate(dl1):
    ...
    if i == 5:
        accelerator.gradient_state.pause()
        for batch in dl2:
            ....
        accelerator.gradient_state.resume()
    ...
Hi @jcyk, apologies for the wait on this. Could you try again by installing accelerate with pip install git+https://github.com/huggingface/accelerate@dataloader-multistate?
You shouldn't need the if/else there, just run the code as normal without any modification, and it should work as you expect. Thanks!
@muellerzr I have just come across the same issue.
I ran pip install git+https://github.com/huggingface/accelerate.git --upgrade but the issue remains the same. I had to run:
end_of_train_dataloader = accelerator.gradient_state.end_of_dataloader
# evaluate...
accelerator.gradient_state._set_end_of_dataloader(end_of_train_dataloader)
as @jcyk suggested, which fixed the issue for me.
I love accelerate, it is awesome, but I do have one suggestion: perhaps updating the examples more frequently would allow such bugs to be discovered earlier. For example, with accelerator.accumulate(model): is absent from the examples I saw (e.g. complete_nlp_example).
Thanks!
@yuvalkirstain can you provide a reproducer for me? When I tested this earlier, it worked out of the box with gradient accumulation automatically. Also make sure you are running version 0.18.0dev, or do pip install accelerate -U to ensure you're running 0.17.1.
https://github.com/huggingface/accelerate/pull/1162 doesn't seem to be enough to solve the problem. When the second (inner) dataloader finishes, it is removed and the state is set to True, which still affects the outer dataloader's state.
@Ethan-yt can you provide a reproducer for me to test with? Thanks!
Sure. Here is a simple code for this bug:
from torch.utils.data import DataLoader
from accelerate.accelerator import Accelerator
from accelerate.test_utils import RegressionDataset

accelerator = Accelerator()
train_dataset = RegressionDataset(length=80)
train_dataloader = DataLoader(train_dataset, batch_size=16)
eval_dataset = RegressionDataset(length=80)
eval_dataloader = DataLoader(eval_dataset, batch_size=16)
train_dataloader, eval_dataloader = accelerator.prepare(train_dataloader, eval_dataloader)

for train_step, _ in enumerate(train_dataloader):
    print(f'train step: {train_step}')
    print(f'is end: {accelerator.gradient_state.end_of_dataloader}')
    # training...
    if train_step % 2 == 1:
        # eval...
        for eval_step, _ in enumerate(eval_dataloader):
            print(f'eval step: {eval_step}')
            print(f'is end: {accelerator.gradient_state.end_of_dataloader}')
the output is
train step: 0
is end: False
train step: 1
is end: False
eval step: 0
is end: False
eval step: 1
is end: False
eval step: 2
is end: False
eval step: 3
is end: False
eval step: 4
is end: True
train step: 2
is end: True <----- bug from here
train step: 3
is end: True
eval step: 0
is end: False
eval step: 1
is end: False
eval step: 2
is end: False
eval step: 3
is end: False
eval step: 4
is end: True
train step: 4
is end: True
Just wanted to propose my original solution again: keep a copy of the current gradient_state before entering a new dataloader, and restore it on exit. That way users do not need to handle the issue explicitly.
Hope it helps!
In my opinion, making GradientState a singleton is a bad design. Each dataloader has a GradientState, yet they all share a global state, which sounds pretty weird. Maybe it should be changed so that a GradientState belongs to each dataloader, and there is a method in the accelerator to get the current GradientState in each loop.
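A rough, hypothetical sketch of that idea (the names below are made up, this is not the Accelerate API): each prepared dataloader owns its own state object, and the accelerator consults whichever loader is currently innermost, so an inner eval loop cannot clobber the outer training loop's state.

from dataclasses import dataclass
from typing import List

@dataclass
class LoaderState:
    # Hypothetical per-dataloader state instead of a global singleton.
    end_of_dataloader: bool = False

class ActiveLoaderStack:
    # The accelerator would read the state of the most recently entered loader.
    def __init__(self):
        self._stack: List[LoaderState] = []

    def enter(self, state: LoaderState) -> None:
        self._stack.append(state)

    def exit(self) -> None:
        self._stack.pop()

    @property
    def current(self) -> LoaderState:
        return self._stack[-1]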
@jcyk https://github.com/huggingface/accelerate/pull/1483 I created a pull request to fix this bug. Could you please check if it fixes the problem you mentioned?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, sorry to bring this old issue up again. @muellerzr @Ethan-yt I have experienced the same problem. My code is a little different from theirs: instead of exhausting the entire validation dataset inside the training loop, I only evaluate for several batches and then return to training. The structure of my code is like this:
train_loader, val_loader = accelerator.prepare(train_loader, val_loader)
val_iter = iter(val_loader)
for step, batch in enumerate(train_loader):
    # train...
    with accelerator.accumulate(model):
        ...
        if accelerator.sync_gradients:
            ...
    if step % 100 == 0:
        for _ in range(10):
            batch = next(val_iter, None)
            if batch is None:
                # here, when recreating the iterator for val_loader, it will override accelerator.gradient_state.active_dataloader,
                # thus the next time it is exhausted, end_of_dataloader will be set to True.
                # So sync_gradients will always be True and gradient accumulation will fail to work until the next validation step.
                val_iter = iter(val_loader)
                batch = next(val_iter)
            # validation...
            ...
In this particular case, train_loader and val_loader are used alternately, so the tracking mechanism of the active dataloader won't work. In fact, I don't understand why we need to track the validation dataloader at all... it has nothing to do with gradient accumulation. Maybe we could track only the first dataloader instance passed to accelerator.prepare? Or provide a way to exclude a dataloader from being tracked and tied to gradient accumulation?
Hi all, any updates here?
Hi, for now I have figured out a way to bypass this tracking mechanism while still having distributed validation. Here is my (admittedly ugly) hack:
def patch_val_loader(loader):
    # Neutralize the gradient-state tracking hooks of this prepared dataloader:
    # begin() immediately undoes its own registration, and end() becomes a no-op.
    ori_begin_method = loader.begin
    ori_end_method = loader.end
    loader.begin = lambda: (ori_begin_method(), ori_end_method())[0]
    loader.end = lambda: None

patch_val_loader(val_loader)
After patching it, you are good to go. No matter how you use your validation data loader inside the training loop, it won't affect the gradient state.
I encountered the same problem, and I found that accelerate==0.25.0 works fine.
I seem to be experiencing this again even though I'm running with accelerate 0.31.0