Parameter at index 195 has been marked as ready twice.

skye95git opened this issue 1 year ago

System Info

  • transformers version: 4.28.0
  • Platform: Linux-5.4.0-122-generic-x86_64-with-glibc2.31
  • Python version: 3.9.12
  • Huggingface_hub version: 0.13.4
  • Safetensors version: not installed
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@ArthurZucker @younesbelkada

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I retrained RoBERTa on my own corpus with the MLM task. I set model.gradient_checkpointing_enable() to save memory:

model = RobertaModel.from_pretrained(model_name_or_path, config=config)
model.gradient_checkpointing_enable()  # Activate gradient checkpointing
model = Model(model, config, tokenizer, args)
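
The snippet above does not show how the model is wrapped for distributed training. A minimal sketch of the kind of DDP setup assumed here (process-group and LOCAL_RANK handling simplified; model comes from the snippet above):

import os

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumed setup: one process per GPU, launched e.g. with torchrun.
torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = model.to(local_rank)
# find_unused_parameters=True is the setting that, as discussed below,
# interacts badly with reentrant gradient checkpointing.
model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)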

My model:

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, model, config, tokenizer, args):
        super(Model, self).__init__()
        self.encoder = model
        self.config = config
        self.tokenizer = tokenizer
        self.args = args
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)
        self.lm_head.weight = self.encoder.embeddings.word_embeddings.weight
        self.register_buffer(
            "bias",
            torch.tril(torch.ones((args.block_size, args.block_size), dtype=torch.uint8)).view(1, args.block_size, args.block_size),
        )

    def forward(self, mlm_ids):
        ...

There is an error:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 195 with name encoder.encoder.layer.11.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

If I remove the line model.gradient_checkpointing_enable(), everything works fine. Why?

Expected behavior

I want to pre-train with gradient_checkpointing.

skye95git (Apr 27 '23 06:04)

There is little we can do to help without seeing a full reproducer.

sgugger (Apr 27 '23 12:04)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] (May 27 '23 15:05)

Got the exact same bug when calling gradient_checkpointing_enable().

CrissBrian (Jun 07 '23 21:06)

Are you using DDP?

I am using DDP on two GPUs:

python -m torch.distributed.run --nproc_per_node 2 run_audio_classification.py

(torch.distributed.run instead of torch.distributed.launch, because the latter fails)

All else being equal, facebook/wav2vec2-base works with gradient_checkpointing set to True; the large model, however, crashes unless the option is either set to False or removed.

gradient_checkpointing works for both models if using a single GPU, so the issue seems to be DDP-related.

This seems to come from:

https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/reducer.cpp

mirix (Oct 13 '23 09:10)

The problem may be that, when the Trainer is invoked from torchrun, it sets find_unused_parameters to True for all devices, when, apparently, it should only do so for the first one:

https://discuss.pytorch.org/t/finding-the-cause-of-runtimeerror-expected-to-mark-a-variable-ready-only-once/124428/3

The reason the base model works is that the option can be set to False for it; for the large model, however, it has to be True.

The solution would be to change the way that argument is parsed.

mirix (Oct 13 '23 09:10)

Thank you @mirix, setting ddp_find_unused_parameters=False in the Trainer solved this issue for me.
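
For reference, a minimal sketch of that workaround (output_dir is a placeholder; model and train_dataset come from one's own training script):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    gradient_checkpointing=True,       # keep checkpointing enabled
    ddp_find_unused_parameters=False,  # avoid marking parameters ready twice under DDP
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()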

infinitylogesh (Nov 03 '23 12:11)

If you use gradient_checkpointing_enable(), you can now overcome this issue by passing gradient_checkpointing_kwargs={"use_reentrant": False}:

model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
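
When training through the Trainer, the same flag can also be forwarded via TrainingArguments (assuming a transformers version recent enough to expose gradient_checkpointing_kwargs; output_dir is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",  # placeholder
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)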

younesbelkada (Nov 03 '23 12:11)