Parameter at index 195 has been marked as ready twice.
System Info
- transformers version: 4.28.0
- Platform: Linux-5.4.0-122-generic-x86_64-with-glibc2.31
- Python version: 3.9.12
- Huggingface_hub version: 0.13.4
- Safetensors version: not installed
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help?
@ArthurZucker @younesbelkada
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I retrained RoBERTa on my own corpus with the MLM task and called model.gradient_checkpointing_enable() to save memory.
from transformers import RobertaModel

model = RobertaModel.from_pretrained(model_name_or_path, config=config)
model.gradient_checkpointing_enable()  # Activate gradient checkpointing
model = Model(model, config, tokenizer, args)
My model:
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, model, config, tokenizer, args):
        super(Model, self).__init__()
        self.encoder = model
        self.config = config
        self.tokenizer = tokenizer
        self.args = args
        # Tie the LM head weights to the input embeddings.
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)
        self.lm_head.weight = self.encoder.embeddings.word_embeddings.weight
        self.register_buffer(
            "bias",
            torch.tril(
                torch.ones((args.block_size, args.block_size), dtype=torch.uint8)
            ).view(1, args.block_size, args.block_size),
        )

    def forward(self, mlm_ids):
        ...
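The forward body was elided in the report; a hypothetical MLM-style body for it might look like the sketch below (the attention-mask handling and the returned logits are my assumptions, not the reporter's actual code, and it is meant to be read inside the class above):

# Hypothetical sketch of the elided forward (not the reporter's code):
def forward(self, mlm_ids):
    # Encode the masked input ids with the wrapped RoBERTa encoder.
    attention_mask = mlm_ids.ne(self.tokenizer.pad_token_id)
    hidden_states = self.encoder(input_ids=mlm_ids, attention_mask=attention_mask).last_hidden_state
    # Project back to vocabulary space through the tied lm_head.
    return self.lm_head(hidden_states)  # (batch, seq_len, vocab_size)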
Training then fails with the following error:
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes, or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 195 with name encoder.encoder.layer.11.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
If I remove the line model.gradient_checkpointing_enable(), everything works. Why?
Expected behavior
I want to pre-train with gradient_checkpointing enabled.
There is little we can do to help without seeing a full reproducer.
I got the exact same bug when calling gradient_checkpointing_enable().
Are you using DDP?
I am using DDP on two GPUs:
python -m torch.distributed.run --nproc_per_node 2 run_audio_classification.py
(torch.distributed.run because torch.distributed.launch fails)
All else being equal, facebook/wav2vec2-base works with gradient_checkpointing set to True; the large model, however, crashes unless the option is either set to False or removed.
gradient_checkpointing works for both models if using a single GPU, so the issue seems to be DDP-related.
This seems to come from:
https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/reducer.cpp
The problem may be that when the Trainer is invoked via torchrun, it sets find_unused_parameters to True for all devices when, apparently, it should only do so for the first one:
https://discuss.pytorch.org/t/finding-the-cause-of-runtimeerror-expected-to-mark-a-variable-ready-only-once/124428/3
The base model works because that option can be set to False; for the large model, however, it has to be True. The solution would be to change the way that argument is parsed.
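For reference, the clash can be reproduced outside transformers with plain PyTorch. Reentrant activation checkpointing re-runs the forward pass during backward, so DDP's reducer hooks fire a second time for the checkpointed parameters when find_unused_parameters=True. A minimal sketch, assuming two GPUs and NCCL (the module and tensor sizes are arbitrary placeholders):

# Launch with: python -m torch.distributed.run --nproc_per_node 2 repro.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.inner = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))

    def forward(self, x):
        # Reentrant checkpointing re-runs this forward during backward,
        # so DDP's autograd hooks can fire twice for the same parameters.
        return checkpoint(self.inner, x, use_reentrant=True)

def main():
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)
    model = nn.parallel.DistributedDataParallel(
        Block().cuda(rank),
        device_ids=[rank],
        find_unused_parameters=True,  # the combination that triggers the error
    )
    # The input must require grad so the reentrant checkpoint records a graph.
    x = torch.randn(4, 32, device=rank, requires_grad=True)
    model(x).sum().backward()  # "Expected to mark a variable ready only once"

if __name__ == "__main__":
    main()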
Thank you @mirix. Setting ddp_find_unused_parameters=False in the Trainer solved this issue for me.
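For anyone applying that fix, a minimal sketch of the relevant Trainer arguments (output_dir is a placeholder; only the two flags below matter here):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    gradient_checkpointing=True,       # keep activation checkpointing on
    ddp_find_unused_parameters=False,  # stop DDP marking checkpointed params twice
)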
If you use gradient_checkpointing_enable(), you can now overcome this issue by passing gradient_checkpointing_kwargs={"use_reentrant": False}:
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
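Non-reentrant checkpointing (use_reentrant=False) tracks recomputation through saved-tensor hooks instead of a nested backward pass, so each DDP reducer hook fires only once per parameter. If you train through Trainer, recent transformers releases accept the same option via TrainingArguments; a sketch (the exact minimum version is not verified here):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",  # placeholder
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)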