
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Open oroojlooy opened this issue 2 years ago • 9 comments

System Info

transformers 4.28.1
torch 2.0.0
torchaudio 2.0.0
torchvision 0.15.0
huggingface-hub 0.13.4
trl 0.4.2.dev0

Who can help?

Probably people from accelerate, trainer, and text: @pacman100, @sgugger, @ArthurZucker

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

  1. Install the TRL package from https://github.com/lvwerra/trl
  2. Clone the package and go to trl/examples/summarization/scripts
  3. Set up the accelerate config like this:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: GPT2Block
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
  4. Call accelerate launch reward_summarization.py

This results in the following error:

/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Error detected in WhereBackward0. Traceback of forward call that caused the error:
  File "reward_summarization.py", line 203, in <module>
    trainer.train(script_args.resume_from_checkpoint)
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 2699, in training_step
    loss = self.compute_loss(model, inputs)
  File "reward_summarization.py", line 185, in compute_loss
    rewards_j = model(input_ids=inputs["input_ids_j"], attention_mask=inputs["attention_mask_j"])[0]
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1420, in forward
    transformer_outputs = self.transformer(
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 899, in forward
    outputs = block(
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 389, in forward
    attn_outputs = self.attn(
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 330, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 201, in _attn
    attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
 (Triggered internally at /opt/conda/conda-bld/pytorch_1678402379298/work/torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "reward_summarization.py", line 203, in <module>
    trainer.train(script_args.resume_from_checkpoint)
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 2717, in training_step
    loss.backward()
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDABoolType [1, 1, 385, 385]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
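
For context, compute_loss in that script does roughly the following (my own sketch reconstructed from the traceback, not the exact TRL code; the _k inputs and the pairwise ranking loss are assumptions): two forward passes through the same wrapped model, followed by a single backward on their combined loss.

import torch.nn.functional as F

def compute_loss(model, inputs):
    # Forward pass on the "chosen" candidate (this is the call shown in the traceback above).
    rewards_j = model(input_ids=inputs["input_ids_j"],
                      attention_mask=inputs["attention_mask_j"])[0]
    # Assumed second forward pass on the "rejected" candidate.
    rewards_k = model(input_ids=inputs["input_ids_k"],
                      attention_mask=inputs["attention_mask_k"])[0]
    # Assumed pairwise ranking loss: push the chosen reward above the rejected one.
    return -F.logsigmoid(rewards_j - rewards_k).mean()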

Expected behavior

I expected it to run fine, but it ends in that error. Although this is not native Hugging Face code, the issue seems to come from the GPT-2/Trainer code in transformers.

oroojlooy avatar Apr 25 '23 17:04 oroojlooy

I cannot transfer the issue to the trl repo, but it should be opened there since the bug is in their example.

sgugger avatar Apr 25 '23 17:04 sgugger

@sgugger I have already posted it there, and it seems that the issue is not on the TRL side.

oroojlooy avatar Apr 25 '23 18:04 oroojlooy

torch.autograd.set_detect_anomaly(True) reports that the root of the issue might be line 201 of site-packages/transformers/models/gpt2/modeling_gpt2.py.
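
For anyone reproducing this, anomaly detection can be enabled at the top of the training script:

import torch

torch.autograd.set_detect_anomaly(True)  # makes autograd print the forward-pass traceback of the op that later fails in backward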


oroojlooy avatar Apr 25 '23 18:04 oroojlooy

It turned out that modifying line 201 as below solves the issue:

attn_weights = torch.where(causal_mask.clone(), attn_weights.to(attn_weights.dtype).clone(), mask_value)

For reference, it was originally:

attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)

@sgugger Do you know if it is a safe modification?

oroojlooy avatar Apr 28 '23 23:04 oroojlooy

This will break the flow of the gradients from the attention weights, so no, it's not a good fix.

sgugger avatar May 01 '23 13:05 sgugger

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 26 '23 15:05 github-actions[bot]

Any update on this? I am having the same issue

mukhal avatar Jun 03 '23 08:06 mukhal

I'm experiencing the same issue with WhisperModel.

pfeatherstone avatar Jun 06 '23 10:06 pfeatherstone

Actually, according to torch, the clone() operation does not break the flow of the gradient. See here:

This function is differentiable, so gradients will flow back from the result of this operation to input. To create a tensor without an autograd relationship to input see detach().
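
A quick sanity check confirms that gradients flow back through clone() to the original tensor:

import torch

x = torch.randn(3, requires_grad=True)
y = x.clone() * 2          # clone() is recorded in the autograd graph
y.sum().backward()
print(x.grad)              # tensor([2., 2., 2.]) -- cloning did not detach x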

Apparently, previous torch versions did not check for this, but the gradients were wrong (the source is a lost Stack Overflow thread). There are several more issues linked to this one: #25130, #22225, #15677, #14179, #24996, #23087. Whether this was fixed in the latest versions of torch is also an open question, but all of these issues involve FSDP.

Every in-place operation seems to be causing this, but we have a lot of those 😓 cc @pacman100, wondering what you would recommend: should we make everything compatible by removing in-place operations? That seems rather impractical.

This wrapper, https://github.com/pytorch/pytorch/blob/main/torch/autograd/graph.py#L508, seems to add clone() wherever it's needed. Might that be something to build on?
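
Assuming the wrapper referred to there is the torch.autograd.graph.allow_mutation_on_saved_tensors context manager (my reading of that link; the exact API may differ between torch versions), a rough sketch of how it could be used:

import torch
from torch.autograd.graph import allow_mutation_on_saved_tensors

linear = torch.nn.Linear(4, 4)
x = torch.randn(2, 4)        # saved for backward to compute the weight gradient

with allow_mutation_on_saved_tensors():
    out = linear(x)
    x.add_(1.0)              # in-place edit of a tensor saved for backward
    out.sum().backward()     # without the context manager this raises the version-counter error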

We should also pin the issue to redirect everyone who hits the FSDP + in-place operation problem.

ArthurZucker avatar Jul 27 '23 09:07 ArthurZucker

Also, removing all in-place operations might make memory usage a bit higher, so I would love an alternative solution for FSDP.

sgugger avatar Jul 27 '23 12:07 sgugger

I'm hitting the same issue while trying to get the GPT-2 embeddings of target via the following call:

self.gpt2.transformer.wte(target)

Error message:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:

However, applying the trick below made it succeed:

self.gpt2.transformer.wte(target.clone())

BTW, the GPT-2 model is set to evaluation mode with self.gpt2.eval().

kevin-s-wang avatar Dec 13 '23 12:12 kevin-s-wang

Hello,

cc @pacman100, wondering what you would recommend: should we make everything compatible by removing in-place operations? That seems rather impractical.

I don't have any recommendations at present other than replacing the in-place operations. Let me try this example to see whether the error persists with the latest PyTorch version.

pacman100 avatar Dec 15 '23 16:12 pacman100

Will mark as WIP, as this is not something we are actively working on.

ArthurZucker avatar Jan 10 '24 10:01 ArthurZucker

The error is triggered by the DDP buffer-broadcasting mechanism. Setting broadcast_buffers=False when wrapping the model avoids it:

model = torch.nn.parallel.DistributedDataParallel(model, broadcast_buffers=False, ...)
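
For reference, a minimal sketch of a hand-rolled DDP setup with that flag (a toy Linear stands in for the real reward model, and the script is assumed to be launched with torchrun or accelerate launch so the usual env variables are set):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")           # RANK/WORLD_SIZE/MASTER_ADDR come from the launcher
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(16, 16).cuda(local_rank)  # placeholder model
model = DDP(
    model,
    device_ids=[local_rank],
    broadcast_buffers=False,  # skip the per-forward buffer sync that bumps the saved buffers' version counters
)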

nguyentanthong avatar Feb 21 '24 15:02 nguyentanthong