RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
System Info
- transformers 4.28.1
- torch 2.0.0
- torchaudio 2.0.0
- torchvision 0.15.0
- huggingface-hub 0.13.4
- trl 0.4.2.dev0
Who can help?
Probably people from accelerate, trainer, and text models: @pacman100, @sgugger, @ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
- Install the TRL package from https://github.com/lvwerra/trl
- Clone the repository and go to `trl/examples/summarization/scripts`
- Set up `accelerate config` like this:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch_policy: BACKWARD_PRE
fsdp_offload_params: false
fsdp_sharding_strategy: 1
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_transformer_layer_cls_to_wrap: GPT2Block
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
- Call `accelerate launch reward_summarization.py`
This results in the following error:
/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Error detected in WhereBackward0. Traceback of forward call that caused the error:
File "reward_summarization.py", line 203, in <module>
trainer.train(script_args.resume_from_checkpoint)
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 2699, in training_step
loss = self.compute_loss(model, inputs)
File "reward_summarization.py", line 185, in compute_loss
rewards_j = model(input_ids=inputs["input_ids_j"], attention_mask=inputs["attention_mask_j"])[0]
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1420, in forward
transformer_outputs = self.transformer(
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 899, in forward
outputs = block(
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 389, in forward
attn_outputs = self.attn(
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 330, in forward
attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 201, in _attn
attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
(Triggered internally at /opt/conda/conda-bld/pytorch_1678402379298/work/torch/csrc/autograd/python_anomaly_mode.cpp:114.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "reward_summarization.py", line 203, in <module>
trainer.train(script_args.resume_from_checkpoint)
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/transformers/trainer.py", line 2717, in training_step
loss.backward()
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/ubuntu/miniconda3/envs/trl/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDABoolType [1, 1, 385, 385]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Expected behavior
I expect it to run fine, but it ends in that error. Although this is not native Hugging Face code, the issue seems to come from the GPT-2 / Trainer code in transformers.
I cannot transfer the issue to the trl repo but it should be opened there since the bug is in their example.
@sgugger I have already posted it there, and it seems the issue is not on the TRL side.
torch.autograd.set_detect_anomaly(True) reports that the root of the issue might be line 201 in site-packages/transformers/models/gpt2/modeling_gpt2.py.
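(For anyone else debugging this: anomaly detection is a single line of plain PyTorch placed before trainer.train(); it slows training, so use it only for diagnosis.)

```python
import torch

# Enable anomaly detection so the backward-pass error also prints the forward
# traceback of the op that produced the offending tensor (debugging only).
torch.autograd.set_detect_anomaly(True)
```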
It turned out that modifying line 201 as below solves the issue:
attn_weights = torch.where(causal_mask.clone(), attn_weights.to(attn_weights.dtype).clone(), mask_value)
For reference, the original line was:
attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
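For context, the failure mode can be reproduced outside transformers: torch.where saves its boolean condition for backward, and mutating that saved tensor in place before backward bumps its version counter. A minimal standalone sketch (not from the reward script):

```python
import torch

cond = torch.tensor([True, False, True])
x = torch.randn(3, requires_grad=True)

out = torch.where(cond, x, torch.zeros(3))  # WhereBackward0 saves `cond`
cond[0] = False                             # in-place edit bumps cond's version counter

out.sum().backward()
# RuntimeError: one of the variables needed for gradient computation
# has been modified by an inplace operation
```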
@sgugger Do you know if it is a safe modification?
This will break the flow of the gradients from the attention weights, so no, it's not a good fix.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Any update on this? I am having the same issue
I'm experiencing same issue with WhisperModel
Actually, according to the PyTorch documentation, the clone() operation does not break the flow of the gradient. See here:
This function is differentiable, so gradients will flow back from the result of this operation to input. To create a tensor without an autograd relationship to input see detach().
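A quick sanity check of that statement in plain PyTorch:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.clone() * 2        # clone() participates in autograd like any other op
y.sum().backward()

print(x.grad)            # tensor([2., 2., 2.]) -- gradients flowed back through clone()
```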
Apparently, previous torch versions did not check for this, but the gradients were wrong (the source is a lost Stack Overflow thread). There are several more issues linked to this one: #25130, #22225, #15677, #14179, #24996, #23087. Whether this was fixed in the latest versions of torch is also an open question, but all these issues use FSDP.
Every in-place operation seems to be causing this, and we have a lot of them 😓 cc @pacman100, wondering what you would recommend? Should we make everything compatible by removing in-place operations? That seems kind of impractical.
This wrapper: https://github.com/pytorch/pytorch/blob/main/torch/autograd/graph.py#L508 seems to add clone() wherever it's needed. Might be something to do there?
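Assuming the linked wrapper is torch.autograd.graph.allow_mutation_on_saved_tensors (an inference from the linked file, not confirmed in this thread), a rough sketch of how it would be used:

```python
import torch

cond = torch.tensor([True, False, True])
x = torch.randn(3, requires_grad=True)

# Under this context manager, tensors saved for backward are cloned when they
# are mutated in place, so backward can still see the original values.
with torch.autograd.graph.allow_mutation_on_saved_tensors():
    out = torch.where(cond, x, torch.zeros(3))
    cond[0] = False          # would normally invalidate the saved condition
    out.sum().backward()     # expected to succeed inside the context manager
```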
We should also pin the issue to redirect everyone who hits the FSDP + in-place operation problem.
Also, removing all in-place operations might increase memory usage a bit, so I would love an alternative solution for FSDP.
I'm hitting the same issue while trying to get the GPT-2 embeddings of target via the following call:
self.gpt2.transformer.wte(target)
Error message:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
However, when I did the trick below, it succeeded:
self.gpt2.transformer.wte(target.clone())
BTW, the GPT-2 model is set to evaluation mode: self.gpt2.eval()
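A minimal sketch of that workaround with a stock GPT-2 from transformers (the token ids and variable names are illustrative, not the commenter's code):

```python
import torch
from transformers import GPT2LMHeadModel

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2.eval()

target = torch.tensor([[464, 3290, 318]])        # arbitrary token ids

# Reported failing call:
# emb = gpt2.transformer.wte(target)

# Workaround from the comment above: give the embedding lookup its own copy.
emb = gpt2.transformer.wte(target.clone())
```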
Hello,
cc @pacman100 wondering what you would recommend? Should we make everything compatible removing inplace operations? Seems kind of impractible
I don't have any recommendations at present other than replacing the in-place operations. Let me try this example to see if it persists with the latest PyTorch version.
Will mark as WIP, as this is not something we are actively working on.
The error is triggered by the DDP buffer broadcasting mechanism. We need to set broadcast_buffers=False to avoid it:
model = torch.nn.parallel.DistributedDataParallel(model, broadcast_buffers=False, ...)
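For completeness, a minimal sketch of where that flag goes in a hand-rolled DDP setup (the placeholder model and the torchrun environment variables are illustrative, not from this thread):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                  # launched e.g. via torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(8, 8).to(local_rank)     # placeholder for the real model
model = DDP(model, device_ids=[local_rank], broadcast_buffers=False)
```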