
an inplace operation

Open zhl98 opened this issue 1 year ago • 5 comments

Hello, I encountered this problem during training. Do you know where the problem is? The shape of the torch.cuda.LongTensor is [1, 25].

[screenshot of the in-place operation RuntimeError]

zhl98 avatar Sep 19 '23 04:09 zhl98

Hi, can you give more context on the issue so that I can help you?

antoyang avatar Sep 26 '23 12:09 antoyang

Hi, I have the same problem. Did you figure out how to fix this?

SkylerSuen avatar Oct 20 '23 07:10 SkylerSuen

I encountered a similar error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [1, 13]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
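
For reference, enabling it only needs one line before the training loop (a minimal sketch; the exact place in TubeDETR's main.py may differ):

    import torch

    # Turn on autograd anomaly detection before the training loop so the
    # backward-pass error points at the forward op that produced the tensor.
    # This slows training noticeably, so only enable it while debugging.
    torch.autograd.set_detect_anomaly(True)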

I set torch.autograd.set_detect_anomaly(True) and got the following output:


/scratch/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Error detected in EmbeddingBackward0. Traceback of forward call that caused the error:
  File "main_new.py", line 651, in <module>
    main(args)
  File "main_new.py", line 591, in main
    train_stats = train_one_epoch(
  File "/scratch/TubeDETR-main/engine.py", line 67, in train_one_epoch
    memory_cache = model(
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/TubeDETR-main/models/tubedetr.py", line 190, in forward
    memory_cache = self.transformer(
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/TubeDETR-main/models/transformer.py", line 256, in forward
    encoded_text = self.text_encoder(**tokenized)
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/miniconda3/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 828, in forward
    embedding_output = self.embeddings(
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/miniconda3/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 126, in forward
    token_type_embeddings = self.token_type_embeddings(token_type_ids)
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "main_new.py", line 651, in <module>
    main(args)
  File "main_new.py", line 591, in main
    train_stats = train_one_epoch(
  File "/scratch/TubeDETR-main/engine.py", line 148, in train_one_epoch
    losses.backward()
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/scratch/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [1, 13]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Infinitywxh avatar Nov 22 '23 13:11 Infinitywxh

I encountered the same problem. Is there anyone who solved it?

furkancoskun avatar Jun 07 '24 16:06 furkancoskun

I solved the problem by adding broadcast_buffers=False to torch.nn.parallel.DistributedDataParallel.

Change main.py line 373 as follows:

        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.gpu], find_unused_parameters=True, broadcast_buffers=False,
        )
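
For context, my understanding (not verified against the DDP internals) is that RoBERTa's embedding layer registers token_type_ids as a module buffer, and with broadcast_buffers=True DDP re-broadcasts buffers from rank 0 on every forward, modifying that LongTensor in place between forward and backward. That matches the version mismatch in the trace above. You can check that such a buffer exists with something like this (a rough sketch; buffer names depend on your transformers version):

    # Rough sketch: list any registered token_type_ids buffers on the model.
    # Run this after building the model but before wrapping it in DDP.
    for name, buf in model.named_buffers():
        if "token_type_ids" in name:
            print(name, tuple(buf.shape), buf.dtype)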

furkancoskun avatar Jun 08 '24 12:06 furkancoskun