Support `DDP(static_graph=True)` and gradient accumulation
I got `SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f22c43552b0> returned NULL without setting an error`
when setting `accumulate_grad_batches=2`, but I see nothing helpful in the log.
The error goes away when I change to `DDPStrategy(static_graph=False)`, set `accumulate_grad_batches` back to 1, or use `batch_size=3` (total `len(data) = 9`).
I wonder if there is some conflict between `DDPStrategy(static_graph=True)`, `accumulate_grad_batches`, and `batch_size`.
I want to keep `static_graph=True` because I am using `.gradient_checkpointing_enable()`.
Any help would be appreciated.
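If the interaction with `accumulate_grad_batches` cannot be resolved, one alternative I am considering is non-reentrant activation checkpointing, which (as far as I understand) works with DDP without `static_graph=True`. A minimal sketch using `torch.utils.checkpoint` directly (the module below is only a placeholder, not my actual model):

import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    # Placeholder module, only to illustrate the non-reentrant checkpoint call.
    def __init__(self, dim: int = 16):
        super().__init__()
        self.layer = torch.nn.Linear(dim, dim)

    def forward(self, x):
        # use_reentrant=False avoids the reentrant backward that is the usual
        # reason for needing DDP(static_graph=True) with checkpointing.
        return checkpoint(self.layer, x, use_reentrant=False)

block = CheckpointedBlock()
out = block(torch.randn(2, 16, requires_grad=True))
out.sum().backward()

For the Hugging Face model, newer `transformers` releases expose this via `model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})`; I have not checked whether that argument exists in 4.30.2.
Full traceback: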
Epoch 0: 40%|█████████████████████████████▏ | 2/5 [00:00<00:01, 2.05it/s, v_num=0]Traceback (most recent call last):
File "untitled.py", line 62, in <module>
trainer.fit(MM, train_dataloaders=train_loader)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 520, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 92, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 559, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 935, in _run
results = self._run_stage()
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 978, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
self.advance()
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 354, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 133, in run
self.advance(data_fetcher)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 218, in advance
batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 178, in run
closure()
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 140, in __call__
self._result = self.closure(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 135, in closure
self._backward_fn(step_output.closure_loss)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 233, in backward_fn
call._call_strategy_hook(self.trainer, "backward", loss, optimizer)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 288, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/strategies/strategy.py", line 199, in backward
self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 67, in backward
model.backward(tensor, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/lightning/pytorch/core/module.py", line 1054, in backward
loss.backward(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f22c43552b0> returned NULL without setting an error
Epoch 0: 40%|████ | 2/5 [00:02<00:03, 1.05s/it, v_num=0]
Minimal code to reproduce the error:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

name = "hfl/chinese-roberta-wwm-ext"

class AAA(pl.LightningModule):
    def __init__(self, **kwargs):
        super().__init__()
        self.model = BertForSequenceClassification.from_pretrained(name, num_labels=2)

    def forward(self, *inputs):
        outputs = self.model(inputs[0], attention_mask=inputs[1], labels=inputs[2])
        loss = outputs.loss
        return (loss, outputs)

    def training_step(self, batch, batch_idx):
        outputs = self(*batch)
        loss = outputs[0]
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=1e-5)

MM = AAA()

# Nine short Chinese training sentences ("this is the N-th training sample");
# the checkpoint is a Chinese RoBERTa, so the strings are kept as-is.
train_texts = ['这是第一条训练数据', '这是第二条训练数据', '这是第三条训练数据', '这是第四条训练数据', '这是第五条训练数据', '这是第六条训练数据', '这是第七条训练数据', '这是第八条训练数据', '这是第九条训练数据']
train_labels = [1, 0, 1, 1, 0, 1, 1, 1, 1]

tokenizer = BertTokenizer.from_pretrained(name)
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
train_dataset = torch.utils.data.TensorDataset(
    torch.tensor(train_encodings['input_ids']),
    torch.tensor(train_encodings['attention_mask']),
    torch.tensor(train_labels)
)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=2, shuffle=True)

trainer = pl.Trainer(
    accelerator="auto",
    devices="auto",
    strategy=DDPStrategy(static_graph=True),
    precision="16-mixed",
    num_sanity_val_steps=0,
    max_epochs=10,
    deterministic="warn",
    accumulate_grad_batches=2,
)
trainer.fit(MM, train_dataloaders=train_loader)
Environment:
torch 2.0.1
torchaudio 2.0.2
torchvision 0.15.2
lightning 2.0.2
transformers 4.30.2
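To help narrow down whether the conflict is in Lightning's accumulation handling or in DDP itself, here is a stripped-down plain-PyTorch sketch of what I believe Lightning does during accumulation (`no_sync()` for the non-final step, then a synced backward); it uses a single-process gloo group only so it can run without multiple GPUs:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 2), static_graph=True)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
data, target = torch.randn(4, 4), torch.randn(4, 2)

# First micro-batch: gradient sync skipped, mirroring a non-final accumulation step.
with model.no_sync():
    torch.nn.functional.mse_loss(model(data), target).backward()

# Second micro-batch: synced backward, then the optimizer step.
torch.nn.functional.mse_loss(model(data), target).backward()
opt.step()
opt.zero_grad()

dist.destroy_process_group()

If this pattern also fails outside Lightning, the problem would be in DDP's `static_graph` handling rather than in the Trainer's accumulation loop.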
Originally posted by @iamlockelightning in https://github.com/Lightning-AI/pytorch-lightning/discussions/18080
I'm also observing this issue in the latest version of pytorch-lightning (2.1.3).
cc @justusschock @awaelchli
@awaelchli If help is still wanted, please assign this issue to me. I have a bit of time to work on it.
Of course, @nik777, please go ahead, that would be great! Not to discourage you of course, but I think it might be a hard one to solve :)
Any progress on this? Thanks so much!