pytorch-lightning
`self.log` raises an error when the number of dataloaders is not consistent
Bug description
Hi all,
I posted a discussion on the Lightning.ai forum here and @awaelchli suggested reporting it as an issue.
There might be a bug in the way self.log records dataloader_idx. Suppose we have two validation dataloaders, say A and B. We use A every epoch, but only use B every 2 epochs. However, when using self.log in validation_step(), the following error shows up:
You called self.log({name}, ...) twice in {fx} with different arguments. This is not allowed
(see here)
The way I implemented this was by switching the available dataloaders in val_dataloader() and reloading the dataloaders every epoch (via reload_dataloaders_every_n_epochs=1):
def val_dataloader(self):
    if self.should_run_B():
        return [loader_A, loader_B]
    else:
        return [loader_A]
I noticed that at an epoch that uses only one validation dataloader, dataloader_idx is always None, whereas at an epoch with two validation dataloaders, dataloader_idx is passed to validation_step sequentially as 0 or 1. If I understand correctly, this inconsistency is the main cause of the error.
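For illustration, a minimal way to observe this (my own diagnostic sketch on top of the BoringModel-based repro below, not part of the original report) is to print the received value inside validation_step:

def validation_step(self, batch, batch_idx, dataloader_idx=0):
    # Purely diagnostic: show which dataloader index Lightning passes in.
    # The value differs between single-dataloader and two-dataloader epochs,
    # which appears to be what trips up self.log's internal bookkeeping.
    print(f"epoch={self.current_epoch} dataloader_idx={dataloader_idx}")
    return super().validation_step(batch, batch_idx)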
Another interesting finding is that if we set add_dataloader_idx=True for all self.log calls, the program runs without error, but the TensorBoard logging is wrong: it shows c_0 and c_0/dataloader_idx_0 as two separate plots (see snapshot below). These were meant to stay in the same figure, but they get split into two, probably because of the alternating-dataloader design.
How to reproduce the bug
import torch
from torch.utils.data import DataLoader
from pytorch_lightning.demos.boring_classes import BoringModel, RandomDataset
from pytorch_lightning import Trainer


class TestModel(BoringModel):
    def training_step(self, batch, batch_idx):
        out = super().training_step(batch, batch_idx)
        self.log("a", out["loss"])
        self.log("b", out["loss"], on_step=True, on_epoch=True)
        return out

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        out = super().validation_step(batch, batch_idx)
        if dataloader_idx == 0:
            self.log("c_0", out["x"], add_dataloader_idx=False)
            self.log("d_0", out["x"], on_step=True, on_epoch=True, add_dataloader_idx=False)
        elif dataloader_idx == 1:
            self.log("c_1", out["x"], add_dataloader_idx=False)
            self.log("d_1", out["x"], on_step=True, on_epoch=True, add_dataloader_idx=False)
        return out

    def validation_epoch_end(self, outputs):
        self.log("g", torch.tensor(2, device=self.device), on_epoch=True)

    def val_dataloader(self):
        # Alternate between two validation dataloaders and one every other epoch
        if self.current_epoch % 2:
            return [DataLoader(RandomDataset(32, 64)), DataLoader(RandomDataset(32, 64))]
        else:
            return [DataLoader(RandomDataset(32, 64))]


model = TestModel()
trainer = Trainer(
    reload_dataloaders_every_n_epochs=1,
    default_root_dir="test",
    max_epochs=10,
    log_every_n_steps=1,
    enable_model_summary=False,
)
trainer.fit(model)
Error messages and logs
You called `self.log(c_0, ...)` twice in `validation_step` with different arguments. This is not allowed
Environment
- CUDA:
    - GPU:
        - NVIDIA TITAN RTX
    - available: True
    - version: 11.7
- Lightning:
    - pytorch-lightning: 1.7.2
    - pytorch-quantization: 2.1.2
    - torch: 1.12.0a0+8a1a93a
    - torch-tensorrt: 1.1.0a0
    - torchmetrics: 0.9.3
    - torchtext: 0.13.0a0
    - torchvision: 0.13.0a0
- System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.8.13
    - version: #63-Ubuntu SMP Thu Nov 24 13:43:17 UTC 2022
More info
No response
cc @carmocca @Blaizzy
Have you solved this problem? I ran into the same issue.