
`self.log` raises an error when the number of dataloaders is not consistent

Open ding3820 opened this issue 2 years ago • 1 comment

Bug description

Hi all,

I posted a discussion on the Lightning.ai forum here, and @awaelchli suggested that I report it as an issue.

There might be a bug in the way self.log records dataloader_idx. Suppose we have two validation dataloaders, say A and B. We use A every epoch, but B only every second epoch. In that case, calling self.log in validation_step() raises this error:

You called self.log({name}, ...) twice in {fx} with different arguments. This is not allowed

(see here)

I implemented this by switching the available dataloaders in val_dataloader() and reloading the dataloaders every epoch:

def val_dataloader(self):
    if self.should_run_B():
        return [loader_A, loader_B]
    else:
        return [loader_A]

I noticed that on epochs that use only one validation dataloader, dataloader_idx is always None, whereas on epochs with two validation dataloaders, dataloader_idx comes through as 0 or 1 in validation_step. If I understand correctly, this mismatch is the main cause of the error.
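If that is indeed the cause, a possible workaround (just a sketch reusing the placeholders loader_A, loader_B, and should_run_B() from above; I have not verified it) would be to keep the dataloader count constant and skip loader B's batches instead:

def val_dataloader(self):
    # Always expose both loaders so dataloader_idx stays 0/1 on every epoch.
    return [loader_A, loader_B]

def validation_step(self, batch, batch_idx, dataloader_idx=0):
    # Skip loader B's batches on epochs where it should not run
    # (they are still iterated, which costs some idle time).
    if dataloader_idx == 1 and not self.should_run_B():
        return None
    ...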

Another interesting finding: if we set add_dataloader_idx=True for all self.log calls, the program runs without error, but the TensorBoard logging is wrong and shows both c_0 and c_0/dataloader_idx_0 (see snapshot below). These two were meant to stay in the same figure, but somehow it got split into two figures, probably because of the alternating-dataloader design.

[screenshot: TensorBoard showing separate plots for c_0 and c_0/dataloader_idx_0]
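My understanding (an assumption on my part, not confirmed from the source) is that with add_dataloader_idx=True, Lightning appends the suffix only on epochs where more than one val dataloader is active, so the same metric ends up under two different tags:

# With add_dataloader_idx=True (the default), the key is suffixed with
# "/dataloader_idx_{i}" only when several val dataloaders exist:
self.log("c_0", out["x"], add_dataloader_idx=True)
# -> logged as "c_0" on single-loader epochs,
#    as "c_0/dataloader_idx_0" on two-loader epochs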

How to reproduce the bug

import torch
from torch.utils.data import DataLoader
from pytorch_lightning.demos.boring_classes import BoringModel, RandomDataset
from pytorch_lightning import Trainer

class TestModel(BoringModel):
    def training_step(self, batch, batch_idx):
        out = super().training_step(batch, batch_idx)
        self.log("a", out["loss"])
        self.log("b", out["loss"], on_step=True, on_epoch=True)
        return out

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        out = super().validation_step(batch, batch_idx)
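        # add_dataloader_idx=False keeps the metric names exactly as given,
        # instead of letting Lightning append "/dataloader_idx_{i}".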
        if dataloader_idx == 0:
            self.log("c_0", out["x"], add_dataloader_idx=False)
            self.log("d_0", out["x"], on_step=True, on_epoch=True, add_dataloader_idx=False)
        elif dataloader_idx == 1:
            self.log("c_1", out["x"], add_dataloader_idx=False)
            self.log("d_1", out["x"], on_step=True, on_epoch=True, add_dataloader_idx=False)
        return out

    def validation_epoch_end(self, outputs):
        self.log("g", torch.tensor(2, device=self.device), on_epoch=True)

    def val_dataloader(self):
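        # Alternate between one and two validation dataloaders every epoch.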
        if self.current_epoch % 2:
            return [DataLoader(RandomDataset(32, 64)), DataLoader(RandomDataset(32, 64))]
        else:
            return [DataLoader(RandomDataset(32, 64))]

model = TestModel()

trainer = Trainer(
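    # Reload the dataloaders every epoch so val_dataloader() is called again.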
    reload_dataloaders_every_n_epochs=1,
    default_root_dir="test",
    max_epochs=10,
    log_every_n_steps=1,
    enable_model_summary=False,
)
trainer.fit(model)

Error messages and logs

You called `self.log(c_0, ...)` twice in `validation_step` with different arguments. This is not allowed

Environment

  • CUDA:
    • GPU:
      • NVIDIA TITAN RTX
    • available: True
    • version: 11.7
  • Lightning:
    • pytorch-lightning: 1.7.2
    • pytorch-quantization: 2.1.2
    • torch: 1.12.0a0+8a1a93a
    • torch-tensorrt: 1.1.0a0
    • torchmetrics: 0.9.3
    • torchtext: 0.13.0a0
    • torchvision: 0.13.0a0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.13
    • version: #63-Ubuntu SMP Thu Nov 24 13:43:17 UTC 2022

More info

No response

cc @carmocca @Blaizzy

ding3820 · Jan 19 '23

Have you solved this problem? I ran into the same issue.

jin1041 · Jan 29 '24