pytorch-lightning icon indicating copy to clipboard operation
pytorch-lightning copied to clipboard

Mixing the order of `--config` and `fit` in LightningCLI can cause confusion

Open awaelchli opened this issue 1 year ago • 8 comments

Bug description

If you launch with

python main.py --config ... fit 

instead of

python main.py fit --config ...

Then you end up with cryptic errors such as

No action for key "trainer.accelerator

See report on twitter: https://x.com/4ndr3aR/status/1772676605837484054?s=20

image

This is because the LightningCLI parser is built after applying the fit stage is parsed, on only that can match the provided config file. In general, the order matters for jsonargparse for good reasons.

Is there something we can do to improve the error for the user, with a sanity check before parsing begins?

Repro example:

import sys

import torch
from lightning.pytorch import LightningModule
from torch.utils.data import DataLoader, Dataset

from lightning.pytorch.cli import LightningCLI


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

    def train_dataloader(self):
        return DataLoader(RandomDataset(32, 64), batch_size=2)


sys.argv = ["bug_report_model.py", "--config", "config.yaml", "fit"]
cli = LightningCLI(BoringModel)

cc @borda @carmocca @mauvilsa

awaelchli avatar Mar 29 '24 13:03 awaelchli

It also has the consequence that with LightningCLI(..., run=False) you can't provide a config file, and if you attempt you get:

error: Validation failed: No action for key "ckpt_path" to check its value.

awaelchli avatar Mar 29 '24 14:03 awaelchli

The problem is I don't know how to make a smart error message here, because the config file could also contain a section fit and then it works. Making a smart error message here would require parsing the config file before making a decision. Would we want this approach?

awaelchli avatar Mar 29 '24 15:03 awaelchli

One idea I can propose is to have a way to disable the --config option before the subcommand. It is true that a global config file with a fit: entry or any other subcommand. But I don't think that everyone needs or wants this option. Not being there avoids the possibility of confusion, since --config will only be accepted after the subcommand.

mauvilsa avatar Mar 30 '24 14:03 mauvilsa

Hey @mauvilsa thanks for the input. It's sort of what I tried to do in #19715 until I saw the tests failing that had a fit: entry. I think that even if we did that restriction, the only way for doing LightningCLI(..., run=False) with a config file would be to have the fit: entry so I'm not sure we can consider the option to remove this feature.

awaelchli avatar Mar 31 '24 01:03 awaelchli

Actually what I meant was to have a parameter, e.g. multi_subcommand_config: bool = True that if set to False then the global --config is not added. By default it should be True, otherwise it would be a breaking change.

mauvilsa avatar Mar 31 '24 13:03 mauvilsa

I wouldn't touch anything here. This is a feature (as already noticed) and I don't expect anybody to go ahead and set this strange flag that most won't understand why it exists.

I would say that a lot of the confusion stems from the error message presented by jsonargparse: "No action for key foo to check its value". @mauvilsa This of assumes that the user of the CLI understands how the CLI works (what is an action, what is a key, what is checking their values) but that's generally not true as most of the users are not the developers who implemented the CLI and the users have no idea about what's going on under the hood.

Could jsonargparse update or provide a way to customize this message? For the LightningCLI and the CLI in litgpt it would be better to show something like "Failed to parse the foo key in your config or arguments. Make sure the format matches that returned by --print_config"

carmocca avatar Apr 03 '24 11:04 carmocca

Could jsonargparse update or provide a way to customize this message? For the LightningCLI and the CLI in litgpt it would be better to show something like "Failed to parse the foo key in your config or arguments. Make sure the format matches that returned by --print_config"

The confusion reported in this issue is about providing a config before or after the subcommand. The custom error message above does not help. In fact, --print_config can also be used before or after the subcommand, making still confusing.

I do agree that the error messages in jsonargparse can be improved, which is something generic, and not particular to lightning. And better to invest time on that, than supporting a way to customize error messages which also requires effort. One possibility could be to track the source of the problematic key, and print different error messages depending on it origin. E.g. if the problem is before the subcommand, the error could suggest the user to look at the help without providing a subcommand. And if it is after, then the error could suggest the help for that particular subcommand.

mauvilsa avatar Apr 04 '24 06:04 mauvilsa