ruff icon indicating copy to clipboard operation
ruff copied to clipboard

Folder with module name detected as source leads to false `I001` error

Open carschno opened this issue 1 year ago • 7 comments
trafficstars

This issue affects the expected sorting of imports in Python and hence (incorrectly) triggers a I001 error.

When there is a folder that has the same name as a module, it is (possibly incorrectly) identified as SourceMatch. Ruff then categorizes the module as Known(FirstParty) and adapts the expected sorting accordingly.

This happens commonly, but not exclusively, when using the wandb library because it creates a wandb folder, as discovered in https://github.com/ChartBoost/ruff-action/issues/20.

To reproduce the issue, create a Python file with these imports (I called it test.py):

import csv
import logging
import random
import sys
from typing import Any, Optional, TextIO

import torch
import torch.nn as nn
from torch import optim
from torcheval.metrics import (
    Metric,
    MulticlassAccuracy,
    MulticlassF1Score,
    MulticlassPrecision,
    MulticlassRecall,
)
from tqdm import tqdm

import wandb

In the initial scenario, there is a wandb directory:

% ls -d wandb/
wandb/

Running Ruff to check the import sorting:

% poetry run ruff check -v --select=I001 test.py                                            
[2024-03-22][08:57:28][ruff::resolve][DEBUG] Using configuration file (via parent) at: /Users/carstenschnober/LAHTeR/workspace/document-segmentation/pyproject.toml
[2024-03-22][08:57:28][ruff::commands::check][DEBUG] Identified files to lint in: 2.013375ms
[2024-03-22][08:57:28][ruff::diagnostics][DEBUG] Checking: /Users/carstenschnober/LAHTeR/workspace/document-segmentation/test.py
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch.nn' as Known(ThirdParty) (NoMatch)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'csv' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'random' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'sys' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'logging' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch' as Known(ThirdParty) (NoMatch)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'wandb' as Known(FirstParty) (SourceMatch("/Users/carstenschnober/LAHTeR/workspace/document-segmentation"))
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch' as Known(ThirdParty) (NoMatch)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torcheval.metrics' as Known(ThirdParty) (NoMatch)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'typing' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'tqdm' as Known(ThirdParty) (NoMatch)
[2024-03-22][08:57:28][ruff::commands::check][DEBUG] Checked 1 files in: 765.709µs
All checks passed!

The checks pass, wandb has been categorized as Known(FirstParty)

Now remove the wandb directory:

% mv wandb wandb.bak
% ls -d wandb/
ls: wandb/: No such file or directory

Running the same Ruff check triggers a I001 error on the same file, categorizing wandb as Known(ThirdParty); the module categorization is cached, so I remove the .ruff_cache directory first to reproduce the error:

% rm -r .ruff_cache                             
% poetry run ruff check -v --select=I001 test.py
[2024-03-22][09:06:42][ruff::resolve][DEBUG] Using configuration file (via parent) at: /Users/carstenschnober/LAHTeR/workspace/document-segmentation/pyproject.toml
[2024-03-22][09:06:42][ruff::commands::check][DEBUG] Identified files to lint in: 1.959083ms
[2024-03-22][09:06:42][ruff::diagnostics][DEBUG] Checking: /Users/carstenschnober/LAHTeR/workspace/document-segmentation/test.py
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch.nn' as Known(ThirdParty) (NoMatch)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'csv' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'random' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'sys' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'logging' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch' as Known(ThirdParty) (NoMatch)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'wandb' as Known(FirstParty) (SourceMatch("/Users/carstenschnober/LAHTeR/workspace/document-segmentation"))
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch' as Known(ThirdParty) (NoMatch)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torcheval.metrics' as Known(ThirdParty) (NoMatch)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'typing' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'tqdm' as Known(ThirdParty) (NoMatch)
[2024-03-22][09:06:42][ruff::commands::check][DEBUG] Checked 1 files in: 1.004834ms
test.py:1:1: I001 [*] Import block is un-sorted or un-formatted
Found 1 error.
[*] 1 fixable with the `--fix` option.

This is now the expected sorting that is generated when calling ruff --fix call above:

import csv
import logging
import random
import sys
from typing import Any, Optional, TextIO

import torch
import torch.nn as nn
import wandb
from torch import optim
from torcheval.metrics import (
    Metric,
    MulticlassAccuracy,
    MulticlassF1Score,
    MulticlassPrecision,
    MulticlassRecall,
)
from tqdm import tqdm

This configuration option fixes the issue properly (see https://github.com/ChartBoost/ruff-action/issues/20#issuecomment-2012747916):

[tool.ruff.lint.isort]
known-third-party = ["wandb"]

However, it is difficult for users to identify the issue and fix the configuration accordingly. I think a better solution would be to have a more robust source directory detection. A heuristics like checking for __init__.py or generally the presence of *.py as a condition file might be a solid starting point for Python.

carschno avatar Mar 22 '24 08:03 carschno

We could consider requiring an __init__.py but it would be a breaking change, I think.

charliermarsh avatar Mar 24 '24 20:03 charliermarsh

We could consider requiring an __init__.py but it would be a breaking change, I think.

I see how a folder that is now correctly categorized as a source folder might not have an __init__.py in some cases, so this requirement could perhaps be something for the next major release indeed.

If a source folder is required to contain any *.py file, however, I cannot see how it would break current behaviour -- except for cases in which the sorting has now been incorrectly adapted to identified source folders. So perhaps requiring any .py file could be an intermediate fix, whereas the next major release requires an __init__.py?

If somebody could point me to the relevant source code, I could try and work on a PR.

carschno avatar Mar 25 '24 08:03 carschno

Running the same Ruff check triggers a I001 error on the same file, categorizing wandb as Known(ThirdParty); the module categorization is cached, so I remove the .ruff_cache directory first to reproduce the error

Shouldn't the removal of the wandb directory be enough to invalidate the module categorization cache? The fact that you need to manually remove the cache to get the correct results also seems wrong.

bkad avatar Apr 04 '24 00:04 bkad

Diagnostics are cached on a per-file basis, where the "file" is the file in which the diagnostic is present. Changing other files on the filesystem doesn't invalidate the cache.

Changing the settings will also invalidate the cache. So if you add wandb as a known-third-party module (which is the suggested change), it will also reflect that change on re-run without clearing the cache mnanually.

charliermarsh avatar Apr 04 '24 00:04 charliermarsh

Say you were using a third party package in your requirements, but decided later to vendor it to make changes, turning it into a first party package. Then unless caches were manually cleaned you would get incorrect results from the isort check. Even worse, they could be inconsistent with CI checks, which can cause a lot of confusion. We ran into an issue like this today, and had to turn off caching as a result.

bkad avatar Apr 04 '24 01:04 bkad

Say you were using a third party package in your requirements, but decided later to vendor it to make changes, turning it into a first party package. Then unless caches were manually cleaned you would get incorrect results from the isort check. Even worse, they could be inconsistent with CI checks, which can cause a lot of confusion. We ran into an issue like this today, and had to turn off caching as a result.

Honestly, I think this should be a separate issue about caching. It is only related in the sense that caching must somehow also handle the detection of third party folders. But if you propose improvements in the caching logic, I think it would make sense to do this separately from this issue (detection of source folders).

carschno avatar Apr 04 '24 07:04 carschno

We're looking into changing how we cache data as part of our multifile analysis work. It will allow us to invalidate caches based on dependencies.

MichaReiser avatar Apr 04 '24 07:04 MichaReiser