ruff
ruff copied to clipboard
Folder with module name detected as source leads to false `I001` error
This issue affects the expected sorting of imports in Python and hence (incorrectly) triggers a I001 error.
When there is a folder that has the same name as a module, it is (possibly incorrectly) identified as SourceMatch. Ruff then categorizes the module as Known(FirstParty) and adapts the expected sorting accordingly.
This happens commonly, but not exclusively, when using the wandb library because it creates a wandb folder, as discovered in https://github.com/ChartBoost/ruff-action/issues/20.
To reproduce the issue, create a Python file with these imports (I called it test.py):
import csv
import logging
import random
import sys
from typing import Any, Optional, TextIO
import torch
import torch.nn as nn
from torch import optim
from torcheval.metrics import (
Metric,
MulticlassAccuracy,
MulticlassF1Score,
MulticlassPrecision,
MulticlassRecall,
)
from tqdm import tqdm
import wandb
In the initial scenario, there is a wandb directory:
% ls -d wandb/
wandb/
Running Ruff to check the import sorting:
% poetry run ruff check -v --select=I001 test.py
[2024-03-22][08:57:28][ruff::resolve][DEBUG] Using configuration file (via parent) at: /Users/carstenschnober/LAHTeR/workspace/document-segmentation/pyproject.toml
[2024-03-22][08:57:28][ruff::commands::check][DEBUG] Identified files to lint in: 2.013375ms
[2024-03-22][08:57:28][ruff::diagnostics][DEBUG] Checking: /Users/carstenschnober/LAHTeR/workspace/document-segmentation/test.py
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch.nn' as Known(ThirdParty) (NoMatch)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'csv' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'random' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'sys' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'logging' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch' as Known(ThirdParty) (NoMatch)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'wandb' as Known(FirstParty) (SourceMatch("/Users/carstenschnober/LAHTeR/workspace/document-segmentation"))
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch' as Known(ThirdParty) (NoMatch)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torcheval.metrics' as Known(ThirdParty) (NoMatch)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'typing' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][08:57:28][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'tqdm' as Known(ThirdParty) (NoMatch)
[2024-03-22][08:57:28][ruff::commands::check][DEBUG] Checked 1 files in: 765.709µs
All checks passed!
The checks pass, wandb has been categorized as Known(FirstParty)
Now remove the wandb directory:
% mv wandb wandb.bak
% ls -d wandb/
ls: wandb/: No such file or directory
Running the same Ruff check triggers a I001 error on the same file, categorizing wandb as Known(ThirdParty); the module categorization is cached, so I remove the .ruff_cache directory first to reproduce the error:
% rm -r .ruff_cache
% poetry run ruff check -v --select=I001 test.py
[2024-03-22][09:06:42][ruff::resolve][DEBUG] Using configuration file (via parent) at: /Users/carstenschnober/LAHTeR/workspace/document-segmentation/pyproject.toml
[2024-03-22][09:06:42][ruff::commands::check][DEBUG] Identified files to lint in: 1.959083ms
[2024-03-22][09:06:42][ruff::diagnostics][DEBUG] Checking: /Users/carstenschnober/LAHTeR/workspace/document-segmentation/test.py
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch.nn' as Known(ThirdParty) (NoMatch)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'csv' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'random' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'sys' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'logging' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch' as Known(ThirdParty) (NoMatch)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'wandb' as Known(FirstParty) (SourceMatch("/Users/carstenschnober/LAHTeR/workspace/document-segmentation"))
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torch' as Known(ThirdParty) (NoMatch)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'torcheval.metrics' as Known(ThirdParty) (NoMatch)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'typing' as Known(StandardLibrary) (KnownStandardLibrary)
[2024-03-22][09:06:42][ruff_linter::rules::isort::categorize][DEBUG] Categorized 'tqdm' as Known(ThirdParty) (NoMatch)
[2024-03-22][09:06:42][ruff::commands::check][DEBUG] Checked 1 files in: 1.004834ms
test.py:1:1: I001 [*] Import block is un-sorted or un-formatted
Found 1 error.
[*] 1 fixable with the `--fix` option.
This is now the expected sorting that is generated when calling ruff --fix call above:
import csv
import logging
import random
import sys
from typing import Any, Optional, TextIO
import torch
import torch.nn as nn
import wandb
from torch import optim
from torcheval.metrics import (
Metric,
MulticlassAccuracy,
MulticlassF1Score,
MulticlassPrecision,
MulticlassRecall,
)
from tqdm import tqdm
This configuration option fixes the issue properly (see https://github.com/ChartBoost/ruff-action/issues/20#issuecomment-2012747916):
[tool.ruff.lint.isort]
known-third-party = ["wandb"]
However, it is difficult for users to identify the issue and fix the configuration accordingly. I think a better solution would be to have a more robust source directory detection.
A heuristics like checking for __init__.py or generally the presence of *.py as a condition file might be a solid starting point for Python.
We could consider requiring an __init__.py but it would be a breaking change, I think.
We could consider requiring an
__init__.pybut it would be a breaking change, I think.
I see how a folder that is now correctly categorized as a source folder might not have an __init__.py in some cases, so this requirement could perhaps be something for the next major release indeed.
If a source folder is required to contain any *.py file, however, I cannot see how it would break current behaviour -- except for cases in which the sorting has now been incorrectly adapted to identified source folders.
So perhaps requiring any .py file could be an intermediate fix, whereas the next major release requires an __init__.py?
If somebody could point me to the relevant source code, I could try and work on a PR.
Running the same Ruff check triggers a I001 error on the same file, categorizing wandb as Known(ThirdParty); the module categorization is cached, so I remove the .ruff_cache directory first to reproduce the error
Shouldn't the removal of the wandb directory be enough to invalidate the module categorization cache? The fact that you need to manually remove the cache to get the correct results also seems wrong.
Diagnostics are cached on a per-file basis, where the "file" is the file in which the diagnostic is present. Changing other files on the filesystem doesn't invalidate the cache.
Changing the settings will also invalidate the cache. So if you add wandb as a known-third-party module (which is the suggested change), it will also reflect that change on re-run without clearing the cache mnanually.
Say you were using a third party package in your requirements, but decided later to vendor it to make changes, turning it into a first party package. Then unless caches were manually cleaned you would get incorrect results from the isort check. Even worse, they could be inconsistent with CI checks, which can cause a lot of confusion. We ran into an issue like this today, and had to turn off caching as a result.
Say you were using a third party package in your requirements, but decided later to vendor it to make changes, turning it into a first party package. Then unless caches were manually cleaned you would get incorrect results from the isort check. Even worse, they could be inconsistent with CI checks, which can cause a lot of confusion. We ran into an issue like this today, and had to turn off caching as a result.
Honestly, I think this should be a separate issue about caching. It is only related in the sense that caching must somehow also handle the detection of third party folders. But if you propose improvements in the caching logic, I think it would make sense to do this separately from this issue (detection of source folders).
We're looking into changing how we cache data as part of our multifile analysis work. It will allow us to invalidate caches based on dependencies.