Create a pre-commit confidence level
Bug description
import pandas as pd


def function1():
    """
    some function declared anywhere else in the codebase.
    in another file and never called in my case.
    """
    df = pd.DataFrame()
    df.columns = [1, 2, 3]


def function2():
    """
    some function which sometimes operates on
    a default constructed DataFrame object
    """
    df = pd.DataFrame()
    # pylint incorrectly infers the type of `k` to be int
    df.rename(columns={k: k.lower() for k in df.columns})
EDIT: a simpler repro without pandas
class SomeType:
    def __init__(self):
        self.columns: list[str] = []


def function1():
    """
    some function declared anywhere else in the codebase.
    in another file and never called in my case.
    """
    df = SomeType()
    df.columns = [1, 2, 3]


def function2():
    df = SomeType()
    # pylint incorrectly infers the type of `k` to be int
    return {k: k.lower() for k in df.columns}
Configuration
No response
Command used
pylint repro.py
Pylint output
************* Module repro
repro.py:20:26: E1101: Instance of 'int' has no 'lower' member (no-member)
Expected behavior
No diagnostic
Pylint version
pylint 3.0.3
astroid 3.0.2
Python 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:03:24) [GCC 12.3.0]
OS / Environment
No response
Additional dependencies
pandas 2.1.4 or 1.5.3
Thank you for opening the issue. I'm hesitating to label this a false positive, because it seems to me that pylint being able to detect that `columns` can sometimes be `list[int]` (and looking everywhere it can to find that info) brings a lot of value (especially in half-typed / untyped code bases).
I'm hesitating to label this a false positive
I acknowledge that the behavior may be useful in some contexts, but I would definitely say this is a false positive.
It is not possible for int to be observed in function2, and so the warning pylint emits is not helpful.
Additionally, this causes "spooky action at a distance" because it is not deterministic.
If function1 and function2 are in different files, you will get different diagnostics from
`pylint file1.py; pylint file2.py` than from `pylint file1.py file2.py`.
This comes up all the time in large projects which run pylint on subsets of the project in parallel (for example, using pre-commit).
An almost daily interaction I am having right now at work is with someone less knowledgeable about pylint/pre-commit/etc asking why seemingly unrelated changes cause this (or similar) error to appear and fail their build.
Thanks for the report. I'm hearing two requests. One is to distinguish:
def function2():
    df = SomeType()
    return {k: k.lower() for k in df.columns}
from this, which is an error:
def function2():
    df = function1()  # assume function1 returns `df`
    return {k: k.lower() for k in df.columns}
That's not really on pylint's roadmap, which I know might be frustrating, but pylint/astroid's inference capabilities advertise 'all the values your variables might take' -- in other words, we've inherited a system which collects all the possible values for SomeType.columns and doesn't know what functions you've called on them by the time we lint one of the instances you have in hand.
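For illustration, astroid's own API can show the values it collects for df.columns. extract_node and inferred() are real astroid helpers; the output described in the comments is only what one would expect from the reported behavior, and may vary by version:

# Ask astroid what it infers for `df.columns` inside function2.
# The "#@" marker tells extract_node which expression to return.
import astroid

node = astroid.extract_node("""
class SomeType:
    def __init__(self):
        self.columns: list[str] = []

def function1():
    df = SomeType()
    df.columns = [1, 2, 3]

def function2():
    df = SomeType()
    df.columns  #@
""")

# Expect both values collected on SomeType's instance_attrs to show up here:
# the empty list assigned in __init__ and the [1, 2, 3] assigned in function1.
for value in node.inferred():
    print(value)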
The other request I'm hearing is to make this more deterministic. That's totally reasonable, but I'm having trouble reproducing.
% cat a.py
class SomeType:
    def __init__(self):
        self.columns: list[str] = []

def function2():
    df = SomeType()
    return {k: k.lower() for k in df.columns}

% cat b.py
from .a import SomeType

def function1():
    de = SomeType()
    de.columns = [1, 2, 3]

% pylint a.py b.py
************* Module pylint.a
a.py:7:15: E1101: Instance of 'int' has no 'lower' member (no-member)

------------------------------------------------------------------
Your code has been rated at 5.00/10 (previous run: 5.00/10, +0.00)
$ pylint a.py
$ pylint b.py
$ pylint a.py b.py
************* Module a
a.py:7:15: E1101: Instance of 'int' has no 'lower' member (no-member)
The issue is that when linted in isolation these files produce no diagnostic, but when linted together they do.
pylint 3.0.3
astroid 3.0.3
Python 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
But my example shows that I tried that, no? I ran pylint a.py b.py and got a no-member.
Are you defining the class SomeType anew in each file? That could explain the difference.
Are you defining the class SomeType anew in each file? That could explain the difference.
The contents of a.py and b.py in my above post are identical to what you showed.
But my example shows that I tried that, no? I ran pylint a.py b.py and got a no-member.
Your example shows only that you ran pylint a.py b.py and got a diagnostic.
The problem I'm trying to convey is that you cannot reproduce the diagnostic when linting either file separately.
That is what makes this inconsistent and causes us problems in our environment.
If you have a large project that uses pre-commit you will get spooky action at a distance where simply adding or removing a completely unrelated file will change diagnostic output because different files are passed to a single invocation of pylint.
I guess the root question is whether or not you consider the diagnostic emitted by the following code to be a false positive or not:
class SomeType:
    def __init__(self):
        self.columns: list[str] = []


def function1():
    """
    some function declared anywhere else in the codebase.
    in another file and never called in my case.
    """
    df = SomeType()
    df.columns = [1, 2, 3]


def function2():
    df = SomeType()
    # pylint incorrectly infers the type of `k` to be int
    return {k: k.lower() for k in df.columns}
I do, because the code in function1 is not modifying anything about the SomeType type; it is changing an attribute of one instance, and that can never be observed in function2.
I understand fixing this may not be on the roadmap, or may not be feasible given the design of astroid, but can we all agree that it is a false positive?
Got it, I think I ran one of the pylint a.py runs with the wrong contents and got mixed up. Thanks for bearing with :pray:
I understand fixing this may not be on the roadmap, or may not be feasible given the design of astroid, but can we all agree that it is a false positive?
It would be a false positive for a linter that's control-flow aware. What if SomeType.__init__() calls function1() (edited to return something) and assigns the result to .columns? Or does that conditionally? Then it's no longer a false positive and pylint got it right. I know I'm changing your hypothetical, so whether we call it a false positive or not, I'm just trying to illuminate the attitude behind the original design we inherited. I agree with you that in the example you supplied here, the message is not helpful.
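To make that hypothetical concrete, here is a sketch of the kind of variant being described (assuming function1 is edited to return the list), where the warning really would be justified:

class SomeType:
    def __init__(self):
        # If __init__ itself assigns whatever function1 returns, list[int]
        # genuinely can reach function2, and the no-member warning is correct.
        self.columns: list[str] = function1()


def function1():
    return [1, 2, 3]


def function2():
    df = SomeType()
    # At runtime k is an int here, so k.lower() raises AttributeError.
    return {k: k.lower() for k in df.columns}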
Given that, I think you raise a really good usability point about pre-commit. I don't think we've documented anywhere that if you just run pre-commit on your git diff, you're going to get fewer pylint messages for those specific files than you'd get if you linted your whole project. Linting the whole project is impractical for pre-commit, so maybe we need some sort of "confidence" level that can allow users to silence diagnostics like the one you've illustrated here.
I'm open to that, and would be eager to see someone explore the feasibility. If you don't mind, I'd like to retitle the issue to focus this for a contributor who might want to take this up. Thanks again for the issue. 👍
FWIW, I wasn't even thinking about pre-commit for the git diff subset, but as a CI step run against all files.
pre-commit takes the full list of files and "deterministically" shuffles them before chunking them into subsets that fit the command-line length limit and running pylint on each subset.
If you add or remove any file from the list, the deterministic shuffling changes, and the group of files passed to pylint changes.
That is what causes the most frustration for us.
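A rough sketch of that chunking behavior, as illustrative Python rather than pre-commit's actual implementation (the seed and chunk size here are made up):

import random


def chunk_files(files, chunk_size=3, seed=42):
    """Seeded shuffle, then fixed-size chunks: stable for a given file list,
    but any added or removed file changes the whole permutation, and therefore
    which files end up in the same pylint invocation."""
    files = sorted(files)
    random.Random(seed).shuffle(files)
    return [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]


print(chunk_files(["a.py", "b.py", "c.py", "d.py", "e.py", "f.py"]))
# Adding one unrelated file reshuffles everything, so a.py and b.py may move
# into (or out of) the same chunk, changing which no-member messages appear.
print(chunk_files(["a.py", "b.py", "c.py", "d.py", "e.py", "f.py", "g.py"]))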
I feel like not being able to specify that you don't want to parallelize is an issue with pre-commit and not pylint.
That being said, I think more and more that pylint should not be used in pre-commit:
- pylint requires an installed environment, so it can't be used with pre-commit.ci (https://github.com/pylint-dev/pylint/blob/main/.pre-commit-config.yaml#L2); you need a CI job for it
- pylint uses multiple-file analysis, so pre-commit's parallelization makes it worse and less deterministic
- above all, pylint is slow and always will be (duplicate-code is intrinsically a costly thing to check), and committing should be fast
I'm considering adding a disclaimer to the docs along the lines of "use pylint in a continuous integration job or periodically; do not run pylint during pre-commit or inside your IDE on save, because pylint is not meant for that and will only frustrate you", and adding a caveat to the pre-commit doc for those who still want to use pylint in pre-commit.
pylint uses multiple-file analysis, so pre-commit's parallelization makes it worse and less deterministic
Unless we're ready to say that running pylint with a list of files is a second-class experience--which it sort of is, but should we bite that bullet?--I had in mind something relatively simple.
When I looked at the root cause yesterday, it was that the information from both files is collated together in ClassDef.instance_attrs. When building the ASTs, we could collect slightly more information (which file created that instance attribute?), and then in pylint, when going through all the filters that prevent false positives for no-member, we could decide to be quiet when the heterogeneous instance_attrs come from two different files, assuming the user is running with higher than pre-commit confidence, or with a flag like --file-subset-mode, or whatever. Something like that.
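A hypothetical sketch of such a filter: ClassDef.instance_attrs and node.root().file are real astroid attributes, but the function, its parameters, and the option name are invented for illustration.

def should_skip_no_member(class_def, attr_name, current_file, file_subset_mode):
    """Stay quiet about no-member when the conflicting attribute assignments
    were collected from a file other than the one currently being linted."""
    assignments = class_def.instance_attrs.get(attr_name, [])
    # node.root() is the enclosing Module; its .file is the source path.
    origin_files = {assign.root().file for assign in assignments}
    return file_subset_mode and any(f != current_file for f in origin_files)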
I'm a little afraid about the maintenance implications of that pre-commit confidence. (I don't think I understand what it entails, and I fear that we'll have to think about what happens in pre-commit for all messages, which is scary.)
Unless we're ready to say that running pylint with a list of files is a second-class experience--which it sort of is, but should we bite that bullet?--I had in mind something relatively simple.
But pylint analyses all the calls in the code; if you don't analyse the files with a function call in them, you won't have all the possible values that can enter a function. It's kind of expected, and I don't think it's a wrong assumption to make or keep. We only need to warn users that parallelizing pylint deterministically is not easy and they shouldn't try to do it themselves. Also, I searched a little, and it seems that pre-commit has a require_serial option that could be set to true.
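For those who still want pylint in pre-commit, an illustrative local-hook configuration (entry, args, and the rest should be adapted to your project); require_serial is a real pre-commit hook option that keeps the hook from being parallelized across processes:

# .pre-commit-config.yaml (sketch)
repos:
  - repo: local
    hooks:
      - id: pylint
        name: pylint
        entry: pylint
        language: system
        types: [python]
        require_serial: true  # do not shard files across parallel processes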
That's fair. Probably not worth putting on the roadmap.
We should probably document that linting a subset of a project's files is a poorer experience than linting the whole project.