pandas-vet icon indicating copy to clipboard operation
pandas-vet copied to clipboard

False positives with common method names

Open deppen8 opened this issue 5 years ago • 8 comments

This is a dedicated issue for the big discussion in #74

The problem is that many of our checks rely on the type of the object being a pandas object. This is a fundamental issue with static linting in Python because the AST doesn't know what type a thing is. This leads to false positives for things like re.sub() or dict.values()

I am open to suggestions on how to get around this, but it will likely be a big job. Some kind of integration with mypy or some other way to leverage type annotations might be a way to fix this, at least for folks who use those type annotations. What exactly that looks like is unclear to me, so please let me know if you have any ideas.

For now, the undesirable workaround is to turn off checks that are particularly bothersome.

deppen8 avatar Feb 07 '20 17:02 deppen8

Here is a snippet of the ast produced with type annotations in python 3.8 (I imagine you would get the same or similar output using typed_ast (https://github.com/python/typed_ast) in earlier python versions). Note the annotations are decorated in the function argument and in the return value. (ast printing with https://github.com/asottile/astpretty)

>>> import pandas as pd; import numpy as np; import ast; import astpretty
>>> src = "def f(s: pd.core.series.Series) -> np.ndarray: return s.values"
>>> astpretty.pprint(ast.parse(src).body[0])
FunctionDef(
    lineno=1,
    col_offset=0,
    end_lineno=1,
    end_col_offset=62,
    name='f',
    args=arguments(
        posonlyargs=[],
        args=[
            arg(
                lineno=1,
                col_offset=6,
                end_lineno=1,
                end_col_offset=30,
                arg='s',
                annotation=Attribute(
                    lineno=1,
                    col_offset=9,
                    end_lineno=1,
                    end_col_offset=30,
                    value=Attribute(
                        lineno=1,
                        col_offset=9,
                        end_lineno=1,
                        end_col_offset=23,
                        value=Attribute(
                            lineno=1,
                            col_offset=9,
                            end_lineno=1,
                            end_col_offset=16,
                            value=Name(lineno=1, col_offset=9, end_lineno=1, end_col_offset=11, id='pd', ctx=Load()),
                            attr='core',
                            ctx=Load(),
                        ),
                        attr='series',
                        ctx=Load(),
                    ),
                    attr='Series',
                    ctx=Load(),
                ),
                type_comment=None,
            ),
        ],
        vararg=None,
        kwonlyargs=[],
        kw_defaults=[],
        kwarg=None,
        defaults=[],
    ),
    body=[
        Return(
            lineno=1,
            col_offset=47,
            end_lineno=1,
            end_col_offset=62,
            value=Attribute(
                lineno=1,
                col_offset=54,
                end_lineno=1,
                end_col_offset=62,
                value=Name(lineno=1, col_offset=54, end_lineno=1, end_col_offset=55, id='s', ctx=Load()),
                attr='values',
                ctx=Load(),
            ),
        ),
    ],
    decorator_list=[],
    returns=Attribute(
        lineno=1,
        col_offset=35,
        end_lineno=1,
        end_col_offset=45,
        value=Name(lineno=1, col_offset=35, end_lineno=1, end_col_offset=37, id='np', ctx=Load()),
        attr='ndarray',
        ctx=Load(),
    ),
    type_comment=None,
)

However, we need the type information to be decorated on the Function.body's Attribute lookup to pick this up in visit_Attribute. The type information is not decorated down to this level of the tree. To get type information there, we'd need type inference.

Type inference is likely out of scope for this library, but mypy, pytype, jedi, etc. all do some amount of type inference internally to do their respective jobs. Here is a brief view into what is out there.

Mypy doesn't seem to expose a fully type inferred ast as something you can use in a library based on

  • https://github.com/python/mypy/issues/4713
  • https://github.com/python/mypy/pull/2444
  • https://github.com/python/mypy/issues/4868

pytype does have this functionality on its roadmap

  • https://github.com/google/pytype/issues/385
  • https://github.com/google/pytype/tree/master/pytype/tools/annotate_ast but it is reported as "in progress".

jedi seems to provide something like this feature, but I think its using its own ast and not the built-in ast module

  • https://github.com/davidhalter/jedi/issues/920

In summary, I don't see an obvious way forward without some effort, but I will continue to research this.

thomasjohns avatar Feb 10 '20 19:02 thomasjohns

I think this is difficult for mypy to support because mypy immediately converts the build-in ast into a mypy specific ast where it does type inference https://github.com/python/mypy/wiki/Implementation-Overview. I think the best option here in the short term is the new feature from Google's pytype, but it will require some defensive coding since the feature is "in progress".

I'll run some experiments with the pytype feature and see where it leads.

thomasjohns avatar Feb 10 '20 21:02 thomasjohns

Another project trying to solve this problem is https://github.com/mbdevpl/static-typing. It looks similar to pytype in regards to its "in progress" readme messaging.

thomasjohns avatar Feb 10 '20 21:02 thomasjohns

Checking if pandas is imported will help, but is not super safe. I see two options, disable check per line, if you have a conflicting line you mark it to ignore these checks, but this happens a lot with boto3 and dynamdb as an example. Another possibility is to add a safe_pandas option that disables known to be problematic checks. I will check some mid-size projects to try to get a list of these rules that are hard to deal with the current AST tooling.

lleites avatar Feb 11 '20 19:02 lleites

Could you also scan the file for multiple pandas "signatures", rather than relying on an explicit import statement? Might be able to check existing code using pandas to determine the most used methods and score the likelihood of the methods in question being on a pandas object.

On Tue, Feb 11, 2020 at 11:12 AM Leandro Leites Barrios < [email protected]> wrote:

Checking if pandas is imported will help, but is not super safe. I see two options, disable check per line, if you have a conflicting line you mark it to ignore these checks, but this happens a lot with boto3 and dynamdb as an example. Another possibility is to add a safe_pandas option that disables known to be problematic checks. I will check some mid-size projects to try to get a list of these rules that are hard to deal with the current AST tooling.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/deppen8/pandas-vet/issues/81?email_source=notifications&email_token=AAO2YGWIUDOZ36AKKPZ7AGDRCL2C5A5CNFSM4KRSFLKKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELNWALY#issuecomment-584802351, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAO2YGQMYJRHM6LM4T4MM6LRCL2C5ANCNFSM4KRSFLKA .

simchuck avatar Feb 11 '20 19:02 simchuck

Thanks for chiming in, @lleites .

Checking if pandas is imported will help, but is not super safe.

Just to confirm, by "not super safe" you mean it won't fix all issues, correct? Or is there some actual security vulnerability I am not seeing?

I see two options, disable check per line, if you have a conflicting line you mark it to ignore these checks, but this happens a lot with boto3 and dynamdb as an example.

Do you mean something like #no-qa comments? I haven't checked, but flake8 might already have this.

deppen8 avatar Feb 11 '20 20:02 deppen8

Sorry that I was not clear, I mean it won't fix all issues. Yes, you can disable specific checks per line in flake8 example = lambda: 'example' # noqa: E731 http://flake8.pycqa.org/en/3.1.1/user/ignoring-errors.html From my point of view documenting which are those rules that can have false positives and how to disable them in the README.md would be also a step in the right direction. I will check these projects I mention and come with a list of rules and comment here.

lleites avatar Feb 11 '20 20:02 lleites

Adding this to the README is an excellent idea. I will create a small issue to track that.

On Tue, Feb 11, 2020, 12:37 PM Leandro Leites Barrios < [email protected]> wrote:

Sorry that I was not clear, I mean it won't fix all issues. Yes, you can disable specific checks per line in flake8 example = lambda: 'example' # noqa: E731 http://flake8.pycqa.org/en/3.1.1/user/ignoring-errors.html From my point of view documenting which are those rules that can have false positives and how to disable them in the README.md would be also a step in the right direction. I will check these projects I mention and come with a list of rules and comment here.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/deppen8/pandas-vet/issues/81?email_source=notifications&email_token=AG333FLQBTWIK7XM5OWJPSTRCMECDA5CNFSM4KRSFLKKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELN7DHA#issuecomment-584839580, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG333FJUWCB2HQE7D7WZKXTRCMECDANCNFSM4KRSFLKA .

deppen8 avatar Feb 11 '20 20:02 deppen8