pylance-release

Pandas filter inferred as Series instead of DataFrame

Open erictraut opened this issue 5 years ago • 11 comments

This bug is transferred from https://github.com/microsoft/pyright/issues/950:

Describe the bug The output of filtering a pandas DataFrame (like df[df.col == x]) is inferred as a Series, when it is in fact another DataFrame.

To Reproduce

import pandas as pd

df1: pd.DataFrame = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=['a', 'b']
)
# Filter rows with a boolean mask; the error is reported on this line.
df1 = df1[df1.col == 1]

Error:

  "Series[Dtype]" is incompatible with "DataFrame" Pylance (reportGeneralTypeIssues)

If I remove the pd.DataFrame type annotation, it works. But this is still significant because it applies to functions too:

def f(df: pd.DataFrame) -> pd.DataFrame:
    return df[df.col == 1]

Gives the same error.

Expected behavior Pyright should recognise the output as a DataFrame and not give any errors.

VS Code extension or command-line Pylance 2020.8.1, pyright 1.1.61

Aug 17 '20 14:08 erictraut

Here are the stubs for DataFrame.__getitem__:

    @overload
    def __getitem__(self, idx: _str) -> Series[Dtype]: ...
    @overload
    def __getitem__(self, rows: slice) -> DataFrame: ...
    @overload
    def __getitem__(
        self, idx: Union[Series[_bool], DataFrame, List[_str], Index[_str], np.ndarray_str],
    ) -> DataFrame: ...

Looking at those, it looks like the stubs are saying the return type is a DataFrame.

There's also this one inherited from NDFrame:

    def __getitem__(self, item) -> None: ...

and this one from SelectionMixin:

    def __getitem__(self, key): ...

but neither of those should cause a Series to be returned. So it seems like a pyright issue?

Aug 25 '20 20:08 gramster

pyright doesn't seem to have a problem with it:

PS C:\Users\italo\Desktop> cat .\test.py
import pandas as pd

df1: pd.DataFrame = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=['a', 'b']
)
df1 = df1[df1.col == 1]
PS C:\Users\italo\Desktop> pyright .\test.py
stubPath C:\Users\italo\Desktop\typings is not a valid directory.
Assuming Python platform Windows
Searching for source files
Found 1 source file
0 errors, 0 warnings
Completed in 0.6sec

Aug 25 '20 21:08 oyarsa

Pyright doesn't come with these stubs at all, and unless you configure it to, it won't scan libraries for types either. It may be inferring these types as Unknown, which is another name for Any unless you enable strict mode.

Aug 25 '20 21:08 jakebailey

I see. How can I test the CLI pyright with the pylance stubs to make sure it is indeed a pylance or pyright issue?

Aug 25 '20 21:08 oyarsa

You could copy the bundled stubs to the typings directory (the one it warns about not existing) and see if the result changes.

Note that I don't think we would differ in this situation. Pylance doesn't modify the core type checking that pyright does in any major fashion besides adding extra stubs and features on top.
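The stub-copying step suggested above could be scripted roughly as follows. This is only a sketch: `copy_stubs` is a hypothetical helper, and the location of the bundled stubs is not given in this thread, so it has to be supplied by the caller.

```python
import shutil
from pathlib import Path


def copy_stubs(bundled_stubs: Path, project_root: Path) -> Path:
    """Copy a stub directory into <project_root>/typings and return that path.

    `bundled_stubs` is an assumption: point it at wherever your Pylance
    install keeps its stubs.
    """
    target = project_root / "typings"
    # dirs_exist_ok (Python 3.8+) merges into an existing typings folder.
    shutil.copytree(bundled_stubs, target, dirs_exist_ok=True)
    return target
```

After copying, rerunning `pyright .\test.py` from the project root should pick up the `typings` directory instead of warning that it is not valid.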

Aug 25 '20 21:08 jakebailey

There clearly is an issue here, Italo; I think the team should look into it more deeply. It may be related to how we process overloads. I'm not suggesting the issue be moved to pyright; we can do that once we have verified further.

Aug 25 '20 21:08 gramster

Is this a dupe of #276 (or vice versa), then? The analysis is split between them, but preferably we'd merge if they're the same.

Aug 25 '20 21:08 jakebailey

Ah, I see the problem. The expression df1.col is an unknown type because the __getattr__ method on NDFrame doesn't have a return type annotation.

    def __getattr__(self, name: _str): ...

This means that the expression df1.col == 1 also has an unknown type because Pyright can't evaluate the __eq__ method.

When matching overloads for __getitem__, the first overload is chosen because the argument is of an unknown type.
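The chain described above can be reproduced with toy classes. This is a minimal sketch, not the real pandas stubs: `Mask`, `Column`, and `Frame` are hypothetical stand-ins, and the comments about what a checker infers restate the behaviour reported in this thread rather than a guarantee for every pyright version.

```python
from typing import Dict, List, Union, overload


class Mask:
    """Toy stand-in for a boolean Series used as a row filter."""

    def __init__(self, flags: List[bool]) -> None:
        self.flags = flags


class Column:
    """Toy stand-in for pandas.Series."""

    def __init__(self, values: List[int]) -> None:
        self.values = values

    def __eq__(self, other: object) -> Mask:  # type: ignore[override]
        return Mask([v == other for v in self.values])


class Frame:
    """Toy stand-in for pandas.DataFrame."""

    def __init__(self, data: Dict[str, List[int]]) -> None:
        self.data = data

    # No return annotation, mirroring the NDFrame stub quoted above,
    # so a checker treats `frame.a` as Unknown.
    def __getattr__(self, name: str):
        try:
            return Column(self.__dict__["data"][name])
        except KeyError:
            raise AttributeError(name) from None

    @overload
    def __getitem__(self, idx: str) -> Column: ...
    @overload
    def __getitem__(self, idx: Mask) -> "Frame": ...
    def __getitem__(self, idx: Union[str, Mask]) -> Union[Column, "Frame"]:
        if isinstance(idx, str):
            return Column(self.data[idx])
        # Keep only the rows whose mask flag is True.
        return Frame({
            key: [v for v, keep in zip(values, idx.flags) if keep]
            for key, values in self.data.items()
        })


frame = Frame({"a": [1, 3], "b": [2, 4]})
mask = frame.a == 1      # checker: Unknown, because frame.a is Unknown
filtered = frame[mask]   # checker at the time: first overload, so Column
```

At runtime `filtered` is a `Frame`; only the static inference goes wrong, which matches the original report.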

Aug 25 '20 21:08 erictraut

This issue also happens with .loc when called with a callable as the first argument.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list("ABCD"))

df2 = df.loc[lambda x: x.A > 50, ["A", "B"]]  # inferred type is Series[Dtype]

Using the dataframe directly, the type is inferred correctly:

df2 = df.loc[df.A > 50, ["A", "B"]]  # inferred type is DataFrame

This code also gives an error on the randint line ("randint" is not a known member of module). If that isn't a known bug, I can create an issue for it as well.

Nov 11 '20 11:11 oyarsa

I seem to have a related issue when I use: a_string_element: str = my_series_with_string_values.loc[idx]

Pylance lints this saying:

  Type "Unknown | Series[Any]" cannot be assigned to type "str"
    "Series[Any]" is incompatible with "str"
     Pylance(reportGeneralTypeIssues)

Jan 08 '21 14:01 Trezorro

Hi, stumbled upon this recently. Is there a reason that this change to the __getattr__ of NDFrame (within python-type-stubs) doesn't work?

    def __getattr__(self, name: _str) -> Series[Dtype]: ...

I'm not familiar with the testing process for that project, so I don't know how to tell if this breaks something. But at least using pylance in VS Code, this seems to work okay:

import pandas as pd

df = pd.DataFrame()
reveal_type(df.col)  # Type of "df.col" is "Series[Dtype@__getattr__]" (Pylance)
reveal_type(df[df.col == 1])  # Type of "df[df.col == 1]" is "DataFrame" (Pylance)
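One thing to weigh with this kind of change, sketched here with hypothetical toy classes rather than the real stubs: once `__getattr__` declares a return type, a checker generally applies that type to every dynamic attribute, including misspelled ones, while the runtime behaviour stays the same.

```python
from typing import Dict, List


class Column:
    """Toy stand-in for pandas.Series (hypothetical, not the real class)."""

    def __init__(self, values: List[int]) -> None:
        self.values = values


class Frame:
    """Toy stand-in for pandas.DataFrame."""

    def __init__(self, data: Dict[str, List[int]]) -> None:
        self.data = data

    # With a return annotation, a checker types every dynamic attribute
    # lookup as Column, including names that are not real columns.
    def __getattr__(self, name: str) -> Column:
        try:
            return Column(self.__dict__["data"][name])
        except KeyError:
            raise AttributeError(name) from None


frame = Frame({"a": [1, 3]})
col = frame.a    # checker: Column; runtime: a Column holding [1, 3]
# frame.typo     # checker: also Column; runtime: AttributeError
```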

Jan 19 '22 02:01 tuchandra

Looks like the latest type stubs fix this:

[image]

Oct 14 '22 23:10 rchiodo