Pandas filter inferred as Series instead of DataFrame
This bug is transferred from https://github.com/microsoft/pyright/issues/950:
Describe the bug
The output of filtering a pandas DataFrame (like df[df.col == x]) is inferred as a Series, when it is in fact another DataFrame.
To Reproduce
df1: pd.DataFrame = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=['a', 'b']
)
df1 = df1[df1.col == 1]
Error:
"Series[Dtype]" is incompatible with "DataFrame" Pylance (reportGeneralTypeIssues)
If I remove the pd.DataFrame type annotation, it works. But this is still significant because it applies to functions too:
def f(df: pd.DataFrame) -> pd.DataFrame:
    return df[df.col == 1]
Gives the same error.
Expected behavior
Pyright should recognise the output as a DataFrame and not give any errors.
VS Code extension or command-line
Pylance 2020.8.1, pyright 1.1.61
Here are the stubs for DataFrame.__getitem__:
@overload
def __getitem__(self, idx: _str) -> Series[Dtype]: ...
@overload
def __getitem__(self, rows: slice) -> DataFrame: ...
@overload
def __getitem__(
    self, idx: Union[Series[_bool], DataFrame, List[_str], Index[_str], np.ndarray_str],
) -> DataFrame: ...
Judging by those, the stubs say the return type is a DataFrame.
There's also this one inherited from NDFrame:
def __getitem__(self, item) -> None: ...
and this one from SelectionMixin:
def __getitem__(self, key): ...
but neither of those should cause a Series to be returned. So it seems like a pyright issue?
pyright doesn't seem to have a problem with it:
PS C:\Users\italo\Desktop> cat .\test.py
import pandas as pd
df1: pd.DataFrame = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=['a', 'b']
)
df1 = df1[df1.col == 1]
PS C:\Users\italo\Desktop> pyright .\test.py
stubPath C:\Users\italo\Desktop\typings is not a valid directory.
Assuming Python platform Windows
Searching for source files
Found 1 source file
0 errors, 0 warnings
Completed in 0.6sec
Pyright doesn't come with these stubs at all, and unless you configure it to do so, it won't scan libraries for types either. It's possible that it's inferring things as Unknown, which is treated as another name for Any unless you enable strict mode.
I see. How can I test the CLI pyright with the pylance stubs to make sure it is indeed a pylance or pyright issue?
You could copy the bundled stubs to the typings directory (the one it warns about not existing) and see if the result changes.
Note that I don't think we would differ in this situation. Pylance doesn't modify the core type checking that pyright does in any major fashion besides adding extra stubs and features on top.
There clearly is an issue here Italo; I think the team should look into it more deeply. It may be related to how we process overloads. I'm not suggesting the issue be moved to pyright; we can do that once we have verified further.
Is this a dupe of #276 (or vice versa), then? The analysis is split between them, but preferably we'd merge if they're the same.
Ah, I see the problem. The expression df1.col is an unknown type because the __getattr__ method on NDFrame doesn't have a return type annotation.
def __getattr__(self, name: _str) : ...
This means that the expression df1.col == 1 also has an unknown type because Pyright can't evaluate the __eq__ method.
When matching overloads for __getitem__, the first overload is chosen because the argument is of an unknown type.
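For anyone following along, here is a minimal sketch of that chain without pandas. The toy Series and DataFrame classes below are stand-ins invented for illustration, not the real stubs; the only things they share with the stubs quoted above are an unannotated __getattr__ and a str overload listed first on __getitem__.

```python
from typing import Dict, List, Union, overload


class Series:
    """Toy stand-in for pandas.Series (illustrative only)."""

    def __init__(self, values: List[int]) -> None:
        self.values = values

    def __eq__(self, other: object) -> "Series":  # type: ignore[override]
        # Element-wise comparison producing a 0/1 mask.
        return Series([int(v == other) for v in self.values])


class DataFrame:
    """Toy stand-in for pandas.DataFrame (illustrative only)."""

    def __init__(self, columns: Dict[str, List[int]]) -> None:
        self._columns = columns

    # No return annotation, mirroring the NDFrame stub:
    # a type checker sees the result as Unknown.
    def __getattr__(self, name: str):
        if name.startswith("_") or name not in self._columns:
            raise AttributeError(name)
        return Series(self._columns[name])

    @overload
    def __getitem__(self, idx: str) -> Series: ...
    @overload
    def __getitem__(self, idx: Series) -> "DataFrame": ...
    def __getitem__(self, idx: Union[str, Series]) -> Union[Series, "DataFrame"]:
        if isinstance(idx, str):
            return Series(self._columns[idx])
        # Boolean-mask filtering: keep rows where the mask is truthy.
        return DataFrame(
            {
                name: [v for v, keep in zip(col, idx.values) if keep]
                for name, col in self._columns.items()
            }
        )


df = DataFrame({"col": [1, 3], "b": [2, 4]})

# df.col is Unknown to the checker, so df.col == 1 is Unknown too, and a
# checker behaving as described above would match the first (str) overload.
filtered = df[df.col == 1]
```

At runtime filtered is a DataFrame, so the mismatch exists only in the static analysis.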
This issue also happens with .loc when called with a callable as the first argument.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list("ABCD"))
df2 = df.loc[lambda x: x.A > 50, ["A", "B"]] # inferred type is Series[Dtype]
Using the dataframe directly, the type is inferred correctly:
df2 = df.loc[df.A > 50, ["A", "B"]] # inferred type is DataFrame
This code also gives an error on the randint line ("randint" is not a known member of module). If that isn't a known bug, I can create an issue for it as well.
I seem to have a related issue when I use:
a_string_element: str = my_series_with_string_values.loc[idx]
Pylance lints this saying:
Type "Unknown | Series[Any]" cannot be assigned to type "str"
"Series[Any]" is incompatible with "str"
Pylance(reportGeneralTypeIssues)
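Until the stubs can express a scalar return for this case, one possible workaround is to narrow the value yourself before assigning it. This is only a sketch: expect_type is a hypothetical helper name, not part of pandas or Pylance, and a plain dict stands in for the Series so the snippet runs without pandas.

```python
from typing import Dict, Type, TypeVar

T = TypeVar("T")


def expect_type(value: object, expected: Type[T]) -> T:
    """Return value unchanged if it is an instance of expected, else raise."""
    if not isinstance(value, expected):
        raise TypeError(
            f"expected {expected.__name__}, got {type(value).__name__}"
        )
    return value


# A plain dict stands in for my_series_with_string_values here.
lookup: Dict[str, object] = {"x": "hello", "y": 3}

# The declared return type is T, so the assignment to str should type-check.
a_string_element: str = expect_type(lookup["x"], str)
```

With a real Series the call would presumably look like expect_type(my_series_with_string_values.loc[idx], str). Unlike typing.cast, this also checks the value at runtime, at the cost of one isinstance call per lookup.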
Hi, stumbled upon this recently. Is there a reason that this change to the __getattr__ of NDFrame (within python-type-stubs) doesn't work?
def __getattr__(self, name: _str) -> Series[Dtype]: ...
I'm not familiar with the testing process for that project, so I don't know how to tell if this breaks something. But at least using pylance in VS Code, this seems to work okay:
import pandas as pd
df = pd.DataFrame()
reveal_type(df.col) # Type of "df.col" is "Series[Dtype@__getattr__]"Pylance
reveal_type(df[df.col == 1]) # Type of "df[df.col == 1]" is "DataFrame"Pylance
Looks like the latest type stubs fix this:
