Fix: Allow list[str] for pandas DataFrame columns and index parameters
Summary
Fixes #1657.
This fixes a type checking bug where Pyrefly incorrectly flagged errors when using list[str] for columns and index parameters in the pd.DataFrame constructor:
import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'c'], index=['a','b','c'])
# Error: Argument `list[str]` is not assignable to parameter `columns`
# Error: Argument `list[str]` is not assignable to parameter `index`
The root cause is pandas 2.x's incorrect SequenceNotStr protocol definition in typeshed. The protocol's index() method signature is:
def index(self, value: Any, /, start: int = 0, stop: int = ...) -> int
This makes only value position-only, whereas list.index() declares every parameter position-only: list.index(value, start, stop, /). Because the parameter kinds differ, list.index() does not structurally match the protocol method, so list[str] fails to satisfy the protocol constraint.
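The mismatch can be reproduced at runtime with a small sketch. The two functions below are hypothetical stand-ins written for illustration (they are not the pandas or typeshed definitions); they only mirror the parameter kinds described above. The last helper shows why the difference matters: the protocol signature promises callers that start can be passed by keyword, which a real list rejects.

```python
import inspect
from typing import Any


def protocol_index(self, value: Any, /, start: int = 0, stop: int = 0) -> int:
    """Stand-in for the pandas 2.x protocol method: start/stop stay positional-or-keyword."""
    return 0


def list_style_index(self, value: Any, start: int = 0, stop: int = 0, /) -> int:
    """Stand-in mirroring list.index(): every parameter is position-only."""
    return 0


def param_kinds(func) -> dict[str, str]:
    """Map each parameter name to the name of its kind, e.g. 'POSITIONAL_ONLY'."""
    return {name: p.kind.name for name, p in inspect.signature(func).parameters.items()}


def keyword_call_fails(seq: list) -> bool:
    """Return True if calling seq.index() with start= as a keyword raises TypeError."""
    try:
        seq.index(seq[0], start=0)
    except TypeError:
        return True
    return False


print(param_kinds(protocol_index))
print(param_kinds(list_style_index))
print(keyword_call_fails(["a", "b", "c"]))
```

A type checker that compares parameter kinds strictly treats the second signature as unable to stand in for the first, which is the behavior reported in the issue.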
The Fix
Added corrected pandas stubs to Pyrefly's bundled typeshed, matching the fix on the pandas main branch (targeted for pandas 2.3/3.0). The corrected SequenceNotStr protocol in pandas/_typing.pyi now defines:
def index(self, value: Any, start: int = ..., stop: int = ..., /) -> int
All parameters are position-only, matching the list.index() signature. This allows list[str] to properly satisfy the SequenceNotStr[str] protocol used by DataFrame's columns and index parameters.
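As a rough illustration of the corrected shape, here is a trimmed-down protocol using the position-only index() signature. SequenceNotStrSketch and make_labels are hypothetical names, and the method set is a simplified assumption rather than the full pandas definition.

```python
from typing import Any, Iterator, Protocol, TypeVar, runtime_checkable

_T_co = TypeVar("_T_co", covariant=True)


@runtime_checkable
class SequenceNotStrSketch(Protocol[_T_co]):
    """Simplified, hypothetical version of the corrected protocol."""

    def __len__(self) -> int: ...
    def __iter__(self) -> Iterator[_T_co]: ...
    def __contains__(self, value: object, /) -> bool: ...
    # All parameters are position-only, mirroring list.index(value, start, stop, /).
    def index(self, value: Any, start: int = ..., stop: int = ..., /) -> int: ...
    def count(self, value: Any, /) -> int: ...


def make_labels(labels: SequenceNotStrSketch[str]) -> list[str]:
    """Accept any sequence-like object of labels and copy it into a list."""
    return [label for label in labels]


print(make_labels(["a", "b", "c"]))
```

Note that a runtime isinstance() check against a runtime_checkable protocol only verifies that the methods exist; the parameter-kind comparison discussed here is performed by the static type checker.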
The stubs include:
- pandas/_typing.pyi - Corrected SequenceNotStr protocol and type aliases
- pandas/__init__.pyi - DataFrame, Series exports and common functions
- pandas/core/frame.pyi - DataFrame class definition
- pandas/core/series.pyi - Series class definition
- METADATA.toml - Package metadata
- pandas/core/__init__.pyi - Core module stub
Total: 6 stub files (82 lines) + test (140 lines)
Test Plan
- Added regression tests in pyrefly/lib/test/pandas/dataframe.rs:
cargo test test_dataframe
Result: passes
References:
- https://github.com/pandas-dev/pandas/issues/56995
- https://github.com/pandas-dev/pandas/blob/main/pandas/_typing.py
Thanks for the PR, and for looking into Pandas to diagnose our issues!
Requesting review from @rchen152 because I think she has the best context on stub bundling - as a rule of thumb I don't think we would want to override released stubs, but if the typeshed stubs are badly broken for a library this important we might need to consider it.
We also might be able to hack Pyrefly internals a bit if necessary as an alternative - in theory I think an edge case in is_subset_eq might actually be easier to maintain than stub overrides.
Hi @stroxler,
Thanks for the suggestion! I added an edge case in is_subset_eq that allows position-only parameters (PosOnly) to match regular positional parameters (Pos) in protocol checking.
Added test_dataframe_with_broken_stubs which uses the actual broken pandas 2.x stubs and verifies list[str] correctly matches SequenceNotStr[Any].
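For readers unfamiliar with the rule, the relaxation can be modeled with a toy check. This is a hypothetical Python sketch of the idea only, not Pyrefly's Rust implementation of is_subset_eq; Kind, kinds_match, and the relaxed flag are names invented for illustration.

```python
from enum import Enum


class Kind(Enum):
    POS_ONLY = "PosOnly"  # may only be passed positionally
    POS = "Pos"           # may be passed positionally or by keyword


def kinds_match(got: list[Kind], want: list[Kind], relaxed: bool) -> bool:
    """Check whether an implementation's parameter kinds (got) satisfy a protocol's (want)."""
    if len(got) != len(want):
        return False
    for g, w in zip(got, want):
        if g is w:
            continue
        if g is Kind.POS and w is Kind.POS_ONLY:
            continue  # also accepting keywords is strictly more permissive
        if relaxed and g is Kind.POS_ONLY and w is Kind.POS:
            continue  # the relaxation: tolerate a position-only implementation
        return False
    return True


# Parameter kinds for (value, start, stop), excluding self.
LIST_INDEX = [Kind.POS_ONLY, Kind.POS_ONLY, Kind.POS_ONLY]
BROKEN_PROTOCOL = [Kind.POS_ONLY, Kind.POS, Kind.POS]

print(kinds_match(LIST_INDEX, BROKEN_PROTOCOL, relaxed=False))
print(kinds_match(LIST_INDEX, BROKEN_PROTOCOL, relaxed=True))
```

The trade-off is that the relaxed rule accepts implementations that would reject a keyword call such as index(x, start=1), in exchange for tolerating the released stubs without overriding them.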
Testing
cargo test test_dataframe
Result: passes
@stroxler has imported this pull request. If you are a Meta employee, you can view this in D89300691.
@stroxler merged this pull request in facebook/pyrefly@ec5bed344da722bdeaaecefea2a58fba6757fdfe.