Fix: Allow list[str] for pandas DataFrame columns and index parameters

Open jackulau opened this issue 1 month ago • 2 comments

Summary

Fixes #1657.

This fixes a type checking bug where Pyrefly incorrectly flagged errors when using list[str] for columns and index parameters in the pd.DataFrame constructor:

import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'c'], index=['a','b','c'])
# Error: Argument `list[str]` is not assignable to parameter `columns`
# Error: Argument `list[str]` is not assignable to parameter `index`

The root cause is pandas 2.x's incorrect SequenceNotStr protocol definition in typeshed. The protocol's index() method signature is:

def index(self, value: Any, /, start: int = 0, stop: int = ...) -> int

This makes only value position-only, whereas list.index() declares all of its parameters position-only: list.index(value, start, stop, /). Because the parameter kinds differ, list.index() does not structurally match the protocol method: the protocol permits keyword calls such as index(v, start=1), which list rejects. As a result, list[str] fails to satisfy the protocol constraint.
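The mismatch can be seen at runtime by comparing parameter kinds. This is a minimal sketch: protocol_index and list_like_index are hypothetical stand-ins that copy the two signatures quoted above, not the real pandas or builtin definitions.

```python
import inspect
from typing import Any


# Hypothetical stand-in for the pandas 2.x protocol method: only `self` and
# `value` sit before the `/`, so `start` and `stop` may be passed by keyword.
def protocol_index(self, value: Any, /, start: int = 0, stop: int = ...) -> int: ...


# Hypothetical stand-in with the shape of list.index(): every parameter is
# position-only because the `/` comes last.
def list_like_index(self, value: Any, start: int = 0, stop: int = ..., /) -> int: ...


def kinds(func) -> dict[str, str]:
    """Map each parameter name to the name of its inspect parameter kind."""
    return {name: p.kind.name for name, p in inspect.signature(func).parameters.items()}


protocol_kinds = kinds(protocol_index)
list_kinds = kinds(list_like_index)

# The real list.index() rejects keyword arguments, which is the call shape
# the protocol signature would otherwise allow.
try:
    ["a", "b"].index("a", start=0)
    keyword_call_rejected = False
except TypeError:
    keyword_call_rejected = True
```

Here protocol_kinds reports start and stop as POSITIONAL_OR_KEYWORD while list_kinds reports them as POSITIONAL_ONLY, which is the difference a structural check trips over.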

The Fix

Added corrected pandas stubs to pyrefly's bundled typeshed, matching the fix on the pandas main branch (targeted for pandas 2.3/3.0). The corrected SequenceNotStr protocol in pandas/_typing.pyi now defines:

def index(self, value: Any, start: int = ..., stop: int = ..., /) -> int

All parameters are position-only, matching the list.index() signature. This allows list[str] to satisfy the SequenceNotStr[str] protocol used by DataFrame's columns and index parameters.
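As a rough illustration of the corrected shape, here is a simplified protocol sketch. SequenceNotStrSketch and make_labels are hypothetical names for this example; the real SequenceNotStr in pandas has more members (including the ones that exclude str), which are omitted here.

```python
from typing import Any, Iterator, Protocol, TypeVar

T_co = TypeVar("T_co", covariant=True)


class SequenceNotStrSketch(Protocol[T_co]):
    """Simplified, hypothetical stand-in for the corrected protocol."""

    def __getitem__(self, index: int, /) -> T_co: ...
    def __len__(self) -> int: ...
    def __iter__(self) -> Iterator[T_co]: ...

    # All parameters are position-only, as in the corrected stub.
    def index(self, value: Any, start: int = ..., stop: int = ..., /) -> int: ...


def make_labels(columns: SequenceNotStrSketch[str]) -> list[str]:
    """Accept anything matching the sketch protocol and copy it to a list."""
    return [label for label in columns]


# With the position-only index() signature, a type checker should be able to
# accept a plain list[str] here.
labels = make_labels(["a", "b", "c"])
```

The runtime call succeeds either way, since Python does not enforce protocols at runtime; the point of the sketch is the signature a static checker compares against list.index().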

The stubs include:

  • pandas/_typing.pyi - Corrected SequenceNotStr protocol and type aliases
  • pandas/__init__.pyi - DataFrame, Series exports and common functions
  • pandas/core/frame.pyi - DataFrame class definition
  • pandas/core/series.pyi - Series class definition
  • METADATA.toml - Package metadata
  • pandas/core/__init__.pyi - Core module stub

Total: 6 stub files (82 lines) + test (140 lines)

Test Plan

  1. Added regression tests in pyrefly/lib/test/pandas/dataframe.rs, run with:
cargo test test_dataframe

Result: passes

References:

  • https://github.com/pandas-dev/pandas/issues/56995
  • https://github.com/pandas-dev/pandas/blob/main/pandas/_typing.py

jackulau avatar Nov 22 '25 16:11 jackulau

Thanks for the PR, and for looking into Pandas to diagnose our issues!

Requesting review from @rchen152 because I think she has the best context on stub bundling - as a rule of thumb I don't think we would want to override released stubs, but if the typeshed stubs are badly broken for a library this important we might need to consider it.

We also might be able to hack Pyrefly internals a bit if necessary as an alternative - in theory I think an edge case in is_subset_eq might actually be easier to maintain than stub overrides.

stroxler avatar Nov 22 '25 17:11 stroxler

Hi @stroxler,

Thanks for the suggestion! I added an edge case in is_subset_eq that allows position-only parameters (PosOnly) to match regular positional parameters (Pos) in protocol checking.
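To make the idea concrete, here is a toy Python model of such a relaxation. It is not Pyrefly's is_subset_eq (which is written in Rust and also compares types); kinds_match and both stand-in functions are hypothetical, and the model compares parameter kinds only.

```python
import inspect
from inspect import Parameter
from typing import Any


# Hypothetical stand-ins for the two signatures discussed in this PR.
def protocol_index(self, value: Any, /, start: int = 0, stop: int = ...) -> int: ...
def list_like_index(self, value: Any, start: int = 0, stop: int = ..., /) -> int: ...


def kinds_match(impl, proto, *, relaxed: bool) -> bool:
    """Toy check: can `impl` stand in for `proto`, looking only at parameter kinds?"""
    impl_params = list(inspect.signature(impl).parameters.values())
    proto_params = list(inspect.signature(proto).parameters.values())
    if len(impl_params) != len(proto_params):
        return False
    for got, want in zip(impl_params, proto_params):
        if got.kind == want.kind:
            continue
        # An implementation that also accepts keywords is more permissive
        # than a position-only protocol parameter, so this is always fine.
        if got.kind is Parameter.POSITIONAL_OR_KEYWORD and want.kind is Parameter.POSITIONAL_ONLY:
            continue
        # The relaxation: let a position-only implementation parameter
        # satisfy a regular positional parameter in the protocol.
        if relaxed and got.kind is Parameter.POSITIONAL_ONLY and want.kind is Parameter.POSITIONAL_OR_KEYWORD:
            continue
        return False
    return True


strict_result = kinds_match(list_like_index, protocol_index, relaxed=False)
relaxed_result = kinds_match(list_like_index, protocol_index, relaxed=True)
```

In this model the strict check rejects the list-shaped signature and the relaxed check accepts it. The trade-off is that a keyword call made through the protocol would still fail at runtime, so a relaxation like this gives up some strictness for compatibility with the released stubs.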

I also added test_dataframe_with_broken_stubs, which uses the actual broken pandas 2.x stubs and verifies that list[str] correctly matches SequenceNotStr[Any].

Testing

cargo test test_dataframe

Result: passes

jackulau avatar Nov 22 '25 20:11 jackulau

@stroxler has imported this pull request. If you are a Meta employee, you can view this in D89300691.

meta-codesync[bot] avatar Dec 16 '25 18:12 meta-codesync[bot]

@stroxler merged this pull request in facebook/pyrefly@ec5bed344da722bdeaaecefea2a58fba6757fdfe.

meta-codesync[bot] avatar Dec 17 '25 05:12 meta-codesync[bot]