pandera

Add option for referring to columns by label or index in Fields.

Open · Fransisnk opened this issue 2 years ago • 1 comment

When a DataFrame's column names are arbitrary but the column positions are meaningful, it would be nice to have a way to refer to columns by index instead of by name when creating Schema Models. Example:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class IntStrFrame(pa.SchemaModel):
    first_col: Series[int]
    second_col: Series[str]


df1 = pd.DataFrame({'cola': [1, 2, 3], 'colb': ['a', 'b', 'c']})
df2 = pd.DataFrame({'1': [1, 2, 3], '2': ['a', 'b', 'c']})
df3 = pd.DataFrame({'1': ['a', 'b', 'c'], '2': [1, 2, 3]})


@pa.check_types
def some_func(df: DataFrame[IntStrFrame]):
    return df


some_func(df1)  # <- Should pass: the first column is of type int and the second is of type str
some_func(df2)  # <- Should pass: the first column is of type int and the second is of type str
some_func(df3)  # <- Should fail: the first column is of type str
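
To make the requested behaviour concrete, here is a minimal sketch in plain pandas that checks dtypes by column position rather than by name. It is not pandera API: `POSITIONAL_CHECKS` and `matches_by_position` are hypothetical names introduced only for this illustration.

```python
import pandas as pd
from pandas.api import types as pdt

# Hypothetical positional spec: column position -> dtype predicate.
POSITIONAL_CHECKS = {
    0: pdt.is_integer_dtype,  # first column must hold integers
    1: pdt.is_string_dtype,   # second column must hold strings
}


def matches_by_position(df: pd.DataFrame) -> bool:
    """Return True if every listed position exists and passes its dtype check."""
    if max(POSITIONAL_CHECKS) >= df.shape[1]:
        return False  # the frame has too few columns
    return all(check(df.iloc[:, pos]) for pos, check in POSITIONAL_CHECKS.items())


df1 = pd.DataFrame({'cola': [1, 2, 3], 'colb': ['a', 'b', 'c']})
df2 = pd.DataFrame({'1': [1, 2, 3], '2': ['a', 'b', 'c']})
df3 = pd.DataFrame({'1': ['a', 'b', 'c'], '2': [1, 2, 3]})

# Column names are ignored; only position and dtype matter.
results = [matches_by_position(df) for df in (df1, df2, df3)]
```

Under these assumptions `df1` and `df2` match and `df3` does not, which is the outcome the issue asks the schema to produce.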

Proposed Schema Model solution by @cosmicBboy:

class Schema(pa.SchemaModel):
    col1: Series[int] = Field(..., indexes=[0])
    col2: Series[str] = Field(..., indexes=[1])

This proposal is in line with how the 'regex' argument in 'Field' works, since it can also match multiple columns. The solution can also be extended to accept multiple labels instead of indices, for cases where several columns share the same restrictions:

class Schema(pa.SchemaModel):
    col_type_1: Series[int] = Field(..., labels=['foo', 'bar'])
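
Until something like the proposed `indexes` argument exists, one possible workaround is to rename columns by position before validating, so that a name-based schema applies. The sketch below uses only pandas; `rename_by_position` is a hypothetical helper, not part of pandera.

```python
import pandas as pd


def rename_by_position(df: pd.DataFrame, names: list) -> pd.DataFrame:
    """Return a new frame whose leading columns are renamed to `names`."""
    if len(names) > df.shape[1]:
        raise ValueError("more names than columns")
    # Keep the original labels for any columns beyond the renamed ones.
    new_columns = list(names) + list(df.columns[len(names):])
    return df.set_axis(new_columns, axis=1)


df2 = pd.DataFrame({'1': [1, 2, 3], '2': ['a', 'b', 'c']})
renamed = rename_by_position(df2, ['first_col', 'second_col'])
```

The renamed frame carries the field names that `IntStrFrame` expects, so it could then be passed to `some_func`. The cost is that the original labels are lost in the validated frame, which is part of why a built-in option would be preferable.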

Fransisnk · Jan 11 '22 16:01

Hi @Fransisnk this use case makes sense!

Welcoming PRs to fulfill this use case, adding the help wanted tag

cosmicBboy · Mar 27 '22 13:03