pandera
Add option for specifying labels or indices to refer to columns in Fields.
When a DataFrame's column names are arbitrary but the positions of the columns are meaningful, it would be nice to have a way to refer to these columns by index instead of by name when creating Schema Models. Example:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series

class IntStrFrame(pa.SchemaModel):
    first_col: Series[int]
    second_col: Series[str]

df1 = pd.DataFrame({'cola': [1, 2, 3], 'colb': ['a', 'b', 'c']})
df2 = pd.DataFrame({'1': [1, 2, 3], '2': ['a', 'b', 'c']})
df3 = pd.DataFrame({'1': ['a', 'b', 'c'], '2': [1, 2, 3]})

@pa.check_types
def some_func(df: DataFrame[IntStrFrame]):
    return df

some_func(df1)  # <- Should pass: the first column is int and the second is str
some_func(df2)  # <- Should pass: the first column is int and the second is str
some_func(df3)  # <- Should fail: the first column is str
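To make the requested behaviour concrete, here is a minimal sketch of a positional dtype check written with plain pandas. It is not pandera functionality: `check_by_position` and `POSITIONAL_CHECKS` are hypothetical names made up for this illustration, and the sketch only covers the int/str case from the example above.

```python
import pandas as pd
from pandas.api import types as pdt

# Hypothetical helper, not part of pandera: one dtype predicate per column position.
POSITIONAL_CHECKS = [pdt.is_integer_dtype, pdt.is_string_dtype]


def check_by_position(df: pd.DataFrame) -> pd.DataFrame:
    """Raise TypeError if a column's dtype fails the predicate at its position."""
    if df.shape[1] != len(POSITIONAL_CHECKS):
        raise TypeError(
            f"expected {len(POSITIONAL_CHECKS)} columns, got {df.shape[1]}"
        )
    for position, check in enumerate(POSITIONAL_CHECKS):
        # Select by position, so the column name never matters.
        column = df.iloc[:, position]
        if not check(column):
            raise TypeError(
                f"column at position {position} ({df.columns[position]!r}) "
                f"has unexpected dtype {column.dtype}"
            )
    return df


df1 = pd.DataFrame({'cola': [1, 2, 3], 'colb': ['a', 'b', 'c']})
df3 = pd.DataFrame({'1': ['a', 'b', 'c'], '2': [1, 2, 3]})

check_by_position(df1)  # passes: int column first, str column second
try:
    check_by_position(df3)  # fails: the first column holds strings
except TypeError as exc:
    print(exc)
```

The idea is only that columns are selected with `iloc`, so names such as 'cola' or '1' play no role in the outcome.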
Proposed Schema Model solution by @cosmicBboy:
class Schema(pa.SchemaModel):
    col1: Series[int] = Field(..., indexes=[0])
    col2: Series[str] = Field(..., indexes=[1])
This proposal is in line with how the 'regex' argument in 'Field' works, since it can match multiple columns. The solution can also be extended to accept multiple labels instead of indices, for cases where several columns share the same restrictions:
class Schema(pa.SchemaModel):
    col_type_1: Series[int] = Field(..., labels=['foo', 'bar'])
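As a rough sketch of how the proposed 'indexes' and 'labels' arguments might be resolved to concrete column names, assuming exactly one of the two is given per field: `resolve_columns` is a hypothetical helper written for this illustration, not an existing pandera function, and the error handling is one possible choice.

```python
from typing import List, Optional, Sequence


def resolve_columns(
    columns: Sequence[str],
    indexes: Optional[Sequence[int]] = None,
    labels: Optional[Sequence[str]] = None,
) -> List[str]:
    """Return the column names selected by position (indexes) or by name (labels)."""
    # Assumption for this sketch: a field uses either indexes or labels, never both.
    if (indexes is None) == (labels is None):
        raise ValueError("pass exactly one of 'indexes' or 'labels'")
    if indexes is not None:
        # An IndexError here means a position lies outside the frame.
        return [columns[i] for i in indexes]
    missing = [label for label in labels if label not in columns]
    if missing:
        raise KeyError(f"labels not found in columns: {missing}")
    return list(labels)


print(resolve_columns(['cola', 'colb'], indexes=[0]))  # ['cola']
print(resolve_columns(['foo', 'bar', 'baz'], labels=['foo', 'bar']))  # ['foo', 'bar']
```

Once a field is resolved to a list of names, the same checks could be applied to every matched column, which is similar in spirit to how a regex field applies to all columns it matches.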
Hi @Fransisnk this use case makes sense!
Welcoming PRs to fulfill this use case, adding the help wanted tag