pandera
`validate` is slow when coercing several hundred columns.
Describe the bug
Validating against a SchemaModel with several hundred columns and coerce enabled takes a lot of time, even if the dataframe is already valid. The slowdown doesn't occur without coerce.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample, a copy-pastable example
import pandas as pd
import pandera as pa
import numpy as np
from pandera.typing import Series


class TestCoerce(pa.SchemaModel):
    # Raw string so that "\d" is not treated as an (invalid) string escape.
    a: Series[float] = pa.Field(alias=r"a\d+", regex=True, coerce=True)


class TestNoCoerce(pa.SchemaModel):
    a: Series[float] = pa.Field(alias=r"a\d+", regex=True, coerce=False)


def gen_df(
    value: float = 1.618,
    col_number: int = 40,
    row_number: int = int(1e6),
):
    # Build col_number float columns named a0, a1, ... all holding the same value.
    return pd.DataFrame(
        {
            f"a{i}": np.full(row_number, value)
            for i in range(col_number)
        }
    )


df = gen_df()
TestCoerce.validate(df)
In this gist you will find a script that compares execution time with and without coerce: https://gist.github.com/koalp/0e70303c014712a6f7f790b5743482a3
Expected behavior
Coercion shouldn't take so much time when the dtype is already correct. It would be even better if it also stayed fast when all the columns must be converted.
Desktop (please complete the following information):
- OS: linux
- Python 3.9, 3.10
- Pandera 0.13.4
Additional context
After running benchmarks, I found out that the `__setattr__` function from pandas (replacing a column) takes a lot of time to run (Python 3.9).
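To get a rough feel for the cost of replacing columns, here is a minimal pandas-only timing sketch. It is not the benchmark from the gist: the helper name `time_column_reassignment` and the sizes are made up for illustration, and it times plain `df[name] = ...` assignment, which is only assumed to be comparable to what pandera does internally.

```python
import time

import numpy as np
import pandas as pd


def time_column_reassignment(n_cols: int = 40, n_rows: int = 10_000) -> dict:
    """Time rewriting every column against merely reading every column's dtype.

    Hypothetical helper for illustration; not part of pandas or pandera.
    """
    df = pd.DataFrame({f"a{i}": np.full(n_rows, 1.618) for i in range(n_cols)})

    start = time.perf_counter()
    for name in df.columns:
        # Coerce and write back unconditionally, although the dtype is already float64.
        df[name] = df[name].astype("float64")
    reassign_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for name in df.columns:
        # Only look at the dtype; nothing is coerced or written back.
        _ = df[name].dtype
    inspect_seconds = time.perf_counter() - start

    return {"reassign": reassign_seconds, "inspect_only": inspect_seconds}


timings = time_column_reassignment()
print(timings)
```

The absolute numbers depend on the machine and the pandas version, so they are only useful for comparing the two loops against each other.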
If I modify pandera to call setattr only when the result from try_coercion differs from the previous column, it solves my issue, as I currently have at most one column that needs to be changed (wrong dtype). However, it isn't a generic solution, as it doesn't help when many columns have the wrong dtype.
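A rough sketch of that workaround in plain pandas, outside of pandera: `assign_only_if_changed` is a hypothetical helper, and `astype` stands in for pandera's internal coercion, which is an assumption. Note that the coercion itself still runs for every column; only the write-back is skipped.

```python
import pandas as pd


def assign_only_if_changed(df: pd.DataFrame, name: str, dtype: str) -> bool:
    """Coerce one column and write it back only if coercion changed it.

    Returns True when the column was reassigned. Hypothetical helper.
    """
    original = df[name]
    coerced = original.astype(dtype)  # stand-in for pandera's coercion step
    if coerced.dtype == original.dtype and coerced.equals(original):
        return False  # same dtype and same values: skip the costly assignment
    df[name] = coerced
    return True


df = pd.DataFrame({"a0": [1.0, 2.0], "a1": [1, 2]})
a0_changed = assign_only_if_changed(df, "a0", "float64")  # already float64
a1_changed = assign_only_if_changed(df, "a1", "float64")  # integer column
```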
On discord, a modification was suggested:
I think an alternative and potentially faster solution would be to check if the dtype of `obj[matched_colname]` is the same as `col_schema.dtype`. If so, then coercion isn't necessary. If not, then apply coercion and reassign the column.
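The suggestion could look roughly like this in plain pandas. This is only a sketch under assumptions: `coerce_mismatched_columns` is a hypothetical helper, a single NumPy target dtype stands in for `col_schema.dtype`, and `astype` stands in for pandera's coercion.

```python
import numpy as np
import pandas as pd


def coerce_mismatched_columns(df: pd.DataFrame, target_dtype: str) -> list:
    """Coerce in place only the columns whose dtype differs from target_dtype.

    Returns the names of the columns that were reassigned. Hypothetical helper.
    """
    target = np.dtype(target_dtype)
    changed = []
    for name in df.columns:
        if df[name].dtype == target:
            continue  # dtype already matches: no coercion, no reassignment
        df[name] = df[name].astype(target)
        changed.append(name)
    return changed


df = pd.DataFrame({"a0": [1.0, 2.0], "a1": [1, 2], "a2": [0.5, 1.5]})
changed = coerce_mismatched_columns(df, "float64")
```

Unlike comparing the coerced result with the original column, this check avoids the coercion work entirely for columns that already have the expected dtype. It does not help when most columns need converting.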
Thanks for opening this, @koalp! I think a good solution here is to check whether the type of the incoming data matches the expected type, and to coerce/re-assign only the columns that don't match.
Will circle back to this issue once https://github.com/unionai-oss/pandera/pull/913 is merged
@cosmicBboy any chance of revisiting this? We are using coerce extensively in our codebase and would be great to improve validation time.
Hi @sk- can you provide a schema and data examples that roughly match what you're using in your codebase? Doesn't have to be exactly the same, but will help when benchmarking
friendly ping @sk-