`validate` is slow when coercing several hundred columns.

Open locnide opened this issue 2 years ago • 4 comments

Describe the bug

Validating against a SchemaModel with several hundred columns takes a lot of time when coerce is enabled, even if the dataframe is already valid. The slowdown doesn't occur when coerce is disabled.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • [ ] (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa
import numpy as np
from pandera.typing import Series

class TestCoerce(pa.SchemaModel):
    # Raw string so that \d is passed to the regex engine unchanged.
    a: Series[float] = pa.Field(alias=r"a\d+", regex=True, coerce=True)


class TestNoCoerce(pa.SchemaModel):
    a: Series[float] = pa.Field(alias=r"a\d+", regex=True, coerce=False)


def gen_df(
    value: float = 1.618,
    col_number: int = 40,
    row_number: int = int(1e6),
):

    return pd.DataFrame(
        {
            f"a{i}": np.full(row_number, value)
            for i in range(col_number)
        }
    )

df = gen_df()
TestCoerce.validate(df)

This gist contains a script that compares execution time with and without coerce: https://gist.github.com/koalp/0e70303c014712a6f7f790b5743482a3

Expected behavior

Coercion should not take much time when the dtype is already correct. Ideally, it would also not be slow when all the columns must be converted.

Desktop (please complete the following information):

  • OS: linux
  • Python 3.9, 3.10
  • Pandera 0.13.4

Additional context

After running benchmarks, I found that the __setattr__ function from pandas (replacing a column) takes a lot of time to run (Python 3.9). If I modify pandera to call setattr only when the result from try_coercion differs from the previous column, it solves my issue, as I currently have at most one column that needs to be changed (wrong dtype). However, it isn't a generic solution, as it doesn't help when many columns have the wrong dtype.

On discord, a modification was suggested:

I think an alternative and potentially faster solution would be to check if the dtype of obj[matched_colname] is the same as col_schema.dtype. If so, then coercion isn't necessary. If not, then apply coercion and reassign the column.
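The suggestion above can be sketched in plain pandas, outside of pandera. This is only an illustration of the idea, not pandera's actual implementation: the helper name `coerce_matching_columns` and its parameters are hypothetical, and it assumes a NumPy-compatible target dtype and full-match regex semantics for column names.

```python
import re

import numpy as np
import pandas as pd


def coerce_matching_columns(df: pd.DataFrame, pattern: str, dtype) -> list:
    """Hypothetical helper: coerce only the regex-matched columns whose
    dtype differs from the target, and return the names it reassigned."""
    target = np.dtype(dtype)
    regex = re.compile(pattern)
    reassigned = []
    for name in list(df.columns):
        if not regex.fullmatch(str(name)):
            continue  # column is not covered by this schema field
        if df[name].dtype == target:
            continue  # dtype already matches: skip coercion and reassignment
        df[name] = df[name].astype(target)  # only mismatched columns pay this cost
        reassigned.append(name)
    return reassigned


df = pd.DataFrame(
    {
        "a0": np.full(3, 1.618),    # already float64, left untouched
        "a1": np.array([1, 2, 3]),  # integer column, needs coercion
        "b": ["x", "y", "z"],       # does not match the pattern
    }
)
changed = coerce_matching_columns(df, r"a\d+", "float64")
```

With this approach, a dataframe that is already valid triggers no column reassignment at all. When many columns have the wrong dtype, every one of them is still converted and reassigned, so the general case raised in the issue is not addressed.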

locnide, Jan 20 '23 13:01

Thanks for opening this @koalp! I think a good solution here is to check whether the type of the incoming data matches the expected type, and only coerce/re-assign columns that don't match.

Will circle back to this issue once https://github.com/unionai-oss/pandera/pull/913 is merged

cosmicBboy, Jan 20 '23 14:01

@cosmicBboy any chance of revisiting this? We are using coerce extensively in our codebase and it would be great to improve validation time.

sk-, Feb 05 '25 03:02

Hi @sk-, can you provide a schema and data examples that roughly match what you're using in your codebase? They don't have to be exactly the same, but they will help when benchmarking.

cosmicBboy, Feb 05 '25 14:02

friendly ping @sk-

cosmicBboy, Feb 14 '25 22:02