Question about pandera

When using Pandera, I've tried to reuse Column objects for multiple rows. However, since there are side effects on any Column objects used in the DataFrameSchema constructor, I find myslef having to do a workaround using lambdas.

For example, the following code triggers a warning:

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    'positive': [1],
    'also_positive': [4],
    'negative': [-1]
})

positive_check = pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))
negative_check = pa.Column(pa.Int, pa.Check.less_than(0))

print(positive_check)

schema = pa.DataFrameSchema({
    'positive': positive_check,
    'also_positive': positive_check,
    'negative': negative_check
})

print(positive_check)

Output

Before creating schema: <Schema Column(name=None, type=int)>
/usr/local/lib/python3.9/site-packages/pandera/schemas.py:220: UserWarning: resetting column for <Schema Column(name=positive, type=int)> to 'also_positive'.
  warnings.warn(
After creating schema: <Schema Column(name=also_positive, type=int)>

Whereas the following code runs with no issues:

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    'positive': [1, 2, 3],
    'also_positive': [4, 5, 6],
    'negative': [-1, -2, -3]
})

positive_check = lambda: pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))
negative_check = lambda: pa.Column(pa.Int, pa.Check.less_than(0))

schema = pa.DataFrameSchema({
    'positive': positive_check(),
    'also_positive': positive_check(),
    'negative': negative_check()
})

I find it to be a common use case to reuse checks. Is this really the intended behavior of the DataFrameSchema constructor? I realize that changing this would be a breaking change but is it something you would consider for a future v1.0?

Jun 04 '21 20:06 Anders-E

hey @Anders-E, the user warning you're seeing is because checks should be reusable but by assigning a Column object to a variable

positive_check = pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))

You're basically re-using the entire column definition for two different columns (a Column is a schema-like object that holds metadata about the column, not a Check). There are several ways of applying the same set of checks to a group of columns. I'll list them in order of my personal preference on how I use pandera:

Specify a different `Column` object for each column

For simple cases where the column definition doesn't involve a lot of checks/option kwargs

schema = pa.DataFrameSchema({
    'positive': pa.Column(pa.Int, pa.Check.ge(0)),  # ge is the shorter hand for greater_than_or_equal_to
    'also_positive': pa.Column(pa.Int, pa.Check.ge(0)),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

# here you can even define a variable for the common arguments
positive_col_args = [pa.Int, pa.Check.ge(0)]

schema = pa.DataFrameSchema({
    'positive': pa.Column(*positive_col_args),
    'also_positive': pa.Column(*positive_col_args),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Use `regex=True`

This applies a single column definition to a set of columns, which are identified at validation time with the column key by a regular expression.

schema = pa.DataFrameSchema({
    '(positive|also_positive)': pa.Column(pa.Int, pa.Check.ge(0), regex=True),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Use `partial`

This is similar to the lambda solution: it guarantees that the column objects are unique, while allowing you to differentiate positive and also_positive with different kwarg options.

from functools import partial

PositiveColumn = partial(Column, pa.Int, pa.Check.ge(0))

schema = pa.DataFrameSchema({
    'positive': PositiveColumn(coerce=True),
    'also_positive': PositiveColumn(coerce=False),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Reusing `Check`s

As a related but similar topic, reusing checks is supported by pandera, just keep in mind they need to be supplied to different Column object instances

reusable_checks = [
    pa.Check.ge(0),
    ... # a long list of checks
]

schema = pa.DataFrameSchema({
    'positive': pa.Column(pa.Int, reusable_checks),
    'also_positive': pa.Column(pa.Int, reusable_checks),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Let me know if you have any other questions!

Jun 06 '21 15:06 cosmicBboy

Thank you for the very detailed reply @cosmicBboy !

This is indeed very helpful for my use cases, and I particularly like the solution using functools.partial.

So long story short, Column objects are not supposed to be reused. Do you think it would be a good idea to elaborate on this in the user warning?

Jun 08 '21 20:06 Anders-E

@cosmicBboy Do you think the user warning could be improved (see my comment above)? If not I think this issue can be closed.

Jul 06 '21 13:07 Anders-E

hey @Anders-E, a contribution on that front would be very welcome. I think better than improving the warning message, we can make this case invalid:

in DataFrameSchema.__init__, check whether a Column object is used more than once
if so, raise a SchemaInitError saying something to the effect of "you cannot use a Column object multiple times in the same DataFrameSchema.".

Jul 06 '21 13:07 cosmicBboy

I think that's actually a better idea yes, cause I really can't see why you would want to proceed with using the same Column multiple times.

Do you mind if I go ahead and implement it?

Jul 06 '21 14:07 Anders-E

yes, please do! make sure to base your changes off of the dev branch and make a PR against that branch when it's ready!

Also be sure to check out the contributing guide for recommended ways of setting up your local env, running pre-commit hooks, and tests.

Let me know if you have any other questions!

Jul 06 '21 18:07 cosmicBboy

Thank you very much and thank you for the quick replies!

Jul 07 '21 07:07 Anders-E

Hi! I just looked up this bug after running into this warning. Thanks for your work on all this @cosmicBboy

Can we make it so that on DataFrameSchema init, the Column is copied into a new Column with a modified name, instead of modifying the Column in place? This was very surprising behavior to me. This would also be in line with how pandas does it, where the names of Series is not modified when you pass them into a Dataframe:

I'd guess the cost of copying a Column wouldn't be very much?

Or, I think this would be even more disruptive, but I think it would make more sense, consider if Column's didn't even have a name attribute, all they did was define what was INSIDE a column, but didn't know anything about where the data is actually stored.

Thank you!

Apr 14 '22 14:04 NickCrews

I completely agree with you @NickCrews , the behavior feels strange to me as it goes against the immutable style found in pandas.

This is obviously not my project or anything, but I would highly prefer having the Columns copied. As to dropping the name attribute, I don't really know what it's used for under the hood so can't really comment on that.

May 04 '22 08:05 Anders-E

Agreed on copying the Column (I don't think the cost of copying would be high)

I think for this issue, let's do the following:

Change Column.set_name so that it makes a copy of the Column object
Remove the warning, since the original column is no longer modified in place.

As to dropping the name attribute, I don't really know what it's used for under the hood so can't really comment on that.

consider if Column's didn't even have a name attribute, all they did was define what was INSIDE a column, but didn't know anything about where the data is actually stored.

So schema components actually have slightly different semantics than schemas... they take in a data structure and validate some part of it, so you can actually use Column objects to validate a dataframe by itself (without having to use it within a DataFrameSchema), see these docs. This is the reason it has a name attribute.

May 04 '22 13:05 cosmicBboy

@Anders-E @NickCrews do either of you want to make a PR for ^^?

May 04 '22 13:05 cosmicBboy

I could do a PR for this. That change to Column.set_name would be breaking for users. Do you want any sort of warning or deprecation schedule? ie leave it as is, add a warning, and in 6 months actually do the change? IDK what Pandera's policy on this is.

That makes sense why Columns need to store the name, so that they can lookup the right column from a DF, thanks! I think with this change a lot of my friction should disappear.

May 04 '22 16:05 NickCrews

That change to Column.set_name would be breaking for users

would it? how?

May 04 '22 19:05 cosmicBboy

thanks @NickCrews !

I'm not super concerned that Column.set_name will be a breaking change... as you and @Anders-E pointed out, in-plce mutation of the Column when it's fed into a DataFrameSchema is a bug, in that it's sort of unexpected/unintuitive behavior.

May 04 '22 19:05 cosmicBboy

I was thinking that someone might be directly calling Column.set_name, since that is part of the public API, outside of the context of a DataFreSchema that we've been talking about

May 04 '22 23:05 NickCrews

I not super concerned that this will have a big impact, since I'm not sure how many people using Columns by themselves rely on the fact that it's mutated in the context of a DataFrameSchema... but you're right, to be safe let's manually to a copy.deepcopy of the Column before calling set_name here: https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L282-L289

May 06 '22 13:05 cosmicBboy

Oops sorry I didn't see your comment "how?". I wasn't thinking of people relying on Column getting mutated in DataFrameSchema, I was thinking of this:

c1 = Column(name="name1")
c1.set_name("name2")
# keep doing stuff with c1, expecting it to be named `name2`.
# After this change though, it will still be named `name1`

How do you want to deal with this?

May 06 '22 17:05 NickCrews

gotcha, agreed, so let's do what I mentioned in https://github.com/pandera-dev/pandera/issues/511#issuecomment-1119620703:

to be safe let's manually to a copy.deepcopy of the Column before calling set_name here: https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L282-L289

May 06 '22 20:05 cosmicBboy

@cosmicBboy MANY months later 🤣 , but how's that PR look?

Dec 14 '22 22:12 NickCrews

Thanks @NickCrews, and no worries about the delay!

Looks like the test tests/core/test_schemas.py::test_dataframe_reset_column_name needs to be modified, I don't think a warning is raised anymore.

Also, working on some CI issues: https://github.com/unionai-oss/pandera/pull/1056, may need to rebase on that once it's merged

Dec 15 '22 19:12 cosmicBboy

pandera
pandera copied to clipboard

Improve warning message when using the same Column object multiple times in DataFrameSchema

Question about pandera

Output

Specify a different `Column` object for each column

Use `regex=True`

Use `partial`

Reusing `Check`s

pandera pandera copied to clipboard

Improve warning message when using the same Column object multiple times in DataFrameSchema

Question about pandera

Output

Specify a different Column object for each column

Use regex=True

Use partial

Reusing Checks

pandera
pandera copied to clipboard

Specify a different `Column` object for each column

Use `regex=True`

Use `partial`

Reusing `Check`s