pandera
pandera copied to clipboard
Improve warning message when using the same Column object multiple times in DataFrameSchema
Question about pandera
When using Pandera, I've tried to reuse Column objects for multiple rows. However, since there are side effects on any Column objects used in the DataFrameSchema constructor, I find myslef having to do a workaround using lambdas.
For example, the following code triggers a warning:
import pandas as pd
import pandera as pa
df = pd.DataFrame({
'positive': [1],
'also_positive': [4],
'negative': [-1]
})
positive_check = pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))
negative_check = pa.Column(pa.Int, pa.Check.less_than(0))
print(positive_check)
schema = pa.DataFrameSchema({
'positive': positive_check,
'also_positive': positive_check,
'negative': negative_check
})
print(positive_check)
Output
Before creating schema: <Schema Column(name=None, type=int)>
/usr/local/lib/python3.9/site-packages/pandera/schemas.py:220: UserWarning: resetting column for <Schema Column(name=positive, type=int)> to 'also_positive'.
warnings.warn(
After creating schema: <Schema Column(name=also_positive, type=int)>
Whereas the following code runs with no issues:
import pandas as pd
import pandera as pa
df = pd.DataFrame({
'positive': [1, 2, 3],
'also_positive': [4, 5, 6],
'negative': [-1, -2, -3]
})
positive_check = lambda: pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))
negative_check = lambda: pa.Column(pa.Int, pa.Check.less_than(0))
schema = pa.DataFrameSchema({
'positive': positive_check(),
'also_positive': positive_check(),
'negative': negative_check()
})
I find it to be a common use case to reuse checks. Is this really the intended behavior of the DataFrameSchema constructor? I realize that changing this would be a breaking change but is it something you would consider for a future v1.0?
hey @Anders-E, the user warning you're seeing is because checks should be reusable but by assigning a Column object to a variable
positive_check = pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))
You're basically re-using the entire column definition for two different columns (a Column is a schema-like object that holds metadata about the column, not a Check). There are several ways of applying the same set of checks to a group of columns. I'll list them in order of my personal preference on how I use pandera:
Specify a different Column object for each column
For simple cases where the column definition doesn't involve a lot of checks/option kwargs
schema = pa.DataFrameSchema({
'positive': pa.Column(pa.Int, pa.Check.ge(0)), # ge is the shorter hand for greater_than_or_equal_to
'also_positive': pa.Column(pa.Int, pa.Check.ge(0)),
'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})
# here you can even define a variable for the common arguments
positive_col_args = [pa.Int, pa.Check.ge(0)]
schema = pa.DataFrameSchema({
'positive': pa.Column(*positive_col_args),
'also_positive': pa.Column(*positive_col_args),
'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})
Use regex=True
This applies a single column definition to a set of columns, which are identified at validation time with the column key by a regular expression.
schema = pa.DataFrameSchema({
'(positive|also_positive)': pa.Column(pa.Int, pa.Check.ge(0), regex=True),
'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})
Use partial
This is similar to the lambda solution: it guarantees that the column objects are unique, while allowing you to differentiate positive and also_positive with different kwarg options.
from functools import partial
PositiveColumn = partial(Column, pa.Int, pa.Check.ge(0))
schema = pa.DataFrameSchema({
'positive': PositiveColumn(coerce=True),
'also_positive': PositiveColumn(coerce=False),
'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})
Reusing Checks
As a related but similar topic, reusing checks is supported by pandera, just keep in mind they need to be supplied to different Column object instances
reusable_checks = [
pa.Check.ge(0),
... # a long list of checks
]
schema = pa.DataFrameSchema({
'positive': pa.Column(pa.Int, reusable_checks),
'also_positive': pa.Column(pa.Int, reusable_checks),
'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})
Let me know if you have any other questions!
Thank you for the very detailed reply @cosmicBboy !
This is indeed very helpful for my use cases, and I particularly like the solution using functools.partial.
So long story short, Column objects are not supposed to be reused. Do you think it would be a good idea to elaborate on this in the user warning?
@cosmicBboy Do you think the user warning could be improved (see my comment above)? If not I think this issue can be closed.
hey @Anders-E, a contribution on that front would be very welcome. I think better than improving the warning message, we can make this case invalid:
- in
DataFrameSchema.__init__, check whether aColumnobject is used more than once - if so, raise a
SchemaInitErrorsaying something to the effect of "you cannot use a Column object multiple times in the same DataFrameSchema.".
I think that's actually a better idea yes, cause I really can't see why you would want to proceed with using the same Column multiple times.
Do you mind if I go ahead and implement it?
yes, please do! make sure to base your changes off of the dev branch and make a PR against that branch when it's ready!
Also be sure to check out the contributing guide for recommended ways of setting up your local env, running pre-commit hooks, and tests.
Let me know if you have any other questions!
Thank you very much and thank you for the quick replies!
Hi! I just looked up this bug after running into this warning. Thanks for your work on all this @cosmicBboy
Can we make it so that on DataFrameSchema init, the Column is copied into a new Column with a modified name, instead of modifying the Column in place? This was very surprising behavior to me. This would also be in line with how pandas does it, where the names of Series is not modified when you pass them into a Dataframe:
I'd guess the cost of copying a Column wouldn't be very much?
Or, I think this would be even more disruptive, but I think it would make more sense, consider if Column's didn't even have a name attribute, all they did was define what was INSIDE a column, but didn't know anything about where the data is actually stored.
Thank you!
I completely agree with you @NickCrews , the behavior feels strange to me as it goes against the immutable style found in pandas.
This is obviously not my project or anything, but I would highly prefer having the Columns copied. As to dropping the name attribute, I don't really know what it's used for under the hood so can't really comment on that.
Agreed on copying the Column (I don't think the cost of copying would be high)
I think for this issue, let's do the following:
- Change
Column.set_nameso that it makes a copy of the Column object - Remove the warning, since the original column is no longer modified in place.
As to dropping the name attribute, I don't really know what it's used for under the hood so can't really comment on that.
consider if Column's didn't even have a name attribute, all they did was define what was INSIDE a column, but didn't know anything about where the data is actually stored.
So schema components actually have slightly different semantics than schemas... they take in a data structure and validate some part of it, so you can actually use Column objects to validate a dataframe by itself (without having to use it within a DataFrameSchema), see these docs. This is the reason it has a name attribute.
@Anders-E @NickCrews do either of you want to make a PR for ^^?
I could do a PR for this. That change to Column.set_name would be breaking for users. Do you want any sort of warning or deprecation schedule? ie leave it as is, add a warning, and in 6 months actually do the change? IDK what Pandera's policy on this is.
That makes sense why Columns need to store the name, so that they can lookup the right column from a DF, thanks! I think with this change a lot of my friction should disappear.
That change to Column.set_name would be breaking for users
would it? how?
thanks @NickCrews !
I'm not super concerned that Column.set_name will be a breaking change... as you and @Anders-E pointed out, in-plce mutation of the Column when it's fed into a DataFrameSchema is a bug, in that it's sort of unexpected/unintuitive behavior.
I was thinking that someone might be directly calling Column.set_name, since that is part of the public API, outside of the context of a DataFreSchema that we've been talking about
I not super concerned that this will have a big impact, since I'm not sure how many people using Columns by themselves rely on the fact that it's mutated in the context of a DataFrameSchema... but you're right, to be safe let's manually to a copy.deepcopy of the Column before calling set_name here: https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L282-L289
Oops sorry I didn't see your comment "how?". I wasn't thinking of people relying on Column getting mutated in DataFrameSchema, I was thinking of this:
c1 = Column(name="name1")
c1.set_name("name2")
# keep doing stuff with c1, expecting it to be named `name2`.
# After this change though, it will still be named `name1`
How do you want to deal with this?
gotcha, agreed, so let's do what I mentioned in https://github.com/pandera-dev/pandera/issues/511#issuecomment-1119620703:
to be safe let's manually to a copy.deepcopy of the Column before calling set_name here: https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L282-L289
@cosmicBboy MANY months later 🤣 , but how's that PR look?
Thanks @NickCrews, and no worries about the delay!
Looks like the test tests/core/test_schemas.py::test_dataframe_reset_column_name needs to be modified, I don't think a warning is raised anymore.
Also, working on some CI issues: https://github.com/unionai-oss/pandera/pull/1056, may need to rebase on that once it's merged