pandera
add "nullable" option to SchemaModel Config class
I have a situation where almost all of the columns in my schemas are nullable, and it would be nice to set `nullable = True` as a config option instead of setting `nullable=True` for every column. For example, instead of this:
```python
import pandera as pa
from pandera.typing import Series, DataFrame


class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float] = pa.Field(nullable=True)
    nullable_col_2: Series[float] = pa.Field(nullable=True)
    nullable_col_3: Series[float] = pa.Field(nullable=True)
    nullable_col_4: Series[float] = pa.Field(nullable=True)
    nullable_col_5: Series[float]
    nullable_col_6: Series[float] = pa.Field(nullable=True)
```
I'd love to be able to do this or something like it:
```python
import pandera as pa
from pandera.typing import Series, DataFrame


class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float]
    nullable_col_2: Series[float]
    nullable_col_3: Series[float]
    nullable_col_4: Series[float]
    nullable_col_5: Series[float] = pa.Field(nullable=False)
    nullable_col_6: Series[float]

    class Config:
        nullable = True
```
Hi @benlindsay. I agree repeating `nullable` can be verbose and cumbersome.

Pandera strives to keep feature parity between `SchemaModel` and `DataFrameSchema`, so we would need to introduce a similar option to `DataFrameSchema`. This is similar to the `coerce` option that can be set at the schema level with both the `SchemaModel` and `DataFrameSchema` APIs. We should also have similar defaults for `required`, and possibly `unique` and `allow_duplicates`. The default `pandera.Check(ignore_na=True)` is also often asked about. Can you think of other candidates?
An alternative would be a global config similar to the pandas options, which allows overriding defaults both globally and locally (via a context manager):
```python
# global, applies to all schemas
pandera.options.nullable = True

# local
with pandera.options("nullable", True):

    class MySchema(pa.SchemaModel):
        nullable_col_1: Series[float]
        col_2: Series[float] = pa.Field(nullable=False)
```
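For illustration, a minimal sketch of how such a local/global options mechanism could work. This is entirely hypothetical: pandera does not currently ship an `options` object, and the names below are assumptions, in the spirit of pandas' `option_context`:

```python
from contextlib import contextmanager

# Hypothetical global options store (assumption: not part of pandera's API).
_options = {"nullable": False}


def get_option(name):
    """Look up the current value of a global default."""
    return _options[name]


@contextmanager
def option_context(name, value):
    """Temporarily override a global default, restoring it on exit."""
    old = _options[name]
    _options[name] = value
    try:
        yield
    finally:
        _options[name] = old


assert get_option("nullable") is False
with option_context("nullable", True):
    assert get_option("nullable") is True  # local override in effect
assert get_option("nullable") is False     # restored afterwards
```

The `try`/`finally` ensures the global default is restored even if schema definition inside the block raises.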
One downside of a global config is that you would need to read both the schema definition and the config to get the complete picture. The config could be stored in a separate Python module, which could lead to surprising results for someone who has only read the schema. I personally prefer the schema to be self-contained, without relying on side effects. On the upside, this mechanism avoids the proliferation of schema arguments.
An imperfect solution to reduce verbosity that you can apply right away:
```python
from typing import Any

import pandera as pa
from pandera.typing import Series


# SchemaModel
def Nullable(*args: Any, **kwargs: Any) -> pa.Field:
    """Field with nullable=True by default."""
    kwargs["nullable"] = True
    return pa.Field(*args, **kwargs)


class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float] = Nullable()
    col_2: Series[float]


# DataFrameSchema
def NullableCol(dtype: Any, *args: Any, **kwargs: Any) -> pa.Column:
    """Column with nullable=True by default."""
    kwargs["nullable"] = True
    return pa.Column(dtype, *args, **kwargs)


schema = pa.DataFrameSchema(
    {"nullable_col_1": NullableCol(float), "col_2": pa.Column(float)}
)
```
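One subtlety worth noting: because the wrapper assigns `kwargs["nullable"] = True` unconditionally, a call like `Nullable(nullable=False)` would silently be forced back to `True`. A stand-in sketch (a plain dict-returning function in place of `pa.Field`, so pandera is not required) shows how `setdefault` would preserve an explicit override instead:

```python
from typing import Any


def field(**kwargs: Any) -> dict:
    # Stand-in for pa.Field: just records the keyword arguments it received.
    return kwargs


def nullable_field(**kwargs: Any) -> dict:
    # setdefault only fills in nullable when the caller didn't set it,
    # so an explicit nullable=False still wins.
    kwargs.setdefault("nullable", True)
    return field(**kwargs)


assert nullable_field() == {"nullable": True}
assert nullable_field(nullable=False) == {"nullable": False}
assert nullable_field(ge=0) == {"ge": 0, "nullable": True}
```

Whether forcing `True` or allowing an override is the right behavior depends on how the helper is meant to be used; the original snippet's unconditional assignment is fine if callers never pass `nullable` themselves.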
@jeffzi This one is a lifesaver, merci beaucoup!
If you're into `functools`, you can also do something like:

```python
from functools import partial

import pandera as pa

NullableField = partial(pa.Field, nullable=True)
```
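A side note on `partial` semantics, illustrated with a plain stand-in for `pa.Field` so the snippet runs without pandera: keywords passed at the call site take precedence over the presets, so an explicit `NullableField(nullable=False)` is honored rather than forced back to `True`:

```python
from functools import partial
from typing import Any


def field(**kwargs: Any) -> dict:
    # Stand-in for pa.Field: returns the keyword arguments it received.
    return kwargs


NullableField = partial(field, nullable=True)

assert NullableField() == {"nullable": True}
# Call-site keywords override partial's presets:
assert NullableField(nullable=False) == {"nullable": False}
```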
Would welcome a PR to add this option at the dataframe-schema level!

To open up a discussion about the semantics of options at the dataframe and field (column and index) levels: the existing `coerce` option currently behaves such that `coerce=True` overrides any field with `coerce=False`. This seems unintuitive to me, i.e. it seems like `DataFrameSchema(..., coerce=True)` should define the default global setting, and any options set at more granular levels should override the global setting.

Similarly, `DataFrameSchema(..., nullable=True)` should define the global setting, and `Column(..., nullable=False)` should override that.

What do y'all think @benlindsay @jeffzi @vovavili ?
@cosmicBboy That sounds like a dream option for me! Would save me so much time. Full support.
> it seems like DataFrameSchema(..., coerce=True) should define the default global setting and any options at more granular levels should override the global setting.

Agreed. To be accurate, it should be "any *explicitly set* options at more granular levels should override the global setting." `coerce=False` is the default field option; if omitted, the global `coerce=True` should indeed override it. Technically, we'd need a sentinel value to differentiate unset arguments from defaults. See the unapproved PEP 661 – Sentinel Values.
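The sentinel pattern can be sketched as follows (a hypothetical `Column` stand-in for illustration, not pandera's actual class):

```python
from typing import Any

# Module-level sentinel: distinct from None and False, so we can tell
# "argument not passed" apart from an explicit value (the problem
# PEP 661 aims to standardize).
_UNSET: Any = object()


class Column:
    def __init__(self, nullable: Any = _UNSET) -> None:
        self._nullable = nullable

    def resolve_nullable(self, schema_default: bool) -> bool:
        # An explicitly set column-level value wins; otherwise fall
        # back to the dataframe-level default.
        if self._nullable is _UNSET:
            return schema_default
        return bool(self._nullable)


assert Column().resolve_nullable(schema_default=True) is True
assert Column(nullable=False).resolve_nullable(schema_default=True) is False
```

Because the sentinel is compared with `is`, even `nullable=None` or `nullable=False` counts as "explicitly set" and suppresses the dataframe-level default.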
I ran into the need for this today as well. My output schemas all coerce on input and are nullable on all outputs across an API surface; I was going to make a base class for the output models, but I couldn't do the nullable-output bit.
Cool. @blais the pandera internals rewrite is pretty much done (just have to clean up a few more things), but after that this feature should be fairly easy to support.
I might consider contributing to this one.
Could you briefly mention which files/classes should be modified or created? Not sure where to start. @cosmicBboy
Thank you,
Thanks @aidiss !
So I think a reasonable approach here is to support a dataframe-level default that can be overridden at the schema-component (column or index) level.

Here are the changes that need to be made:

- Add a nullable option to `DataFrameSchema.__init__`, which should be stored as a `self.nullable` instance attribute. This should be `None` by default.
- Change the default value of nullable in `{ArraySchema, SeriesSchema, Column, Index}.__init__` to `None` so that we can get the correct behavior in the point below. We will also need to turn `self.nullable` into a private variable `self._nullable` and expose `nullable` as a `@property` which returns `False` if `self._nullable` is `None`. This is so that we can distinguish between default behavior and user-provided values, as explained in the point below.
- The df-level default should be propagated at validation time so we don't risk changing the state of the schema components: basically, the `DataFrameSchemaBackend.collect_schema_components` method needs to be updated so that the df-level value is set on the `Column._nullable` property of the copied `col` object. However, if `Column._nullable` is not `None`, i.e. the user provided a value, then the df-level nullable value shouldn't be applied to the column. This works out nicely because the `Column.nullable` `@property` method will default to `False` if `Column._nullable` is `None`. (By the way, do we want to support nullability in the `Index`? In that case we'll need to propagate that logic to the index schema component as well.)
- The `nullable` attribute needs to be added to the `BaseConfig` class for the class-based API.
- Update the `kwargs` here to include the new option.
- Add tests for `DataFrameSchema` here and for `DataFrameModel` here.
- Update this docs page to include a subheading explaining the behavior of this new option.
Be sure to check out the contributing guide before you get started, and let me know if you have any questions! I'll be OOO for the next two weeks but can answer any questions you have after I get back from vacation.