
add "nullable" option to SchemaModel Config class

Open benlindsay opened this issue 2 years ago • 12 comments

I have a situation where almost all of the columns in my schemas are nullable, and it would be nice to set nullable = True as a config option instead of setting nullable=True for every column. For example, instead of this:

import pandera as pa
from pandera.typing import Series, DataFrame

class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float] = pa.Field(nullable=True)
    nullable_col_2: Series[float] = pa.Field(nullable=True)
    nullable_col_3: Series[float] = pa.Field(nullable=True)
    nullable_col_4: Series[float] = pa.Field(nullable=True)
    nullable_col_5: Series[float]
    nullable_col_6: Series[float] = pa.Field(nullable=True)

I'd love to be able to do this or something like it:

import pandera as pa
from pandera.typing import Series, DataFrame

class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float]
    nullable_col_2: Series[float]
    nullable_col_3: Series[float]
    nullable_col_4: Series[float]
    nullable_col_5: Series[float] = pa.Field(nullable=False)
    nullable_col_6: Series[float]

    class Config:
        nullable = True

benlindsay avatar Jan 20 '22 20:01 benlindsay

Hi @benlindsay. I agree repeating nullable can be verbose and cumbersome.

Pandera strives to keep feature parity between SchemaModel and DataFrameSchema, so we would need to introduce a similar option to DataFrameSchema. This is similar to the coerce option, which can be set at the schema level with both the SchemaModel and DataFrameSchema APIs. We should also have similar defaults for required, and possibly unique and allow_duplicates. The default pandera.Check(ignore_na=True) is also often asked about. Can you think of other candidates?

An alternative would be a global config similar to the pandas options. It would allow defaults to be overridden globally or locally (via a context manager).


# global, applies to all schemas
pandera.options.nullable = True

# local
with pandera.options("nullable", True):

    class MySchema(pa.SchemaModel):
        nullable_col_1: Series[float]
        col_2: Series[float] = pa.Field(nullable=False)

One downside of a global config is that you would need to read both the schema definition and config to have the complete picture. The config could be stored in a separate python module, which could lead to surprising results for someone who only read the schema. I personally prefer the schema to be self-contained, without relying on side effects. On the upside, this mechanism avoids the proliferation of schema arguments.
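
For illustration, here is a minimal pure-Python sketch of how a global option with a local override could behave. Nothing here is pandera API: `_OPTIONS`, `get_option`, `option_context`, and `make_field` are hypothetical names, modeled loosely on pandas' `option_context`.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator, Optional

# Hypothetical global registry of defaults; not part of pandera.
_OPTIONS: Dict[str, Any] = {"nullable": False}


def get_option(name: str) -> Any:
    """Return the current value of a global option."""
    return _OPTIONS[name]


@contextmanager
def option_context(name: str, value: Any) -> Iterator[None]:
    """Temporarily override a global option, restoring it on exit."""
    previous = _OPTIONS[name]
    _OPTIONS[name] = value
    try:
        yield
    finally:
        _OPTIONS[name] = previous


def make_field(nullable: Optional[bool] = None) -> Dict[str, Any]:
    """Stand-in field constructor that reads the default when it is called."""
    resolved = get_option("nullable") if nullable is None else nullable
    return {"nullable": resolved}


with option_context("nullable", True):
    inside = make_field()                  # picks up the local default
    explicit = make_field(nullable=False)  # an explicit value wins

outside = make_field()                     # the global default is restored
```

The sketch also makes the downside visible: what `make_field()` returns depends on state that lives outside the schema definition.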

An imperfect solution to reduce verbosity that you can apply right away:

from typing import Any
import pandera as pa
from pandera.typing import Series

# SchemaModel


def Nullable(*args: Any, **kwargs: Any) -> pa.Field:
    kwargs["nullable"] = True
    return pa.Field(*args, **kwargs)


class MySchema(pa.SchemaModel):
    nullable_col_1: Series[float] = Nullable()
    col_2: Series[float]


# DataFrameSchema


def NullableCol(dtype: Any, *args: Any, **kwargs: Any) -> pa.Column:
    """Column with nullable=True by default."""
    kwargs["nullable"] = True
    return pa.Column(dtype, *args, **kwargs)


schema = pa.DataFrameSchema(
    {"nullable_col_1": NullableCol(float), "col_2": pa.Column(float)}
)

jeffzi avatar Jan 21 '22 10:01 jeffzi

@jeffzi This one is a lifesaver, merci beaucoup!

vovavili avatar Jan 27 '22 22:01 vovavili

if you're into functools, you can also do something like:

from functools import partial
import pandera as pa

NullableField = partial(pa.Field, nullable=True)
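
One difference between the two workarounds is worth noting: keyword arguments bound by `functools.partial` can still be overridden at call time, whereas a wrapper that assigns `kwargs["nullable"] = True` always forces the value. The sketch below uses hypothetical stand-ins (`field`, `forced_nullable`) instead of `pa.Field`, so it runs without pandera.

```python
from functools import partial
from typing import Any, Dict


def field(**kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for pa.Field: records its keyword arguments."""
    return kwargs


def forced_nullable(**kwargs: Any) -> Dict[str, Any]:
    """Wrapper style: nullable=True is assigned last, so it always wins."""
    kwargs["nullable"] = True
    return field(**kwargs)


# partial style: nullable=True is only a default for the call.
nullable_field = partial(field, nullable=True)

default_case = nullable_field()              # uses the bound default
overridden = nullable_field(nullable=False)  # the caller's value replaces it
forced = forced_nullable(nullable=False)     # the caller's value is ignored
```

With the `partial` version a single column can opt out by passing `nullable=False`; with the wrapper version it cannot.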

cosmicBboy avatar Mar 26 '22 16:03 cosmicBboy

Would welcome a PR to add this option at the dataframe-schema level!

To open up a discussion about the semantics of options at the dataframe- and field- (column and index) level, the existing option of coerce has the current behavior: coerce=True will override any field with coerce=False. This seems unintuitive to me... i.e. it seems like DataFrameSchema(..., coerce=True) should define the default global setting and any options at more granular levels should override the global setting.

Similarly, DataFrameSchema(..., nullable=True) should define the global setting and Column(..., nullable=False) should override that.

What do y'all think @benlindsay @jeffzi @vovavili ?

cosmicBboy avatar Mar 26 '22 16:03 cosmicBboy

@cosmicBboy That sounds like a dream option for me! Would save me so much time. Full support.

vovavili avatar Mar 28 '22 05:03 vovavili

it seems like DataFrameSchema(..., coerce=True) should define the default global setting and any options at more granular levels should override the global setting.

Agreed. To be accurate, it should be "any explicitly set options at more granular levels should override the global setting." coerce=False is the default field option; if it is omitted, the global coerce=True should indeed override it. Technically, we'd need a sentinel value to differentiate unset arguments from defaults. See the unapproved PEP 661 - Sentinel values.

jeffzi avatar Mar 28 '22 17:03 jeffzi

I ran against the need for this today as well. My output schemas are all coerce on input and nullable on all outputs across an API surface; I was going to make a base class for the output models but I couldn't do the output bit.

blais avatar Feb 22 '23 22:02 blais

cool. @blais the pandera internals re-write is pretty much done (just have to clean up a few more things) but after that this feature should be fairly easy to support.

cosmicBboy avatar Mar 09 '23 18:03 cosmicBboy

I might consider contributing to this one.

Could you briefly mention which files and classes should be modified or created? I'm not sure where to start. @cosmicBboy

aidiss avatar May 10 '23 10:05 aidiss

Thank you,

blais avatar May 11 '23 01:05 blais

Thanks @aidiss !

So I think a reasonable approach here is to support a dataframe-level default that can be overridden at the schema-component (column or index) level.

Here are the changes that need to be made:

  • Add nullable option at DataFrameSchema.__init__ which should be stored as a self.nullable instance attribute. This should be None by default.
  • Need to change the default value of nullable in {ArraySchema, SeriesSchema, Column, Index}.__init__ to None so that we can get the correct behavior in the point below. Will also need to turn self.nullable into a private variable self._nullable, and expose nullable as a @property, which returns False if self._nullable is None. This is so that we can distinguish between default behavior and user-provided values, as explained in the point below.
  • The df-level default should be propagated at validation time so we don't risk changing the state of the schema components, so basically the DataFrameSchemaBackend.collect_schema_components method needs to be updated so that the df-level value is set on the Column._nullable attribute of the copied col object. However, if Column._nullable is not None, i.e. the user provided a value, then the df-level nullable value shouldn't be applied to the column. This works out nicely because the Column.nullable @property method will default to False if Column._nullable is None. (btw, do we want to support nullability in the Index? In that case we'll need to propagate that logic to apply to the index schema component as well.)
  • The nullable attribute needs to be added to the BaseConfig class for the class-based API.
  • Update the kwargs here, to include the new option.
  • Add tests for DataFrameSchema here and for DataFrameModel here.
  • Update this docs page to include a subheading explaining the behavior of this new option.
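
The core of the plan above can be sketched in plain Python. This is only an illustration under stated assumptions: `Column` and `DataFrameSchema` below are simplified stand-ins, not pandera's classes, and `collect_schema_components` is placed on the schema for brevity even though the plan puts it on `DataFrameSchemaBackend`.

```python
import copy
from typing import Dict, Optional


class Column:
    """Simplified stand-in for a schema component."""

    def __init__(self, nullable: Optional[bool] = None) -> None:
        # None means "the user did not provide a value".
        self._nullable = nullable

    @property
    def nullable(self) -> bool:
        # Fall back to False when no value was provided.
        return False if self._nullable is None else self._nullable


class DataFrameSchema:
    """Simplified stand-in for the dataframe-level schema."""

    def __init__(
        self, columns: Dict[str, Column], nullable: Optional[bool] = None
    ) -> None:
        self.columns = columns
        self.nullable = nullable

    def collect_schema_components(self) -> Dict[str, Column]:
        """Return column copies with the df-level default applied."""
        collected = {}
        for name, column in self.columns.items():
            col = copy.deepcopy(column)  # never mutate the user's schema
            if col._nullable is None and self.nullable is not None:
                col._nullable = self.nullable
            collected[name] = col
        return collected


schema = DataFrameSchema(
    {"a": Column(), "b": Column(nullable=False)},
    nullable=True,
)
resolved = schema.collect_schema_components()
```

Here `resolved["a"]` inherits the df-level `nullable=True`, `resolved["b"]` keeps its explicit `nullable=False`, and the columns stored on `schema` are left unchanged because only copies are modified.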

Be sure to check out the contributing guide before you get started, and let me know if you have any questions! I'll be OOO for the next two weeks but can answer any questions you have after I get back from vacation.

cosmicBboy avatar May 11 '23 20:05 cosmicBboy