pandera
[Docs] Better documentation on data type overriding behavior
Location of the documentation
https://pandera.readthedocs.io/en/stable/dtypes.html
Documentation problem
The way the pandera data type system works should be better documented, as described by @jeffzi: https://discord.com/channels/897120336003334214/1007034449441071164/1010851299992018998
re: dtype-overriding. The engine maintains a dict lookup (equivalent object -> DataType) where the equivalent object can be any Python object. In practice it's a string alias, a class (e.g. datetime.datetime), or an instance for convenience (e.g. dtypes.Timestamp()). There can be only one entry per equivalent object, so any DataType that re-uses an equivalent object will override the existing entry in the lookup. Overriding follows the order of the definitions. Built-in DataTypes come first since they are loaded with pandera; after that, it depends on the loading order of user-defined DataTypes. That mechanism allows you to selectively override equivalences.
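The lookup behavior described above can be sketched with a toy registry. This is only an illustration of the "last definition wins, per equivalent object" rule, not pandera's actual implementation; `ToyEngine`, `BuiltinDateTime`, and `UserDateTime` are hypothetical names invented for the example.

```python
import datetime
from typing import Dict, Hashable


class ToyEngine:
    """Toy stand-in for an engine registry (hypothetical, not pandera code)."""

    def __init__(self) -> None:
        # equivalent object -> data type class
        self._registry: Dict[Hashable, type] = {}

    def register(self, *equivalents: Hashable):
        """Class decorator that maps each equivalent object to the decorated class."""

        def decorator(cls: type) -> type:
            for equivalent in equivalents:
                # A plain dict assignment: a later registration silently
                # replaces an earlier one for the same equivalent object.
                self._registry[equivalent] = cls
            return cls

        return decorator

    def lookup(self, equivalent: Hashable) -> type:
        return self._registry[equivalent]


engine = ToyEngine()


# Defined first, like a built-in type loaded together with the library.
@engine.register("datetime", datetime.datetime)
class BuiltinDateTime:
    pass


# Defined later, like a user type that re-uses only the string alias.
@engine.register("datetime")
class UserDateTime(BuiltinDateTime):
    pass


print(engine.lookup("datetime").__name__)         # UserDateTime
print(engine.lookup(datetime.datetime).__name__)  # BuiltinDateTime
```

In this sketch only the "datetime" alias is taken over by the later definition; the `datetime.datetime` class still resolves to the earlier type, which is what selectively overriding equivalences amounts to.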
@brogrammer89 If you repeat the same equivalents as the built-in DateTime you will completely override the built-in behavior. You can reuse the default behavior by calling super() in the methods you override, since you inherit from pandas_engine.DateTime.
Using a new alias will have no effect. In the case of a string alias, the engine eventually falls back to pandas.api.types.pandas_dtype to infer the data type. See https://github.com/unionai-oss/pandera/blob/7399dd41467309251b2052df4457be6426d0b163/pandera/engines/pandas_engine.py#L171 If your new alias is not a known pandas alias, the engine will fail to find a match in its registry of equivalent objects. However, when given a pandera DataType, the engine takes it at face value and does not attempt to infer anything (see https://github.com/unionai-oss/pandera/blob/7399dd41467309251b2052df4457be6426d0b163/pandera/engines/engine.py#L186)
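The resolution behavior described above can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not the engine's real code: `ToyDataType`, `ToyDateTime`, `REGISTRY`, `CANONICAL_ALIASES`, and `resolve` are hypothetical names, and the `CANONICAL_ALIASES` dict merely stands in for the pandas-based inference step.

```python
class ToyDataType:
    """Base class for the toy data types (hypothetical)."""


class ToyDateTime(ToyDataType):
    pass


class MyDateTime(ToyDateTime):
    """A user-defined type that is deliberately NOT registered."""


# equivalent object -> data type class
REGISTRY = {"datetime": ToyDateTime}

# Stand-in for the fallback inference on string aliases: maps aliases the
# backend already knows to a key that exists in the registry.
CANONICAL_ALIASES = {"datetime64[ns]": "datetime"}


def resolve(obj):
    """Return a data type instance for obj, or raise TypeError."""
    # 1. A data type (instance or class) is taken at face value.
    if isinstance(obj, ToyDataType):
        return obj
    if isinstance(obj, type) and issubclass(obj, ToyDataType):
        return obj()
    # 2. Otherwise look the object up among the registered equivalents.
    if obj in REGISTRY:
        return REGISTRY[obj]()
    # 3. String aliases get one more chance through the fallback inference.
    if isinstance(obj, str) and obj in CANONICAL_ALIASES:
        return REGISTRY[CANONICAL_ALIASES[obj]]()
    raise TypeError(f"Data type {obj!r} not understood")


print(type(resolve(MyDateTime)).__name__)        # MyDateTime
print(type(resolve("datetime64[ns]")).__name__)  # ToyDateTime
```

In this model, `resolve("my_datetime")` raises `TypeError` because the alias is neither registered nor known to the fallback, while the unregistered `MyDateTime` class is accepted as-is.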
Therefore, I would advise skipping the @register_dtype decorator and using your new type directly in your schema:
import pandera as pa
from pandera.typing import Series
from pandera.engines import pandas_engine
from pandera import dtypes


@dtypes.immutable
class MyDateTime(pandas_engine.DateTime):
    pass


class Schema(pa.SchemaModel):
    LAST_ALTERED: Series[MyDateTime]


type(Schema.to_schema().dtypes["LAST_ALTERED"])  # MyDateTime
Suggested fix for documentation
The documentation should include a sub-header describing how the data type system works in relation to how equivalents are overridden.