pandantic
Construct as a Pandas plugin
Hi! So I was thinking of making a very similar project with one core difference: having the validator function as a Pandas plugin that takes a Pydantic BaseModel or Dataclass as an input.
For example:
```python
df.pandantic.validate(schema: pydantic.BaseModel | pydantic.dataclasses.dataclass)
```
See: https://pandas.pydata.org/docs/development/extending.html
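For illustration, here is a minimal sketch of what such an accessor could look like, following the registration pattern from those docs (the class body and `validate` signature are hypothetical, not pandantic's actual code):

```python
import pandas as pd
import pydantic


@pd.api.extensions.register_dataframe_accessor("pandantic")
class PandanticAccessor:
    """Illustrative accessor, exposed on every DataFrame as df.pandantic."""

    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._df = pandas_obj

    def validate(self, schema: type[pydantic.BaseModel]) -> pd.DataFrame:
        # Validate each row as a record; raises pydantic.ValidationError
        # on the first failing row.
        for record in self._df.to_dict(orient="records"):
            schema(**record)
        return self._df
```

With the accessor imported once, plain pandas usage becomes `df.pandantic.validate(MySchema)`.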
Wondering what you think about this refactor? I like the idea of being more agnostic to the type of Pydantic schema object being passed in, as Dataclasses are more analogous to a pandas data frame.
Additionally, it allows one to import and use normal Pydantic, instead of a wrapper. Normal pandas can be used too, as long as the plugin is imported.
If you are amenable to this idea, I am happy to make a PR. Otherwise I may just make my own project `pandas-pydantic`. I would keep your logic largely the same, and test whether it works with dataclasses as well.
Another option would be to create the pandas plugin from a shared set of functions such that either pattern works. This could be a good option to preserve backwards compatibility, and on second thought may be best. Thoughts? @wesselhuising
Hi @xaviernogueira,
Thank you for your interest and for taking the time to write the suggestions down. I was not aware of the extending functionality of pandas, which is indeed nice as it wouldn't need to be a fork of either of the two dependencies (currently).
The only challenge I see is that by doing so, you would devote the whole project to just one DataFrame package (in this case pandas). The ambition is to be agnostic to the type of dataframe (say polars or Spark dataframes, for example) rather than to the schema object type (in this case dataclasses). How would you suggest approaching this ambition?
Hi @wesselhuising, thanks for the response! That is a valid point; I did not realize that was your roadmap. IMO being agnostic on both sides (schema and dataframe) is probably best, mainly because the schema implementations are relatively similar.
Regarding implementation, I still think that inheriting from `BaseModel` is not the ideal approach. I don't know if you are familiar with Python Protocols and dependency injection as a concept, but this is a classic use case for them!

I would start by defining the protocol for both dataclass and basemodel validation. The way to think about this is that we are defining an interface that we can expect. A static type checker will make sure that any class used where one of the protocols is expected contains the expected function signatures (see here).
```python
# shared_types.py ... or something like that
import typing

import pandas as pd
import polars
import pydantic

DataFrameTypes = typing.Union[pd.DataFrame, polars.DataFrame]

# NOTE: pydantic.dataclasses.dataclass is a decorator, so this alias is
# illustrative; at runtime a pydantic dataclass is just a plain class.
SchemaTypes = typing.Union[pydantic.BaseModel, pydantic.dataclasses.dataclass]


@typing.runtime_checkable  # allows isinstance() checks against the protocol at runtime
class SupportsValidation(typing.Protocol):
    def dataclass_validate(self, schema: pydantic.dataclasses.dataclass, df: DataFrameTypes):
        ...

    def model_validate(self, schema: pydantic.BaseModel, df: DataFrameTypes):
        ...
```
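One note on `runtime_checkable`: it makes `isinstance()` checks against the protocol legal, but those checks only verify that methods with the right names exist; matching the signatures is the static type checker's job. A tiny demonstration with a hypothetical stub class, continuing from `shared_types.py` above:

```python
class PandasValidation:
    def dataclass_validate(self, schema, df):
        ...

    def model_validate(self, schema, df):
        ...


# Legal only because SupportsValidation is @runtime_checkable:
assert isinstance(PandasValidation(), SupportsValidation)
```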
Next, in a different file, I would define a class that is initialized with a schema and takes any dataframe type as an argument to `validate()`. This class will have the responsibility of fetching the correct protocol implementation for each dataframe type. See below.
```python
# validator.py
import dataclasses

import pydantic

from shared_types import (
    DataFrameTypes,
    SchemaTypes,
    SupportsValidation,
)


class DataFrameValidator:
    def __init__(self, schema: SchemaTypes) -> None:
        self.schema = schema

    @property
    def validator_function(self) -> str:
        if isinstance(self.schema, type) and issubclass(self.schema, pydantic.BaseModel):
            return "model_validate"
        elif dataclasses.is_dataclass(self.schema):
            # pydantic dataclasses are regular dataclasses under the hood
            return "dataclass_validate"
        raise TypeError(f"Unsupported schema type: {self.schema!r}")

    @staticmethod
    def get_implementation(df: DataFrameTypes) -> SupportsValidation:
        """Returns the dataframe-library-specific class that implements the protocol."""
        ...

    def validate(self, df: DataFrameTypes):
        implementation: SupportsValidation = self.get_implementation(df)
        return getattr(implementation, self.validator_function)(self.schema, df)
```
That would basically be it! All of your existing code would then live in a pandas implementation of the `SupportsValidation` protocol.
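For concreteness, here is a rough sketch of that pandas implementation (the module and class names are hypothetical, and the row-wise logic is just a stand-in for your existing parsing code):

```python
# pandas_validation.py -- hypothetical module
import pandas as pd
import pydantic


class PandasValidation:
    """pandas implementation of the SupportsValidation protocol."""

    def model_validate(self, schema: type[pydantic.BaseModel], df: pd.DataFrame) -> list:
        # Stand-in for the existing row-wise parsing logic; error
        # aggregation and reporting omitted for brevity.
        return [schema(**record) for record in df.to_dict(orient="records")]

    def dataclass_validate(self, schema, df: pd.DataFrame) -> list:
        # pydantic dataclasses also validate on construction.
        return [schema(**record) for record in df.to_dict(orient="records")]
```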
Advantages:

- A single, simple user interface via `DataFrameValidator`.
- Can arbitrarily add more implementations for any new table library.
- Additionally (I didn't show this here), we could add a way for the user to pass in (aka "inject") their own implementation of the `SupportsValidation` protocol, like an `override_validator` kwarg for `validate()` (see the sketch after this list). The framework is therefore somewhat plugin-able.
- Your code can be re-used, just moved to a pandas implementation of `SupportsValidation`.
- One can also validate different dataframe types across one class instance: given a list of mixed dataframe types, you can just pass them all into `validate()` freely.
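To make that concrete, here is hypothetical end-user code, assuming the `override_validator` kwarg from the third bullet were added (all names are placeholders):

```python
validator = DataFrameValidator(schema=MySchema)

# Mixed dataframe types validated through one instance:
validator.validate(pandas_df)
validator.validate(polars_df)

# Dependency injection: a user-supplied SupportsValidation implementation
# bypasses the built-in get_implementation() lookup.
validator.validate(other_df, override_validator=MyCustomValidation())
```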
Thoughts? I am happy to hop on a call with you at some point if you are interested in making this happen. I think this is a very useful library you have, and it deserves to be well-structured for expansion!
Hi @xaviernogueira,
Thank you again for your in-depth reply. I definitely agree that the way Pydantic is pulled in as a dependency is not ideal. I wanted to mimic the `parse_obj` method from their API by creating the `parse_df` method, but the result is indeed that the package is more like a *fork* than a *stand-alone package*. So I am definitely open to a refactor like the one you proposed.
The only thing is that adding a class like `DataFrameValidator` means an extra import, whereas the current approach needs only one import, since the validator is a subclass of `BaseModel`.
I would like to have a call and pick your brain on this, it is something I think we can look into as creating another package sounds cumbersome to me. Can you add me on LinkedIn?
Adding you! Managed to catch covid and feel terrible so let me get back to you in a few days. @wesselhuising
Work on this has restarted!
Note that for pandas you can have a `DataFrame`, `Series`, and `Index` plugin. I will implement `DataFrame` first, then `Series` assuming a string index. This is relevant if you want to use `pydantic` models to validate a specific row, or a column-oriented set of values (with the index as the names).
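For reference, the `Series` case could use the same registration hook as the `DataFrame` accessor (a minimal sketch; the class body is illustrative, not the merged implementation):

```python
import pandas as pd
import pydantic


@pd.api.extensions.register_series_accessor("pandantic")
class PandanticSeriesAccessor:
    def __init__(self, pandas_obj: pd.Series) -> None:
        self._series = pandas_obj

    def validate(self, schema: type[pydantic.BaseModel]) -> pd.Series:
        # Treat the string index as field names and the values as a single
        # record -- the row/column-oriented case described above.
        schema(**self._series.to_dict())
        return self._series
```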
This was implemented with PR #22! Closing.