pandantic
Construct as a Pandas plugin
Hi! So I was thinking of making a very similar project with one core difference: having the validator function as a Pandas plugin that takes a Pydantic BaseModel or Dataclass as an input.
For example:
```python
df.pandantic.validate(schema: pydantic.BaseModel | pydantic.dataclasses.dataclass)
```
See: https://pandas.pydata.org/docs/development/extending.html
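For illustration, here is a minimal sketch of what such an accessor could look like, following the registration pattern from those docs (the class body and `validate` signature are hypothetical, not pandantic's actual code):

```python
import pandas as pd
import pydantic


@pd.api.extensions.register_dataframe_accessor("pandantic")
class PandanticAccessor:
    """Illustrative accessor, exposed on every DataFrame as df.pandantic."""

    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._df = pandas_obj

    def validate(self, schema: type[pydantic.BaseModel]) -> pd.DataFrame:
        # Validate each row as a record; raises pydantic.ValidationError
        # on the first failing row.
        for record in self._df.to_dict(orient="records"):
            schema(**record)
        return self._df
```

With the accessor imported once, plain pandas usage becomes `df.pandantic.validate(MySchema)`.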
Wondering what you think about this refactor? I like the idea of being more agnostic to the type of Pydantic schema object being passed in, as Dataclasses are more analogous to a pandas data frame.
Additionally, it allows one to import and use normal Pydantic, instead of a wrapper. Normal pandas can be used too, as long as the plugin is imported.
If you are amenable to this idea, I am happy to make a PR. Otherwise I may just make my own project `pandas-pydantic`. I would keep your logic largely the same, and test whether it works with dataclasses as well.
Another option would be to create the pandas plugin from a shared set of functions such that either pattern works. This could be a good option to preserve backwards compatibility, and on second thought may be best. Thoughts? @wesselhuising
Hi @xaviernogueira,
Thank you for your interest and for taking the time to write the suggestions down. I was not aware of the extending functionality of pandas, which is indeed nice as it wouldn't need to be a fork of either of the two dependencies (currently).
The only challenge I see is that by doing so, you would devote the whole project to just one DataFrame package (in this case pandas). The ambition is to be agnostic to the type of dataframe (say polars or Spark dataframes, for example) rather than to the schema object type (in this case dataclasses). How would you suggest approaching this ambition?
Hi @wesselhuising, thanks for the response! That is a valid point; I did not realize that was your roadmap. IMO being agnostic on both sides (schema and dataframe) is probably best, mainly because the schema implementations are relatively similar.
Regarding implementation, I still think that inheriting from `BaseModel` is not the ideal approach. I don't know if you are familiar with Python Protocols and dependency injection as a concept, but this is a classic use case for them!

I would start by defining the protocol for both dataclass and basemodel validation. The way to think about this is that we are defining an interface that we can expect. A static type checker will make sure that any class used where one of the protocols is expected contains the expected function signatures (see here).
```python
# shared_types.py ... or something like that
import typing

import pandas as pd
import polars
import pydantic

DataFrameTypes = typing.Union[pd.DataFrame, polars.DataFrame]

# NOTE: pydantic.dataclasses.dataclass is a decorator, so this alias is
# illustrative; at runtime a pydantic dataclass is just a plain class.
SchemaTypes = typing.Union[pydantic.BaseModel, pydantic.dataclasses.dataclass]


@typing.runtime_checkable  # allows isinstance() checks against the protocol at runtime
class SupportsValidation(typing.Protocol):
    def dataclass_validate(self, schema: pydantic.dataclasses.dataclass, df: DataFrameTypes):
        ...

    def model_validate(self, schema: pydantic.BaseModel, df: DataFrameTypes):
        ...
```
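One note on `runtime_checkable`: it makes `isinstance()` checks against the protocol legal, but those checks only verify that methods with the right names exist; matching the signatures is the static type checker's job. A tiny demonstration with a hypothetical stub class, continuing from `shared_types.py` above:

```python
class PandasValidation:
    def dataclass_validate(self, schema, df):
        ...

    def model_validate(self, schema, df):
        ...


# Legal only because SupportsValidation is @runtime_checkable:
assert isinstance(PandasValidation(), SupportsValidation)
```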
Next, in a different file, I would define a class that is initialized with a schema and takes any dataframe type as an argument to `validate()`. This class will have the responsibility of fetching the correct protocol implementation for each dataframe type. See below.
```python
# validator.py
import dataclasses

import pydantic

from shared_types import (
    DataFrameTypes,
    SchemaTypes,
    SupportsValidation,
)


class DataFrameValidator:
    def __init__(self, schema: SchemaTypes) -> None:
        self.schema = schema

    @property
    def validator_function(self) -> str:
        if isinstance(self.schema, type) and issubclass(self.schema, pydantic.BaseModel):
            return "model_validate"
        elif dataclasses.is_dataclass(self.schema):
            # pydantic dataclasses are regular dataclasses under the hood
            return "dataclass_validate"
        raise TypeError(f"Unsupported schema type: {self.schema!r}")

    @staticmethod
    def get_implementation(df: DataFrameTypes) -> SupportsValidation:
        """Returns the dataframe-library-specific class that implements the protocol."""
        ...

    def validate(self, df: DataFrameTypes):
        implementation: SupportsValidation = self.get_implementation(df)
        return getattr(implementation, self.validator_function)(self.schema, df)
```
That would basically be it! All of your existing code would then live in a pandas implementation of the `SupportsValidation` protocol.
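For concreteness, here is a rough sketch of that pandas implementation (the module and class names are hypothetical, and the row-wise logic is just a stand-in for your existing parsing code):

```python
# pandas_validation.py -- hypothetical module
import pandas as pd
import pydantic


class PandasValidation:
    """pandas implementation of the SupportsValidation protocol."""

    def model_validate(self, schema: type[pydantic.BaseModel], df: pd.DataFrame) -> list:
        # Stand-in for the existing row-wise parsing logic; error
        # aggregation and reporting omitted for brevity.
        return [schema(**record) for record in df.to_dict(orient="records")]

    def dataclass_validate(self, schema, df: pd.DataFrame) -> list:
        # pydantic dataclasses also validate on construction.
        return [schema(**record) for record in df.to_dict(orient="records")]
```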
Advantages:

- A single, simple user interface via `DataFrameValidator`.
- Can arbitrarily add more implementations for any new table library.
- Additionally (I didn't show this here), we could add a way for the user to pass in (aka "inject") their own implementation of the `SupportsValidation` protocol, like an `override_validator` kwarg for `validate()` (see the sketch after this list). The framework is therefore somewhat plugin-able.
- Your code can be re-used, just moved to a pandas implementation of `SupportsValidation`.
- One can also validate different dataframe types across one class instance: given a list of mixed dataframe types, you can just pass them all into `validate()` freely.
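To make that concrete, here is hypothetical end-user code, assuming the `override_validator` kwarg from the third bullet were added (all names are placeholders):

```python
validator = DataFrameValidator(schema=MySchema)

# Mixed dataframe types validated through one instance:
validator.validate(pandas_df)
validator.validate(polars_df)

# Dependency injection: a user-supplied SupportsValidation implementation
# bypasses the built-in get_implementation() lookup.
validator.validate(other_df, override_validator=MyCustomValidation())
```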
Thoughts? I am happy to hop on a call with you at some point if you are interested in making this happen. I think this is a very useful library you have, and it deserves to be well-structured for expansion!
Hi @xaviernogueira,
Thank you again for your in-depth reply. I definitely agree that the way Pydantic is pulled in as a dependency is not ideal. I wanted to mimic the `parse_obj` method from their API by creating the `parse_df` method, but the result is indeed that the package is more like a *fork* than a *stand-alone package*. So I am definitely open to a refactor like the one you proposed.
The only thing is that adding a class like `DataFrameValidator` means an extra import, whereas the current approach needs only one import, since the validator is a subclass of `BaseModel`.
I would like to have a call and pick your brain on this, it is something I think we can look into as creating another package sounds cumbersome to me. Can you add me on LinkedIn?
Adding you! Managed to catch covid and feel terrible so let me get back to you in a few days. @wesselhuising
Work on this has restarted!
Note that for pandas you can have a `DataFrame`, `Series`, and `Index` plugin. I will implement `DataFrame` first, then `Series` assuming a string index. This is relevant if you want to use `pydantic` models to validate a specific row, or a column-oriented set of values (with the index as the names).
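For reference, the `Series` case could use the same registration hook as the `DataFrame` accessor (a minimal sketch; the class body is illustrative, not the merged implementation):

```python
import pandas as pd
import pydantic


@pd.api.extensions.register_series_accessor("pandantic")
class PandanticSeriesAccessor:
    def __init__(self, pandas_obj: pd.Series) -> None:
        self._series = pandas_obj

    def validate(self, schema: type[pydantic.BaseModel]) -> pd.Series:
        # Treat the string index as field names and the values as a single
        # record -- the row/column-oriented case described above.
        schema(**self._series.to_dict())
        return self._series
```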
This was implemented with PR #22! Closing.