pymc-marketing
Pydantic for data validation?
At the end of https://github.com/pymc-labs/pymc-marketing/pull/498, we touched on a point that has been on my mind for a while now.
Shall we use pydantic for data validation?
I have worked with Pydantic on many projects, and I love it! It is super fast and actively maintained! See for example the data generation process in https://juanitorduz.github.io/multilevel_elasticities_single_sku/
This would provide a modern and elegant way to validate data (both input data and parameters). If we agree on doing it, I would be happy to kick off this initiative 😄 .
Is it common for model parameters to be incorrectly specified? If not, I think pydantic is overkill. It's great for data pipelines, but while pydantic can validate that an input is a pandas DataFrame, it can't validate the contents of that DataFrame. Same with a dict for model_config.
Actually, I think you can add custom checks across the fields (e.g. on the data frame). Look at the example I shared above; I have something like:
```python
import pandas as pd
from pydantic import BaseModel, Field, field_validator


class Region(BaseModel):
    id: int = Field(..., ge=0)
    stores: list[Store] = Field(..., min_items=1)  # Store is defined earlier in the post
    median_income: float = Field(..., gt=0)

    @field_validator("stores")
    def validate_store_ids(cls, value):
        if len({store.id for store in value}) != len(value):
            raise ValueError("stores must have unique ids")
        return value

    def to_dataframe(self) -> pd.DataFrame:
        df = pd.concat([store.to_dataframe() for store in self.stores], axis=0)
        df["region_id"] = self.id
        df["median_income"] = self.median_income
        return df.reset_index(drop=True)
```
Which is a custom check :)
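To make that concrete, here is a minimal self-contained sketch of the same idea (pydantic v2 style, with a simplified stand-in `Store` model, since the full one lives in the linked blog post): valid input constructs fine, while duplicate store ids are rejected by the custom validator.

```python
import pandas as pd
from pydantic import BaseModel, Field, ValidationError, field_validator


class Store(BaseModel):
    # Simplified stand-in for the Store model in the linked blog post.
    id: int = Field(..., ge=0)

    def to_dataframe(self) -> pd.DataFrame:
        return pd.DataFrame({"store_id": [self.id]})


class Region(BaseModel):
    id: int = Field(..., ge=0)
    stores: list[Store] = Field(..., min_length=1)
    median_income: float = Field(..., gt=0)

    @field_validator("stores")
    @classmethod
    def validate_store_ids(cls, value):
        # Cross-field content check: ids must be unique across stores.
        if len({store.id for store in value}) != len(value):
            raise ValueError("stores must have unique ids")
        return value


# Valid input passes.
region = Region(id=0, stores=[Store(id=0), Store(id=1)], median_income=50_000.0)

# Duplicate store ids are rejected by the custom validator.
try:
    Region(id=0, stores=[Store(id=0), Store(id=0)], median_income=50_000.0)
except ValidationError as err:
    print("rejected duplicate ids")
```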
I did look at it, and abandoned editing my previous post when you replied haha.
I've used pandera in the past for validating dataframes, but I feel it's too specialized to add as a library requirement. In general I'm in favor of keeping requirements to a minimum and not adding additional development overhead. Is this a significant enough problem that we should go ahead and address it?
On a related note, I created an issue to add a data validation utility method to the CLV module for users who provide their own RFM data, but I have other priorities at the moment.
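Such a utility wouldn't necessarily need a new dependency. A plain-pandas sketch could look like the following (the column names `frequency`, `recency`, and `monetary_value` are assumptions for illustration, not the CLV module's actual API):

```python
import pandas as pd


def validate_rfm_data(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of an RFM input check; column names are assumed."""
    required = {"frequency", "recency", "monetary_value"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing RFM columns: {sorted(missing)}")
    if (df["frequency"] < 0).any():
        raise ValueError("frequency must be non-negative")
    if (df["recency"] < 0).any():
        raise ValueError("recency must be non-negative")
    return df


rfm = pd.DataFrame(
    {"frequency": [1, 3], "recency": [10.0, 2.0], "monetary_value": [20.5, 35.0]}
)
validate_rfm_data(rfm)  # passes silently
```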
> In general I'm in favor of keeping requirements to a minimum.
I also agree with this in general.
I think pydantic is a widely popular library, so I don't think it's as bad as adding a very niche one. Still, it is a fair point.
I think the problem we want to solve is having a unified way to validate data and parameters. There is nothing wrong with how we are doing it now; it is more about a nicer API. Still, I do not have a very strong opinion. I will investigate more and see whether pydantic can bring us more benefits. I will also look into the data validation issue you mentioned.
Thanks for the feedback :)
Pandera is great; it is as actively developed as pydantic.