
Pydantic for data validation?

Open juanitorduz opened this issue 1 year ago • 5 comments

At the end of https://github.com/pymc-labs/pymc-marketing/pull/498, we touched on a point that has been on my mind for a while now.

Shall we use pydantic for data validation? I have worked with Pydantic on many projects, and I love it! It is super fast and actively maintained! See, for example, the data generation process in https://juanitorduz.github.io/multilevel_elasticities_single_sku/. This would provide a modern and elegant way to validate data (input data and parameters). If we agree on doing it, I would be happy to kick off this initiative 😄.
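
For model parameters, that could look something like this minimal sketch (the class and field names are hypothetical, not an existing pymc-marketing API):

from pydantic import BaseModel, Field


class AdstockParams(BaseModel):
    l_max: int = Field(..., gt=0)          # hypothetical: maximum lag, in periods
    alpha: float = Field(..., ge=0, le=1)  # hypothetical: decay rate


AdstockParams(l_max=12, alpha=0.5)  # ok
AdstockParams(l_max=0, alpha=1.5)   # raises ValidationError reporting both violations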

juanitorduz avatar Jan 26 '24 13:01 juanitorduz

Is it common for model parameters to be incorrectly specified? If not, I think pydantic is overkill. It's great for data pipelines, but although pydantic can validate whether an input is a pandas DataFrame, it can't validate the contents of that dataframe. Same with a dict for model_config.
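
For illustration, a minimal sketch of the out-of-the-box behavior (the model and field names are made up):

import pandas as pd
from pydantic import BaseModel, ConfigDict


class ModelInput(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)
    data: pd.DataFrame


ModelInput(data=pd.DataFrame())  # passes: it is a DataFrame, however empty or malformed
ModelInput(data="not a frame")   # raises ValidationError on the type alone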

ColtAllen avatar Jan 26 '24 18:01 ColtAllen

Actually, I think you can add custom checks across fields (e.g., on the data frame). In the example I shared above, I have something like:

import pandas as pd
from pydantic import BaseModel, Field, field_validator


class Store(BaseModel):
    # minimal stand-in for the Store model defined in the linked post
    id: int = Field(..., ge=0)
    sales: list[float] = Field(..., min_length=1)

    def to_dataframe(self) -> pd.DataFrame:
        return pd.DataFrame({"store_id": self.id, "sales": self.sales})


class Region(BaseModel):
    id: int = Field(..., ge=0)
    stores: list[Store] = Field(..., min_length=1)  # min_items in pydantic v1
    median_income: float = Field(..., gt=0)

    @field_validator("stores")
    @classmethod
    def validate_store_ids(cls, value):
        # cross-item check: store ids must be unique within a region
        if len({store.id for store in value}) != len(value):
            raise ValueError("stores must have unique ids")
        return value

    def to_dataframe(self) -> pd.DataFrame:
        df = pd.concat([store.to_dataframe() for store in self.stores], axis=0)
        df["region_id"] = self.id
        df["median_income"] = self.median_income
        return df.reset_index(drop=True)

Which is a custom check :)
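
For instance, with made-up values, constructing a region with duplicate store ids fails at instantiation:

stores = [Store(id=0, sales=[1.0]), Store(id=0, sales=[2.0])]
Region(id=1, stores=stores, median_income=50_000.0)  # raises ValidationError: stores must have unique ids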

juanitorduz avatar Jan 26 '24 18:01 juanitorduz

I did look at it, and abandoned editing my previous post when you replied haha.

I've used pandera in the past for validating dataframes, but feel it's too specialized to add as a library requirement. In general, I'm in favor of keeping requirements to a minimum and not adding additional development overhead, unless this is a significant problem we should go ahead and address.

On a related note, I created an issue to add a data validation utility method to the CLV module for users who provide their own RFM data, but I have other priorities at the moment.
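
For reference, a rough sketch of what such a utility could look like with pandera (the column names follow the usual RFM convention; this is not an existing pymc-marketing utility):

import pandera as pa

rfm_schema = pa.DataFrameSchema(
    {
        "customer_id": pa.Column(int, unique=True),
        "frequency": pa.Column(int, pa.Check.ge(0)),
        "recency": pa.Column(float, pa.Check.ge(0)),
        "T": pa.Column(float, pa.Check.ge(0)),
        "monetary_value": pa.Column(float, pa.Check.ge(0)),
    }
)

validated = rfm_schema.validate(rfm_df)  # raises SchemaError if any check fails; rfm_df is the user's DataFrame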

ColtAllen avatar Jan 26 '24 18:01 ColtAllen

> In general, I'm in favor of keeping requirements to a minimum.

I also agree with this in general.

I think pydantic is a widely popular library, so adding it is not as bad as adding a very niche one. Still, it is a fair point.

I think the problem we want to solve is having a unified way to validate data and parameters. There is nothing wrong with how we are doing it now; it is more about a nicer API. Still, I do not have a very strong opinion. I will investigate more and see if pydantic can bring us more benefits. I will also look into the data validation issue you mentioned.
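
For example, pydantic can also validate function and method parameters directly; a minimal sketch (the function is illustrative, not a real pymc-marketing method):

from typing import Annotated
from pydantic import Field, validate_call


@validate_call
def set_adstock(
    l_max: Annotated[int, Field(gt=0)],
    alpha: Annotated[float, Field(ge=0, le=1)],
) -> None:
    ...  # hypothetical setter, for illustration only


set_adstock(l_max=12, alpha=0.3)   # ok
set_adstock(l_max=-1, alpha=0.3)   # raises ValidationError before the body runs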

Thanks for the feedback :)

juanitorduz avatar Jan 26 '24 18:01 juanitorduz

Pandera is great; it is as actively developed as pydantic.

ferrine avatar Jan 28 '24 06:01 ferrine