
Best way to integrate parsing logic

pwithams opened this issue 3 years ago · 0 comments

Question

What is the best way to integrate data cleaning/parsing logic with pandera? I have included an example use case and my current solution below, but I am looking for feedback on other approaches as well.

Scenario

Let's say I have a dataframe like this:

name, phone_number
"user1", "+11231231234"
"user2", "(123)-123 1234"
"user3", 1231231234

I want to:

  1. parse/process/clean the data
  2. validate that my processing worked

I can do #2 using pandera by creating a schema with checks. To do #1 I would simply use pandas operations. However, if the two remain totally separate, I end up duplicating things like column names, as in the sketch below.
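For illustration, here is roughly what the separate approach looks like. parse_phone_numbers is a hypothetical cleaning helper and the check pattern is just an example; note that "phone_number" has to be spelled out in both the cleaning code and the schema:

import pandas as pd
import pandera as pa

# Hypothetical step 1: strip everything except digits and a leading "+".
def parse_phone_numbers(series: pd.Series) -> pd.Series:
    return series.astype(str).str.replace(r"[^\d+]", "", regex=True)

# Step 2: a plain pandera schema with checks.
schema = pa.DataFrameSchema(
    {
        "phone_number": pa.Column(
            str, checks=pa.Check.str_matches(r"^\+?\d{10,11}$")
        ),
    }
)

dataframe["phone_number"] = parse_phone_numbers(dataframe["phone_number"])
schema.validate(dataframe)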

What I ended up doing was creating a custom Column class that allowed me to store parser functions next to a column:

from typing import Any, Callable, List, Optional

import pandas as pd
import pandera as pa


def default_parser(series: pd.Series) -> pd.Series:
    return series


class Column(pa.Column):
    def __init__(
        self,
        *args: Any,
        parsers: Optional[List[Callable[..., Any]]] = None,
        **kwargs: Any,
    ) -> None:
        # Fall back to the identity parser when none are supplied
        self.parsers = parsers if parsers else [default_parser]
        super().__init__(*args, **kwargs)

    def parse(self, series: pd.Series) -> pd.Series:
        # Apply each parser in order, feeding the output of one
        # into the next
        for column_parser in self.parsers:
            series = column_parser(series)
        return series

I then have the option to specify parsers for each column:

...
"phone_number": Column(
    str, nullable=True, parsers=[parse_phone_numbers]
),
...
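For context, the full schema might look something like this (again using the hypothetical parse_phone_numbers helper from above); schema.columns then provides the schema_columns mapping used in the next step:

schema = pa.DataFrameSchema(
    {
        "name": Column(str),
        "phone_number": Column(
            str, nullable=True, parsers=[parse_phone_numbers]
        ),
    }
)
schema_columns = schema.columns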

I can then optionally call the parsers before running validation:

for column in schema_columns:
    series: pd.Series = dataframe[column]
    dataframe[column] = schema_columns[column].parse(series)
schema.validate(dataframe)
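To avoid repeating that loop at every call site, it could be wrapped in a small helper (a sketch, not part of the class above):

def parse_and_validate(
    dataframe: pd.DataFrame, schema: pa.DataFrameSchema
) -> pd.DataFrame:
    # Run each column's parsers in place, then validate the result.
    for name, column in schema.columns.items():
        if isinstance(column, Column):
            dataframe[name] = column.parse(dataframe[name])
    return schema.validate(dataframe)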

I like this approach because it lets me store the expected schema next to the operations required to reach that state, which has been especially useful when dealing with files with 50+ columns. Technically I could keep the processing logic separate and test it on mock data, but by integrating with pandera I can validate against real data each time, ensuring no edge cases are missed when a new, unexpected data format shows up (from a CSV, for example).

It is also essentially a more advanced/custom version of the coerce option, which in effect attaches a parser to each column, except that it only does simple type conversion rather than arbitrary cleaning.
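For comparison, a minimal sketch of the coerce behavior being referenced:

# With coerce=True, pandera converts the column to the target dtype
# before running checks, e.g. 1231231234 -> "1231231234".
coercing_schema = pa.DataFrameSchema(
    {"phone_number": pa.Column(str, coerce=True)}
)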

Is there a better way of doing this? If not, is this something that would be considered for pandera? I have seen mentions in previous issues that pandera should be strictly for validation, not processing, but in this case pandera isn't doing any processing itself; it just provides a place to plug in custom processing functions.

pwithams · Aug 31 '22 02:08