Best way to integrate parsing logic
Question
What is the best way to integrate data cleaning/parsing logic with Pandera? I have included an example use case and my current solution below, but I'm looking for feedback on other approaches.
Scenario
Let's say I have a dataframe, like this:
```
name, phone_number
"user1", "+11231231234"
"user2", "(123)-123 1234"
"user3", 1231231234
```
I want to:
1. parse/process/clean the data
2. validate that my processing worked

I can do step 2 with Pandera by creating a schema with checks. For step 1 I would simply use pandas operations. However, if the two pieces of logic remain completely separate, I end up duplicating things like column names.
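For context, the fully separate version might look roughly like this (a hypothetical sketch; `clean_phone_numbers` and the regex check are just illustrative), where `"phone_number"` has to be spelled out in both the cleaning function and the schema:

```python
import pandas as pd
import pandera as pa

raw = pd.DataFrame(
    {
        "name": ["user1", "user2", "user3"],
        "phone_number": ["+11231231234", "(123)-123 1234", 1231231234],
    }
)


def clean_phone_numbers(df: pd.DataFrame) -> pd.DataFrame:
    # Step 1: strip everything except digits and a leading "+".
    df = df.copy()
    df["phone_number"] = (
        df["phone_number"].astype(str).str.replace(r"[^\d+]", "", regex=True)
    )
    return df


# Step 2: the schema repeats the same column names as the cleaning code.
schema = pa.DataFrameSchema(
    {
        "name": pa.Column(str),
        "phone_number": pa.Column(
            str, checks=pa.Check.str_matches(r"^\+?\d{10,11}$")
        ),
    }
)

validated = schema.validate(clean_phone_numbers(raw))
```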
What I ended up doing was creating a custom Column class that allowed me to store parser functions next to a column:
```python
from typing import Any, Callable, List, Optional

import pandas as pd
import pandera as pa


def default_parser(series: pd.Series) -> pd.Series:
    # No-op parser used when a column has nothing to clean.
    return series


class Column(pa.Column):
    def __init__(
        self,
        *args: Any,
        parsers: Optional[List[Callable[..., Any]]] = None,
        **kwargs: Any,
    ) -> None:
        self.parsers = parsers if parsers else [default_parser]
        super().__init__(*args, **kwargs)

    def parse(self, series: Any) -> pd.Series:
        # Apply each parser in order, feeding the output of one into the next.
        for column_parser in self.parsers:
            series = column_parser(series)
        return series
```
I then have the option to specify parsers for each column:
```python
...
"phone_number": Column(
    str, nullable=True, parsers=[parse_phone_numbers]
),
...
```
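For reference, `parse_phone_numbers` is just an ordinary series-in, series-out function; a minimal sketch (the exact cleaning rules are assumptions) could be:

```python
import pandas as pd


def parse_phone_numbers(series: pd.Series) -> pd.Series:
    # Hypothetical cleaner: normalise to a digits-only string (with an
    # optional leading "+"), dropping brackets, dashes and spaces.
    return series.astype(str).str.replace(r"[^\d+]", "", regex=True)
```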
I can then optionally call the parsers before running validation:
```python
for column in schema_columns:
    series: pd.Series = dataframe[column]
    dataframe[column] = schema_columns[column].parse(series)

schema.validate(dataframe)
```
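Here `schema_columns` and `schema` are assumed to be built from the custom `Column` class, roughly like this (names are illustrative):

```python
schema_columns = {
    "name": Column(str),
    "phone_number": Column(str, nullable=True, parsers=[parse_phone_numbers]),
}
# The subclass is still a pa.Column, so it plugs straight into a normal schema.
schema = pa.DataFrameSchema(schema_columns)
```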
I like this approach because it lets me store my expected schema right next to the operations required to reach that state, which has been especially useful when dealing with files with 50+ columns. Technically I could keep the processing logic separate and test it against mock data, but by integrating with Pandera I can validate against real data on every run, ensuring no edge cases are missed when a new, unexpected data format turns up (in a CSV, for example).
It is also essentially a more advanced/custom version of the `coerce` option, which effectively attaches a parser to each column, except that `coerce` only does simple type conversion rather than cleaning.
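As a minimal illustration of that difference (reusing the example data from above):

```python
import pandas as pd
import pandera as pa

raw = pd.DataFrame({"phone_number": ["+11231231234", "(123)-123 1234", 1231231234]})

# coerce=True converts the integer 1231231234 to the string "1231231234",
# but it leaves "(123)-123 1234" untouched - that still needs a custom parser.
coerce_only = pa.DataFrameSchema({"phone_number": pa.Column(str, coerce=True)})
coerced = coerce_only.validate(raw)
```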
Is there a better way of doing this? If not, is this something that would be considered for Pandera? I have seen mentions in previous issues that Pandera should be strictly for validation, not processing, but in this case Pandera isn't doing any processing itself; it just provides a place to plug in custom processing functions.