pandera
Column metadata
Enjoying the Pandera package a lot!
I was wondering if a possibility for column metadata could be added?
A clear example would be validating units of measurement. For instance, in a transformation pipeline you might merge two dataframes and sum two distance columns (one from each dataframe, summed after the merge). It would be good if Pandera automatically checked, via the metadata, that the units of these columns match, and raised an error if they do not.
There are probably more situations where metadata comparison could be fruitful, but this one would be quite handy with scientific research.
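For illustration, a manual version of such a check in plain pandas might look like the sketch below. The column names, the `units_a`/`units_b` dictionaries and the `sum_with_unit_check` helper are all hypothetical; nothing here is an existing pandera feature.

```python
import pandas as pd

# Hypothetical per-dataframe unit metadata, kept in plain dicts next to the data.
trips_a = pd.DataFrame({"trip_id": [1, 2], "distance_a": [1.5, 3.0]})
units_a = {"distance_a": "km"}

trips_b = pd.DataFrame({"trip_id": [1, 2], "distance_b": [2.0, 4.5]})
units_b = {"distance_b": "km"}


def sum_with_unit_check(df, left, right, units):
    """Sum two columns, raising if their recorded units differ."""
    if units[left] != units[right]:
        raise ValueError(
            f"unit mismatch: {left!r} is in {units[left]!r}, "
            f"{right!r} is in {units[right]!r}"
        )
    return df[left] + df[right]


merged = trips_a.merge(trips_b, on="trip_id")
merged_units = {**units_a, **units_b}
merged["total_distance"] = sum_with_unit_check(
    merged, "distance_a", "distance_b", merged_units
)
```

If one of the two dictionaries recorded `"mi"` instead of `"km"`, the helper would raise a `ValueError` before any summing happens.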
hi @TheDataScientistNL at a high level this makes sense! To make things more concrete, would you be able to provide a toy dataset, and perhaps the checks you want to run (written manually with pandas/python)? This'll help me understand how a potential new feature would fit into pandera.
Enjoying the Pandera package a lot!
And thank you 😀
Hi - I was about to log the same issue. In my case I want to add a 'version' tag to each column. New columns are added to this dataset over time (they come from external partners, which can be unpredictable), so being able to access the tags during data processing would add a lot of flexibility.
So, in " columnname: Column(Float), ", could the user optionally replace the column with a dictionary, in which case Pandera would always read the dict entry labelled 'Column' or 'column' to access pa.Column?
The user could then optionally extend it with 'version' dictionary entries, TheDataScientistNL's units use case, etc.
Over time, common use cases could be supported more generally, but this would provide the means to extend as required.
Thanks heaps!!
@cosmicBboy Pardon the delay.
Herewith an example.
Suppose we have the following dataset:
` import pandas as pd
df = pd.DataFrame({'person': ['John Doe', 'Rick Ashley'], 'weight': [205.2, 150.6], 'date': ['2021-01-01', '2021-02-01']}) `
In the above dataframe, the column 'weight' contains the weight of the person in units of 'lbs'. It would be nice to have a schema in which I could explicitly indicate the column's unit, and allow for conversion to e.g. SI units.
So we would get something like
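from datetime import date

import pandera as pa
from pandera.typing import Series, DataFrame

class PersonWeight(pa.SchemaModel):
    person: Series[str]
    date: Series[date] = pa.Field(coerce=True)
    # convert=True is the proposed (not yet existing) unit-conversion flag
    weight: Series[float] = pa.Field(coerce=True, convert=True)

    class Units:
        # proposed syntax: (input unit, output unit)
        weight: ('lbs', 'kg')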
Now, I don't know if this is the best way to approach this problem, but at least it shows my wish. We see from the PersonWeight class that the unit of column 'weight' is initially 'lbs', and that for the output we want a weight unit of 'kg'. The 'convert=True' in the Field indicates that I want column unit conversion, as defined in the subclass Units.
Such conversion could easily be added. One could create a base class for UnitConversion within pandera that takes care of the conversion, provide a few implementations that are often used in science (for the dimensions weight, time, distance, speed, etc.), and allow users to create their own if very specific units are used. What is nice is that the unit is now defined explicitly. And if no conversion is required, we could simply write the following in the Units subclass:
weight: ('lbs', 'lbs')
Perhaps there is a more elegant way to approach this, but I hope the wish is more clear now.
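To make the UnitConversion idea a bit more concrete, here is a minimal sketch in plain Python. The class names, the `factors` attribute and the `convert` method are hypothetical and not part of pandera; each subclass simply stores unit sizes relative to one base unit.

```python
class UnitConversion:
    """Hypothetical base class: converts values within one dimension."""

    # Maps each unit name to its size expressed in the dimension's base unit.
    factors = {}

    @classmethod
    def convert(cls, value, source, target):
        """Convert value from the source unit to the target unit."""
        try:
            return value * cls.factors[source] / cls.factors[target]
        except KeyError as exc:
            raise ValueError(
                f"unknown unit for {cls.__name__}: {exc.args[0]!r}"
            ) from exc


class Weight(UnitConversion):
    # Base unit: kilogram.
    factors = {"kg": 1.0, "g": 0.001, "lbs": 0.45359237}


class Distance(UnitConversion):
    # Base unit: metre.
    factors = {"m": 1.0, "km": 1000.0, "mi": 1609.344}
```

Because `convert` only multiplies and divides, something like `Weight.convert(df['weight'], 'lbs', 'kg')` should also work on a whole pandas column, and users with very specific units would add their own subclass.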
UPDATE: I see now that there already exists a package for adding units to pandas columns, and conversion. Would it be possible to integrate pandera with pint-pandas? See https://github.com/hgrecco/pint-pandas
I'm subscribing to this topic since I'm about to tackle this exact issue.
TL;DR: following the frictionless data table schema proposal, a "unit" attribute in the Field would be much appreciated.
class PersonWeight(pa.SchemaModel):
    person: Series[str]
    date: Series[date] = pa.Field(coerce=True)
    weight: Series[float] = pa.Field(coerce=True, convert=True, unit="kg")
or a more general "metadata" attribute, like the dataclasses.field metadata argument:
class PersonWeight(pa.SchemaModel):
    person: Series[str]
    date: Series[date] = pa.Field(coerce=True)
    weight: Series[float] = pa.Field(coerce=True, convert=True, metadata={"unit": "kg"})
Long version:
My use case would be to access that metadata to perform the unit conversion with pint after schema.validate().
This is because pint-pandas is not production-ready and has many issues when trying to manipulate pintarray dtypes.
I already have a function that imports Excel tables with the unit of measure in the second row, gets the multiplier needed to transform those units into the target units (used implicitly in the rest of the code), and applies that multiplier. This allows me to have quantity validation at the public-facing level while avoiding pint-related overhead in the private functions.
The current configuration is provided via a dictionary that contains the target units, along with other things that can already be replaced by the DataFrameSchema/Model.
Having the ability to add the units as a Field parameter would allow me to drop my custom configuration schema and go all in with pandera objects.
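For context, the import step described above could be sketched roughly as follows. This is a simplified stand-in, not my actual function: `TARGET_UNITS`, `MULTIPLIERS` and `apply_unit_row` are hypothetical names, the multipliers are hard-coded instead of coming from pint, and a small in-memory dataframe replaces the Excel sheet.

```python
import pandas as pd

# Hypothetical target units that the rest of the code assumes implicitly.
TARGET_UNITS = {"weight": "kg", "height": "m"}
# Hypothetical hard-coded multipliers; a real version would ask pint for these.
MULTIPLIERS = {("lbs", "kg"): 0.45359237, ("cm", "m"): 0.01}


def apply_unit_row(raw, target_units):
    """Split off the leading units row and rescale columns to target units."""
    units = raw.iloc[0]
    data = raw.iloc[1:].reset_index(drop=True)
    for column, target in target_units.items():
        source = units[column]
        factor = 1.0 if source == target else MULTIPLIERS[(source, target)]
        data[column] = data[column].astype(float) * factor
    return data


# Stand-in for a sheet read from Excel: the first data row holds the units.
raw = pd.DataFrame(
    {
        "person": ["-", "John Doe", "Rick Ashley"],
        "weight": ["lbs", 205.2, 150.6],
        "height": ["cm", 180, 170],
    }
)
converted = apply_unit_row(raw, TARGET_UNITS)
```

With a unit stored on each Field, `TARGET_UNITS` would no longer need to be a separate dictionary and could be read from the schema instead.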