Add support for logical data types

Open jeffzi opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe.

One limitation of the current DataType.check() method is that it only validates the data type, without access to the data itself. I ran into this when implementing the decimal and date dtypes, which is why I've been holding back my PR.

I'm quoting the documentation of the visions lib (a data type framework):

Physical types represent the actual, underlying representation of the data. Logical types represent the abstracted understanding of that data.

According to those definitions, pandera DataTypes represent physical types, i.e. have a 1-1 relationship with pandas dtypes.

When I designed DataType, my position was that Check should be used if data is necessary to validate. However, I now think there are cases where it makes sense for DataType.check to receive data as well:

  1. Dtypes unofficially supported by pandas: date, decimal. Those are understood by PyArrow, which is especially useful when writing to Parquet. In theory, an object column can contain values of any type, so we'd need to look at the data to validate. Another example is the new PydanticModel introduced in #779.
  2. Logical types: IP addresses, URLs, paths, etc. Technically they can be emulated by a function that returns a pandera.Column, but I find that unintuitive, and it would be awkward to introduce new column types in the public API. Moreover, coercion is not available with this technique, although I don't have a concrete use case to present. For example:
from typing import Any

import pandera as pa

def IPColumn(**kwargs: Any) -> pa.Column:
  """Re-usable helper to check columns containing IP addresses."""
  # Use .get() so a missing "checks" argument does not raise KeyError.
  checks = kwargs.get("checks") or []
  checks.append(pa.Check(is_valid_ip))  # assume we have an is_valid_ip function
  kwargs["checks"] = checks
  return pa.Column(pa.STRING, **kwargs)
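
The helper above assumes an is_valid_ip function exists. A minimal sketch of what it could look like, using only the standard library's ipaddress module, is shown here. It is written over a plain iterable of values and returns one boolean per value; adapting it to accept and return a pandas Series is left out and would be an assumption about how the check gets wired up.

```python
import ipaddress
from typing import Iterable, List


def is_valid_ip(values: Iterable[object]) -> List[bool]:
    """Return one boolean per value: True if it is a string holding an IPv4/IPv6 address."""

    def _ok(value: object) -> bool:
        # Only strings are accepted; ipaddress would otherwise also accept integers.
        if not isinstance(value, str):
            return False
        try:
            ipaddress.ip_address(value)
        except ValueError:
            return False
        return True

    return [_ok(value) for value in values]
```

The string-only restriction is a design choice of this sketch, not something the issue specifies.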

Describe the solution you'd like

We could add an optional data_container argument to DataType.check: def check(self, pandera_dtype: DataType, data_container: Optional[Any] = None) -> bool. Schema validation always has access to the column data, so it would be easy to pass it to check and let the DataType use it if needed. Existing data types would simply ignore the new argument.
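
To illustrate the proposal, here is a rough sketch using a decimal column stored as Python objects. The DataType, Object, and DecimalType classes are hypothetical stand-ins written for this example; they are not pandera's actual classes, and only the check signature is taken from the proposal.

```python
from decimal import Decimal
from typing import Any, Optional


class DataType:
    """Hypothetical stand-in for pandera's DataType, reduced to check()."""

    def check(self, pandera_dtype: "DataType", data_container: Optional[Any] = None) -> bool:
        # Physical check only: compares the types and ignores the data.
        return type(self) is type(pandera_dtype)


class Object(DataType):
    """Physical type: a column that can hold arbitrary Python objects."""


class DecimalType(Object):
    """Logical type: an object column whose values are all decimal.Decimal."""

    def check(self, pandera_dtype: DataType, data_container: Optional[Any] = None) -> bool:
        # The physical representation must be an object column.
        if not isinstance(pandera_dtype, Object):
            return False
        # Assumption of this sketch: without data, only the physical type is verified.
        if data_container is None:
            return True
        # With data, every value has to be a Decimal instance.
        return all(isinstance(value, Decimal) for value in data_container)


# The same object-typed column passes or fails depending on its contents.
print(DecimalType().check(Object(), [Decimal("1.5"), Decimal("2")]))
print(DecimalType().check(Object(), [Decimal("1.5"), 2.0]))
```

Types that do not need the data, such as Object here, accept the extra argument and ignore it, which is what keeps the change backward compatible.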

jeffzi, Mar 09 '22 23:03

This use case makes sense, ping me when you need a review 👀!

cosmicBboy, Mar 11 '22 18:03

Will do!

jeffzi, Mar 11 '22 19:03