pandera
pandas_dtype should support built-in collection types like list, dict, set
Is your feature request related to a problem? Please describe.
Pandas does not natively support a dtype representation for collections like lists, dicts, sets, and iterables. For these types the corresponding pandas data type is object. This may obfuscate the actual type of a column for users that rely on these types being present in a particular column.
Describe the solution you'd like
Extend the PandasDtype representation to support list, dict, and set types.
- abstract away the type-check handling currently in `SeriesSchemaBase.validate`
- handle logic for checking the dtype for collection types, e.g. with `series.map(lambda x: isinstance(x, list))`
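A minimal sketch of the element-wise check suggested above, using plain pandas rather than any pandera API. The helper name `all_instances_of` is hypothetical and only illustrates the idea.

```python
import pandas as pd


def all_instances_of(series: pd.Series, expected_type: type) -> bool:
    """Return True if every element of `series` is an instance of `expected_type`.

    Hypothetical helper for illustration; not part of pandera.
    """
    # Collection columns are stored with the generic `object` dtype, so the
    # dtype alone cannot tell a column of lists from a column of dicts.
    return bool(series.map(lambda x: isinstance(x, expected_type)).all())


lists = pd.Series([[1, 2], [3], []])
mixed = pd.Series([[1, 2], {"a": 1}, {3}])

print(lists.dtype)                    # object
print(all_instances_of(lists, list))  # True
print(all_instances_of(mixed, list))  # False
```

Both series report `object` as their dtype, so only the per-element check distinguishes them.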
It may be even nicer to support typing like:
- List[int]
- Dict[str, int]
And for pandera to verify types like this.
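One way such parametrized annotations could be verified is with `typing.get_origin` and `typing.get_args` from the standard library. The sketch below is not pandera's implementation. `matches` and `series_matches` are hypothetical names, and only `list`/`List[...]` and `dict`/`Dict[...]` are handled.

```python
from typing import Any, Dict, List, get_args, get_origin

import pandas as pd


def matches(value: Any, annotation: Any) -> bool:
    """Check `value` against a plain class, `List[T]`, or `Dict[K, V]`.

    Hypothetical helper: it does not handle `Any`, `Union`, or other
    typing constructs.
    """
    origin = get_origin(annotation)
    if origin is None:
        # Plain class such as `int`, `str`, `list`, or `dict`.
        return isinstance(value, annotation)
    if not isinstance(value, origin):
        return False
    args = get_args(annotation)
    if not args:
        # Unparametrized generic: the container check above is enough.
        return True
    if origin is list:
        return all(matches(item, args[0]) for item in value)
    if origin is dict:
        key_type, value_type = args
        return all(
            matches(k, key_type) and matches(v, value_type)
            for k, v in value.items()
        )
    raise NotImplementedError(f"unsupported annotation: {annotation!r}")


def series_matches(series: pd.Series, annotation: Any) -> bool:
    """Apply `matches` to every element of an object-dtype series."""
    return bool(series.map(lambda x: matches(x, annotation)).all())


print(series_matches(pd.Series([[1, 2], [3]]), List[int]))      # True
print(series_matches(pd.Series([[1, "x"]]), List[int]))         # False
print(series_matches(pd.Series([{"a": 1}]), Dict[str, int]))    # True
```

Note that `isinstance(True, int)` is true in Python, so a stricter check would need to treat `bool` separately.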
Has this been in consideration?
Yes @jdvala, it still needs to be prioritized in the release roadmap. It depends on #369, a revamp of the pandera typing system, which should make this feature easier to implement.
#369 has been solved. What about this feature?
Looks like https://github.com/unionai-oss/pandera/issues/369 is merged. What's the status of this?
@anantzoid current status is help wanted. Open to contribution!
Basically would require:
- creating new pandera datatypes (see here) that support:
  - lists: list, List[...]
  - dictionaries: dict, Dict[...]
  - etc.
- adding unit tests for these
This would basically use object as the underlying pandas type, and use the logical data type system to check the actual values of the data_container to make sure the types are correct.
Note: based on this thread we also want the pandera datatype system to handle unhashable types (sets, lists)
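The referenced thread is not reproduced here, so the following only illustrates the general constraint: lists and sets cannot be hashed, while an element-wise type check does not need hashing at all. `is_hashable` is a hypothetical helper.

```python
import pandas as pd


def is_hashable(value) -> bool:
    """Return True if `hash(value)` succeeds (hypothetical helper)."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


series = pd.Series([[1, 2], [1, 2], {3}])

# Neither lists nor sets are hashable, so any logic that puts the values
# into a set or uses them as dict keys would fail on this column.
print(series.map(is_hashable).any())  # False

# An element-wise isinstance check never hashes the values.
print(series.map(lambda x: isinstance(x, (list, set))).all())  # True
```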