pandas_dtype should support built-in collection types like list, dict, set

Open cosmicBboy opened this issue 5 years ago • 3 comments

Is your feature request related to a problem? Please describe.

Pandas does not natively support a dtype representation for collections like lists, dicts, sets, and iterables. For these types the corresponding pandas data type is object. This can obscure the actual type of a column for users who rely on these types being present in a particular column.

Describe the solution you'd like

Extend the PandasDtype representation and support for list, dict, and set types.

  • abstract away the type-check handling currently in SeriesSchemaBase.validate
  • handle logic for checking the dtype for collection types, e.g. with `series.map(lambda x: isinstance(x, list))`
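
As a rough illustration of the second bullet, here is a minimal sketch of an element-wise check using plain pandas. The helper name `is_collection_column` is hypothetical and not part of pandera; it only shows that the object dtype alone cannot tell a column of lists from a column of mixed values.

```python
import pandas as pd


def is_collection_column(series: pd.Series, expected_type: type) -> bool:
    """Return True if every value in the series is an instance of expected_type.

    Hypothetical helper for illustration only; not a pandera API.
    """
    # Collections are stored under the generic object dtype, so each value
    # has to be inspected individually.
    return bool(series.map(lambda x: isinstance(x, expected_type)).all())


lists = pd.Series([[1, 2], [3], []])
mixed = pd.Series([[1, 2], "not a list", {"a": 1}])

# Both columns report the same dtype, but only one holds lists exclusively.
print(lists.dtype, is_collection_column(lists, list))
print(mixed.dtype, is_collection_column(mixed, list))
```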

It may be even nicer to support typing like:

  • List[int]
  • Dict[str, int]

And for pandera to verify types like this.
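
One possible way to verify annotations like these is to unpack them with `typing.get_origin` and `typing.get_args` from the standard library. The function `matches_type` below is a hypothetical sketch, not pandera's implementation, and it only handles list-like, set-like, and dict annotations.

```python
from typing import Any, Dict, List, get_args, get_origin


def matches_type(value: Any, annotation: Any) -> bool:
    """Check a single value against a (possibly parameterized) annotation.

    Hypothetical helper for illustration only; not a pandera API.
    """
    if annotation is Any:
        return True

    origin = get_origin(annotation)  # e.g. list for List[int], None for int
    args = get_args(annotation)      # e.g. (int,) for List[int]

    if origin is None:
        # Plain class such as int, str, or list.
        return isinstance(value, annotation)
    if not isinstance(value, origin):
        return False
    if not args:
        # Bare generic such as List: only the container type is checked.
        return True
    if origin in (list, set, frozenset):
        return all(matches_type(item, args[0]) for item in value)
    if origin is dict:
        key_type, value_type = args
        return all(
            matches_type(k, key_type) and matches_type(v, value_type)
            for k, v in value.items()
        )
    # Other generics: fall back to checking the container type only.
    return True


print(matches_type([1, 2, 3], List[int]))
print(matches_type({"a": "b"}, Dict[str, int]))
```

A check like this could be applied to a column with `series.map(lambda v: matches_type(v, List[int]))`. Note that `isinstance(True, int)` is true in Python, so booleans would pass an `int` check in this sketch.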

cosmicBboy avatar Aug 16 '20 14:08 cosmicBboy

Has this been in consideration?

jdvala avatar Feb 24 '21 09:02 jdvala

Yes @jdvala, it still needs to be prioritized in the release roadmap. It depends on #369, a revamp of the pandera typing system that should make this feature easier to implement.

cosmicBboy avatar Feb 24 '21 13:02 cosmicBboy

#369 has been resolved. What about this feature?

exitNA avatar Apr 21 '22 11:04 exitNA

Looks like https://github.com/unionai-oss/pandera/issues/369 is merged. What's the status of this?

anantzoid avatar Oct 20 '22 20:10 anantzoid

@anantzoid current status is help wanted. Open to contribution!

Basically would require:

  • creating new pandera datatypes (see here) that support:
    • lists: list, List[...]
    • dictionaries: dict, Dict[...]
    • etc.
  • adding unit tests for these

This would basically use object as the underlying pandas type, and use the logical data type system to check the actual values of the data_container to make sure the types are correct.
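
To make the two layers concrete, here is a rough sketch in plain pandas: first confirm the physical dtype is object, then check the values themselves. The class `ListOfInts` and its `check` method are hypothetical stand-ins and do not use or reproduce pandera's actual datatype interface.

```python
import pandas as pd


class ListOfInts:
    """Hypothetical logical data type: object storage, list-of-int values."""

    # Underlying pandas type for any column of Python collections.
    physical_dtype = object

    def check(self, data_container: pd.Series) -> pd.Series:
        """Return a boolean Series marking which values pass the check."""
        # Step 1: physical check. Anything not stored as object fails outright.
        if data_container.dtype != self.physical_dtype:
            return pd.Series(False, index=data_container.index)
        # Step 2: logical check on the actual values.
        return data_container.map(
            lambda v: isinstance(v, list) and all(isinstance(i, int) for i in v)
        )


good = pd.Series([[1, 2], [3]])
bad = pd.Series([[1, 2], ["x"], "abc"])

print(ListOfInts().check(good).tolist())
print(ListOfInts().check(bad).tolist())
```

Returning a per-value boolean Series, rather than a single bool, leaves room to report which rows failed.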

cosmicBboy avatar Oct 24 '22 20:10 cosmicBboy

Note: based on this thread we also want the pandera datatype system to handle unhashable types (sets, lists)
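
For context on why unhashable types need special handling: lists, sets, and dicts raise `TypeError` when hashed, so any step that relies on hashing values (building a Python set, for example) fails on them. The sketch below shows one possible workaround, converting values to hashable equivalents first. The function `to_hashable` is hypothetical and is not taken from the referenced thread or from pandera.

```python
from typing import Any, Hashable


def to_hashable(value: Any) -> Hashable:
    """Recursively convert common collections to hashable equivalents.

    Hypothetical helper for illustration only; not a pandera API.
    Note that a list and a tuple with the same items map to the same result.
    """
    if isinstance(value, dict):
        # A frozenset of (key, value) pairs avoids having to sort mixed keys.
        return frozenset((k, to_hashable(v)) for k, v in value.items())
    if isinstance(value, (set, frozenset)):
        return frozenset(to_hashable(v) for v in value)
    if isinstance(value, (list, tuple)):
        return tuple(to_hashable(v) for v in value)
    return value


values = [[1, 2], [1, 2], [3], {"a": [4, 5]}]

try:
    set(values)  # fails: lists and dicts are unhashable
except TypeError as exc:
    print("unhashable:", exc)

# After conversion the values can be deduplicated or used as dict keys.
distinct = {to_hashable(v) for v in values}
print(len(distinct))
```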

cosmicBboy avatar Nov 04 '22 14:11 cosmicBboy