Add special case to check values of `str` column

Open cosmicBboy opened this issue 2 years ago • 0 comments

Describe the bug

FYI @jeffzi @dineshkumar-23

The bug is clearly described here. Basically, since `str` dtype arrays are translated to numpy object arrays, any object can exist within such a column and still pass validation.
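A minimal sketch of the underlying pandas behavior, using only pandas and numpy. The `dtype_only_check` function is a hypothetical stand-in for any validator that compares dtypes without inspecting values; it is not pandera's actual implementation.

```python
import numpy as np
import pandas as pd

# Mixed Python objects in one column; pandas backs it with a numpy object array.
mixed = pd.Series(["a", 1, {"k": "v"}], dtype=object)


def dtype_only_check(series: pd.Series) -> bool:
    """Hypothetical validator that looks at the dtype and nothing else."""
    return bool(series.dtype == np.dtype("O"))


# The dtype comparison succeeds even though two of the three values are not strings.
passes = dtype_only_check(mixed)

# Inspecting the elements shows what the dtype alone hides.
element_types = [type(value).__name__ for value in mixed]
```

Here `passes` is `True` while `element_types` contains `int` and `dict` alongside `str`, which is the gap this issue is about.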

`pandas.StringDtype` has been available since pandas 1.0, but I think it's still important to special-case this type because (i) many users may not be aware of it and (ii) I think pandera should start getting into the business of correcting some of pandas' quirks, especially when it comes to the type system.

The special-casing should be implemented at the `DataType` definition (i.e. `pandera.engines.numpy_engine.String`) after we have an API for logical data types (https://github.com/pandera-dev/pandera/pull/798).

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of pandera.
  • [X] (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

See https://github.com/pandera-dev/pandera/discussions/807

Expected behavior

Failure cases of non-string objects in a numpy object array (aka a string column) should be correctly reported.
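As a rough illustration of what such a report could look like, here is a pandas-only sketch. The helper name `non_string_failure_cases` and the `index` / `failure_case` column names are hypothetical and chosen for illustration. Treating missing values as valid is also an assumption, and a real implementation would likely tie that to the column's nullability.

```python
import pandas as pd


def non_string_failure_cases(series: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: list values that are neither str nor missing."""
    # Element-wise isinstance check; missing values are allowed (assumption).
    is_valid = series.map(lambda value: isinstance(value, str)) | series.isna()
    failures = series[~is_valid]
    return pd.DataFrame(
        {"index": failures.index, "failure_case": failures.to_numpy()}
    )


column = pd.Series(["a", 1, None, 2.5], dtype=object)
report = non_string_failure_cases(column)
```

For the sample column, `report` has one row per offending value (the `1` at position 1 and the `2.5` at position 3), while the string and the missing value are left out.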

cosmicBboy, Mar 29 '22 12:03