
pa.infer_schema throws TypeError when column is null and dtype is object

Open will-gp opened this issue 3 years ago • 3 comments

Describe the bug

pa.infer_schema(df) throws a TypeError when a given column contains all nulls and the column dtype is object. This occurs within _get_array_type in schema_statistics.py. Specifically, if a column contains all null values, the call to infer_dtype on line 183 will return 'empty'. The subsequent call to pandas_engine.Engine.dtype(inferred_alias) will throw a TypeError. Note: This only occurs when the original column dtype is object.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/USER/miniconda3/envs/piggy/lib/python3.7/site-packages/pandera/schema_statistics.py", line 184, in _get_array_type
    data_type = pandas_engine.Engine.dtype(inferred_alias)
  File "/home/USER/miniconda3/envs/piggy/lib/python3.7/site-packages/pandera/engines/pandas_engine.py", line 147, in dtype
    np_or_pd_dtype = pd.api.types.pandas_dtype(data_type)
  File "/home/USER/miniconda3/envs/piggy/lib/python3.7/site-packages/pandas/core/dtypes/common.py", line 1776, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: data type 'empty' not understood
  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of pandera.
  • [x] (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
import pandera as pa
from pandera import schema_statistics

a = pd.Series([np.nan, np.nan])
# Change type to object
a = a.astype('object')

# This will throw a TypeError (_get_array_type is the function named in the traceback)
stats = schema_statistics._get_array_type(a)

Expected behavior

Given that infer_schema should make its best guess as to the schema, would it not make sense for it to do so, instead of throwing a TypeError?

We can actually use the null information during schema inference time to provide additional information. At the moment, null values are ignored during inference, as per line 183. However, the following calls to infer_dtype would produce a dtype estimate if skipna=False (and the original column is of dtype object):

[NaT, NaT, ..., NaT] --> would be inferred as datetime
[np.nan, np.nan, ..., np.nan] --> would be inferred as floating point
[None,..., None] --> would be inferred as mixed

I understand that this would change the expected behaviour of the schema inference function. It may make sense to have a second call to infer_dtype with skipna=False if the initial call returns 'empty'. Happy to discuss this as well as take this on once decided.
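To make the difference concrete, here is a minimal sketch that calls pandas' infer_dtype directly on the all-NaN object column from the report, once with each skipna setting, and then tries to convert the skipna=True result into a dtype. It bypasses pandera entirely, so it only illustrates the pandas behaviour described above; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype, pandas_dtype

# An all-null column stored with object dtype, as in the report.
all_nan = pd.Series([np.nan, np.nan], dtype="object")

# skipna=True ignores the nulls, so nothing is left to infer from.
skipped = infer_dtype(all_nan, skipna=True)
# skipna=False lets the NaN values take part in the inference.
kept = infer_dtype(all_nan, skipna=False)
print(skipped, kept)

# The skipna=True result is a label, not a dtype alias, so
# converting it to a dtype raises the TypeError from the traceback.
try:
    pandas_dtype(skipped)
    conversion_error = None
except TypeError as exc:
    conversion_error = exc
print(conversion_error)
```

With skipna=False the same column yields a label that maps to a real dtype, which is why the conversion step would no longer fail for this case.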

Desktop (please complete the following information):

  • OS: Debian 10 (Buster)
  • Python 3.7.2

will-gp avatar Apr 18 '22 18:04 will-gp

Hi @will-gp thanks for the bug report, I do think the use case you describe should be supported.

However, the following calls to infer_dtype would produce a dtype estimate if skipna=False (and the original column is of dtype object)

+1 to skipna=False

Feel free to make changes to the schema_statistics module and add some unit tests for these cases:

[NaT, NaT, ..., NaT] --> would be inferred as datetime
[np.nan, np.nan, ..., np.nan] --> would be inferred as floating point
[None,..., None] --> would be inferred as mixed

It may make sense to have a second call to infer_dtype with skipna=False if the initial call returns 'empty'.

Can you elaborate on this point?

Let me know if you have any questions, and thanks in advance for your contribution!

cosmicBboy avatar Apr 19 '22 01:04 cosmicBboy

Hey @cosmicBboy , my initial thought was that we'd first have a call to infer_dtype with skipna=True (the original logic). If this returns a dtype of empty we would then make a second call with skipna=False. My thought here is that we would still preserve parts of the original behaviour, and only utilize the null information if needed.

However, after further thinking, I think that overcomplicates things. I think just changing skipna from True to False would suffice. Let me know what you think.
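For comparison, the two-pass idea could be sketched roughly as below. The helper name infer_with_fallback is hypothetical and is not part of pandera or pandas; it only wraps pandas' infer_dtype to show how the fallback would differ from a single skipna=False call.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype


def infer_with_fallback(values):
    """Hypothetical helper: retry with skipna=False only when the
    null-skipping pass finds nothing to infer from."""
    inferred = infer_dtype(values, skipna=True)
    if inferred == "empty":
        inferred = infer_dtype(values, skipna=False)
    return inferred


all_nan = pd.Series([np.nan, np.nan], dtype="object")
partly_null = pd.Series(["a", np.nan], dtype="object")

# The all-null column only gets a result from the second pass.
fallback_all_nan = infer_with_fallback(all_nan)
# A partly-null column is settled by the first pass, as before.
fallback_partly_null = infer_with_fallback(partly_null)
# A single skipna=False call also counts the NaN in this column.
single_pass_partly_null = infer_dtype(partly_null, skipna=False)
print(fallback_all_nan, fallback_partly_null, single_pass_partly_null)
```

The fallback keeps the existing result for columns that are only partly null, while a plain skipna=False call lets the nulls influence those columns too; the simpler single-call change is the trade-off proposed here.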

will-gp avatar Apr 19 '22 18:04 will-gp

However, after further thinking, I think that overcomplicates things. I think just changing skipna from True to False would suffice. Let me know what you think.

Agreed! I think skipna=False would handle all the cases (🤞), so let's go ahead with that!

Let me know if you have any questions about contributing!

cosmicBboy avatar Apr 19 '22 18:04 cosmicBboy

@cosmicBboy @will-gp please take a look at #944 🚀

tpvasconcelos avatar Sep 15 '22 23:09 tpvasconcelos

Thanks @cosmicBboy! I think we can close this one now 🚀

tpvasconcelos avatar Sep 21 '22 20:09 tpvasconcelos

fixed by #944

cosmicBboy avatar Sep 21 '22 20:09 cosmicBboy