pandera
pa.infer_schema throws TypeError when column is null and dtype is object
Describe the bug
pa.infer_schema(df) throws a TypeError when a given column contains all nulls and the column dtype is object. This occurs within _get_array_type in schema_statistics.py. Specifically, if a column contains all null values, the call to infer_dtype on line 183 will return 'empty'. The subsequent call to pandas_engine.Engine.dtype(inferred_alias) will throw a TypeError. Note: This only occurs when the original column dtype is object.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/USER/miniconda3/envs/piggy/lib/python3.7/site-packages/pandera/schema_statistics.py", line 184, in _get_array_type
data_type = pandas_engine.Engine.dtype(inferred_alias)
File "/home/USER/miniconda3/envs/piggy/lib/python3.7/site-packages/pandera/engines/pandas_engine.py", line 147, in dtype
np_or_pd_dtype = pd.api.types.pandas_dtype(data_type)
File "/home/USER/miniconda3/envs/piggy/lib/python3.7/site-packages/pandas/core/dtypes/common.py", line 1776, in pandas_dtype
npdtype = np.dtype(dtype)
TypeError: data type 'empty' not understood
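The failure can be reproduced without pandera, using only the two pandas calls that appear in the traceback. This is a minimal sketch of the mechanism, not pandera's actual code path; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# An all-null column stored with object dtype, as in the report.
col = pd.Series([np.nan, np.nan], dtype="object")

# With skipna=True the nulls are dropped first, leaving nothing to inspect.
inferred_alias = pd.api.types.infer_dtype(col, skipna=True)
print(inferred_alias)

# The resulting alias is not a real dtype name, so resolving it fails.
try:
    pd.api.types.pandas_dtype(inferred_alias)
    resolved = True
except TypeError as exc:
    resolved = False
    print(f"TypeError: {exc}")
```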
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandera.
- [x] (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd
import pandera as pa
from pandera import schema_statistics
a = pd.Series([np.nan, np.nan])
# Change the dtype to object
a = a.astype('object')
# This raises the TypeError shown in the traceback above
stats = schema_statistics._get_array_type(a)
Expected behavior
Given that infer_schema should make its best guess as to the schema, would it not make sense for it to do so, instead of throwing a TypeError?
We can actually use the null information at schema inference time to provide additional information. At the moment, null values are ignored during inference, as per line 183. However, the following calls to infer_dtype would produce a dtype estimate if skipna=False (and the original column is of dtype object):
- [NaT, NaT, ..., NaT] --> would be inferred as datetime
- [np.nan, np.nan, ..., np.nan] --> would be inferred as floating point
- [None, ..., None] --> would be inferred as mixed
I understand that this would change the expected behaviour of the schema inference function. It may make sense to make a second call to infer_dtype with skipna=False if the initial call returns 'empty'. Happy to discuss this, and to take it on once decided.
Desktop (please complete the following information):
- OS: Debian 10 (Buster)
- Python 3.7.2
Hi @will-gp, thanks for the bug report! I do think the use case you describe should be supported.
> However, the following calls to infer_dtype would produce a dtype estimate if skipna=False (and the original column is of dtype object)
+1 to skipna=False
Feel free to make changes to the schema_statistics module and add some unit tests for these cases:
> - [NaT, NaT, ..., NaT] --> would be inferred as datetime
> - [np.nan, np.nan, ..., np.nan] --> would be inferred as floating point
> - [None, ..., None] --> would be inferred as mixed
> It may make sense to make a second call to infer_dtype with skipna=False if the initial call returns 'empty'.
Can you elaborate on this point?
Let me know if you have any questions, and thanks in advance for your contribution!
Hey @cosmicBboy, my initial thought was that we'd first call infer_dtype with skipna=True (the original logic). If this returns 'empty', we would then make a second call with skipna=False. That way we would still preserve the original behaviour and only use the null information when needed.
However, after further thinking, I think that overcomplicates things. I think just changing skipna from True to False would suffice. Let me know what you think.
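For reference, the two options weighed in this comment can be sketched as small helpers around pandas' infer_dtype. Both function names are hypothetical and neither is pandera's actual implementation; they only illustrate the difference between the two-step fallback and the simpler switch to skipna=False.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype


def infer_alias_two_step(col):
    """Hypothetical helper: skip nulls first, consult them only as a fallback."""
    alias = infer_dtype(col, skipna=True)
    if alias == "empty":
        # Nothing but nulls was found, so let the nulls inform the guess.
        alias = infer_dtype(col, skipna=False)
    return alias


def infer_alias_keep_nulls(col):
    """Hypothetical helper: always let nulls take part in inference."""
    return infer_dtype(col, skipna=False)


all_nan = pd.Series([np.nan, np.nan], dtype="object")
strings_with_null = pd.Series(["a", None], dtype="object")

print(infer_alias_two_step(all_nan))
print(infer_alias_two_step(strings_with_null))
print(infer_alias_keep_nulls(all_nan))
```

Note that a column with zero rows still yields 'empty' under either variant, so any code resolving the alias into a dtype would probably still need to handle that label.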
> However, after further thinking, I think that overcomplicates things. I think just changing skipna from True to False would suffice. Let me know what you think.
Agreed! I think skipna=False would handle all the cases (🤞), so let's go ahead with that!
Let me know if you have any questions about contributing!
@cosmicBboy @will-gp please take a look at #944 🚀
Thanks @cosmicBboy! I think we can close this one now 🚀
fixed by #944