date32 type not supported using infer_schema
Describe the bug
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
I was hoping that pandera.infer_schema could handle pyarrow date32 columns, but calling it on a DataFrame read back from such a Parquet file raises a TypeError.
Code Sample, a copy-pastable example
import pandas as pd
import pyarrow as pa
from io import BytesIO
import pandera
df = pd.DataFrame([pd.Timestamp.now().date()], columns=['mydate'])
pqtypes = {
    'mydate': pa.date32(),
}
buffer = BytesIO()
df.to_parquet(
    buffer,
    engine='pyarrow',
    schema=pa.schema([pa.field(x, y) for x, y in pqtypes.items()])
)
buffer.seek(0)
del df
df2 = pd.read_parquet(buffer)
pandera.infer_schema(df2)
Traceback (most recent call last):
File "C:\Users\\AppData\Local\pypoetry\Cache\virtualenvs\ge-test-EiC6WPHn-py3.8\lib\site-packages\pandera\engines\pandas_engine.py", line 137, in dtype
return engine.Engine.dtype(cls, data_type)
File "C:\Users\\AppData\Local\pypoetry\Cache\virtualenvs\ge-test-EiC6WPHn-py3.8\lib\site-packages\pandera\engines\engine.py", line 210, in dtype
raise TypeError(
TypeError: Data type 'date' not understood by Engine.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users//AppData/Roaming/JetBrains/PyCharmCE2021.3/scratches/scratch.py", line 25, in <module>
pandera.infer_schema(df2)
File "C:\Users\\AppData\Local\pypoetry\Cache\virtualenvs\ge-test-EiC6WPHn-py3.8\lib\site-packages\pandera\schema_inference.py", line 38, in infer_schema
return infer_dataframe_schema(pandas_obj)
File "C:\Users\\AppData\Local\pypoetry\Cache\virtualenvs\ge-test-EiC6WPHn-py3.8\lib\site-packages\pandera\schema_inference.py", line 72, in infer_dataframe_schema
df_statistics = infer_dataframe_statistics(df)
File "C:\Users\\AppData\Local\pypoetry\Cache\virtualenvs\ge-test-EiC6WPHn-py3.8\lib\site-packages\pandera\schema_statistics.py", line 15, in infer_dataframe_statistics
inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
File "C:\Users\\AppData\Local\pypoetry\Cache\virtualenvs\ge-test-EiC6WPHn-py3.8\lib\site-packages\pandera\schema_statistics.py", line 15, in <dictcomp>
inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
File "C:\Users\\AppData\Local\pypoetry\Cache\virtualenvs\ge-test-EiC6WPHn-py3.8\lib\site-packages\pandera\schema_statistics.py", line 185, in _get_array_type
data_type = pandas_engine.Engine.dtype(inferred_alias)
File "C:\Users\\AppData\Local\pypoetry\Cache\virtualenvs\ge-test-EiC6WPHn-py3.8\lib\site-packages\pandera\engines\pandas_engine.py", line 155, in dtype
np_or_pd_dtype = pd.api.types.pandas_dtype(data_type)
File "C:\Users\\AppData\Local\pypoetry\Cache\virtualenvs\ge-test-EiC6WPHn-py3.8\lib\site-packages\pandas\core\dtypes\common.py", line 1777, in pandas_dtype
npdtype = np.dtype(dtype)
TypeError: data type 'date' not understood
Expected behavior
Hoping basic date32 types can be handled along with timestamps.
Desktop (please complete the following information):
- OS: Windows
- pandera: 0.11.0
- pandas: 1.4.2
- pyarrow: 8.0.0
Hi @andycarter85. Thanks for the reproducible example.
Pyarrow supports date types (date32 and date64) but pandas does not. Pandas only supports datetimes, via numpy.datetime64.
However, pandas does allow wrapping any Python object in a column of NumPy's object data type. That's how pyarrow can translate its own date types to pandas.
If I expand your example:
df2.info()
#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 1 entries, 0 to 0
#> Data columns (total 1 columns):
#> # Column Non-Null Count Dtype
#> --- ------ -------------- -----
#> 0 mydate 1 non-null object
#> dtypes: object(1)
#> memory usage: 136.0+ bytes
# looking at the first element of the "date32" column
print(f"{type(df2.iloc[0,0])=}")
#> type(df2.iloc[0,0])=<class 'datetime.date'>
You can see that pd.read_parquet (pyarrow under the hood) has translated date32 to Python's standard datetime.date.
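To connect this to the traceback, here is a minimal sketch that skips Parquet entirely and builds the same kind of object column by hand. It assumes, going only by the inferred_alias variable visible in the traceback, that pandera asks pandas to name the contents of an object column and then looks that name up as a dtype; the variable names below are illustrative.

```python
import datetime

import pandas as pd

# A column of plain datetime.date values is boxed in an object column,
# just like the date32 column after pd.read_parquet.
s = pd.Series([datetime.date(2022, 6, 1)], name="mydate")
print(s.dtype)

# pandas can describe what the object column holds...
inferred = pd.api.types.infer_dtype(s, skipna=True)
print(inferred)

# ...but that description is not a dtype name pandas or numpy understands.
try:
    pd.api.types.pandas_dtype(inferred)
    alias_understood = True
except TypeError as exc:
    alias_understood = False
    print(f"TypeError: {exc}")
```

If that assumption holds, the 'date' in the error message is the label pandas infers for the column contents, not a dtype that any engine has registered.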
We recently added support for logical data types, a mechanism to cover extra data types not officially supported by pandas. For example, we added a Decimal data type, which pyarrow supports but which pandas likewise boxes in an object column. Logical data types should be part of the next release.
@cosmicBboy @andycarter85 tl;dr: I can add support for a Date logical type to extend the coverage of pyarrow types.
> I can add support for a Date logical type to extend the coverage of pyarrow types.
Yes, that would be awesome!
@andycarter85 does using the object type work for you as a temporary workaround?
I'm not very familiar with pandera at the moment. Is there a way I can adapt my infer_schema call temporarily until support for Date types is introduced?
@andycarter85 how are you using infer_schema in your workflow?
I am just getting started with pandera tbh. We have some large pre-existing datasets that I wanted to try inferring a yaml schema for, and then iterate from there, rather than start building a schema from scratch.
As it looks like the issue has been resolved in #887, I'm happy to wait for the next release and try again then.
Not sure if the pa.date32 type is important in your use case, but a workaround here would be to convert all the columns containing dates into pandas-supported datetime64 before calling infer_schema.
import pandas as pd
import pyarrow as pa
from io import BytesIO
import pandera
df = pd.DataFrame([pd.Timestamp.now().date()], columns=['mydate'])
pqtypes = {
    'mydate': pa.date32(),
}
buffer = BytesIO()
df.to_parquet(
    buffer,
    engine='pyarrow',
    schema=pa.schema([pa.field(x, y) for x, y in pqtypes.items()])
)
buffer.seek(0)
df2 = pd.read_parquet(buffer).astype({k: "datetime64[ns]" for k in pqtypes})
schema = pandera.infer_schema(df2)
print(schema.to_script())
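The snippet above relies on knowing the date column names up front (the pqtypes dict). For large pre-existing datasets where the columns are not known in advance, a rough sketch of the same idea is to detect object columns that hold only datetime.date values and convert those. The helper name dates_to_datetime64 is made up for illustration and is not part of pandera or pandas.

```python
import datetime

import pandas as pd


def dates_to_datetime64(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which object columns of datetime.date become datetime64."""
    out = df.copy()
    for col in out.columns:
        is_object = out[col].dtype == object
        # infer_dtype reports "date" for object columns of datetime.date values
        # (missing values are ignored with skipna=True).
        if is_object and pd.api.types.infer_dtype(out[col], skipna=True) == "date":
            out[col] = pd.to_datetime(out[col])
    return out


df = pd.DataFrame(
    {
        "mydate": [datetime.date(2022, 6, 1), None],
        "name": ["a", "b"],
    }
)
converted = dates_to_datetime64(df)
print(converted.dtypes)
```

The converted frame can then be passed to pandera.infer_schema as in the example above. Note that the date-only nature of the column is lost: the inferred schema will describe it as a datetime column.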
Another option would be to register a custom dtype (see https://pandera.readthedocs.io/en/stable/dtypes.html#example) that inherits from pandas_engine.DateTime. Its coerce method would then handle the conversion of datetime.date objects into pandas-supported datetime64[ns].