Unreliable schema for datetime columns and error in .glimpse()
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
Polars does not seem to handle Python datetime and pandas datetime (pd.Timestamp) objects reliably.
Consider this code:
import pandas as pd
import numpy as np
import polars as pl
pd.DataFrame({'Date':['2023-01-01']}).astype({'Date':'datetime64[ns]'}).pipe(pl.from_pandas).glimpse()
It results in this error:
File {...}\polars\internals\dataframe\frame.py:2615, in DataFrame.glimpse(self)
2612 val_str = ", ".join(map(str, val))
2613 return col_name, dtype_str, val_str
-> 2615 data = [_parse_column(s) for s in self.columns]
2617 # we make the first column as small as possible by taking the longest
2618 # column name
2619 max_col_name = max((len(col_name) for col_name, _, _ in data))
File {...}\polars\internals\dataframe\frame.py:2615, in <listcomp>(.0)
2612 val_str = ", ".join(map(str, val))
2613 return col_name, dtype_str, val_str
-> 2615 data = [_parse_column(s) for s in self.columns]
2617 # we make the first column as small as possible by taking the longest
2618 # column name
2619 max_col_name = max((len(col_name) for col_name, _, _ in data))
File {...}\polars\internals\dataframe\frame.py:2610, in DataFrame.glimpse.<locals>._parse_column(col_name)
2608 def _parse_column(col_name: str) -> tuple[str, str, str]:
2609 s = self[col_name]
-> 2610 dtype_str = "<" + s.dtype.__name__ + ">"
2611 val = s[:max_num_values].to_list()
2612 val_str = ", ".join(map(str, val))
AttributeError: 'Datetime' object has no attribute '__name__'
This seems rooted in the following:
The entry in the pandas dataframe after the .astype({'Date':'datetime64[ns]'}) is of type pandas._libs.tslibs.timestamps.Timestamp.
This converts into Polars just fine and is displayed as a datetime[ns] column, but it appears that it is not really that:
print(pd.DataFrame({'Date':['2023-01-01']}).astype({'Date':'datetime64[ns]'}).pipe(pl.from_pandas))
shape: (1, 1)
┌─────────────────────┐
│ Date                │
│ ---                 │
│ datetime[ns]        │
╞═════════════════════╡
│ 2023-01-01 00:00:00 │
└─────────────────────┘
Reproducible example
import pandas as pd
import numpy as np
import polars as pl
pd.DataFrame({'Date':['2023-01-01']}).astype({'Date':'datetime64[ns]'}).pipe(pl.from_pandas).glimpse()
Expected behavior
These should work:
from datetime import datetime
import pandas as pd
import numpy as np
import polars as pl
case1 = pd.DataFrame({'Date':['2023-01-01']}).astype({'Date':'datetime64[ns]'}).pipe(pl.from_pandas).glimpse()
case2 = pl.DataFrame({'Date':[datetime(2023,1,1)]}).glimpse()
Installed versions
---Version info---
Polars: 0.15.13
Index type: UInt32
Platform: Windows-10-10.0.22000-SP0
Python: 3.10.7 (tags/v3.10.7:6cc6b13, Sep 5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 8.0.0
pandas: 1.5.2
numpy: 1.22.4
fsspec: 2022.8.2
connectorx: 0.3.1
xlsx2csv: <not installed>
matplotlib: 3.6.2
There are other datatypes that also don't work, like lists:
In [51]: df2 = pl.DataFrame({
...: "text": ["sample1"],
...: "list": [[1, 2]]
...: })
In [52]: df2.glimpse()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-52-419f16bdd39a> in <module>
----> 1 df2.glimpse()
~/software/polars/py-polars/polars/internals/dataframe/frame.py in glimpse(self)
2613 return col_name, dtype_str, val_str
2614
-> 2615 data = [_parse_column(s) for s in self.columns]
2616
2617 # we make the first column as small as possible by taking the longest
~/software/polars/py-polars/polars/internals/dataframe/frame.py in <listcomp>(.0)
2613 return col_name, dtype_str, val_str
2614
-> 2615 data = [_parse_column(s) for s in self.columns]
2616
2617 # we make the first column as small as possible by taking the longest
~/software/polars/py-polars/polars/internals/dataframe/frame.py in _parse_column(col_name)
2608 def _parse_column(col_name: str) -> tuple[str, str, str]:
2609 s = self[col_name]
-> 2610 dtype_str = "<" + s.dtype.__name__ + ">"
2611 val = s[:max_num_values].to_list()
2612 val_str = ", ".join(map(str, val))
AttributeError: 'List' object has no attribute '__name__'
Using schema in glimpse is probably a safer approach:
In [54]: for col_name, dtype in df.schema.items():
...: print(col_name, dtype, dtype.string_repr())
...:
Date Datetime(tu='ns', tz=None) datetime[ns]
In [53]: for col_name, dtype in df2.schema.items():
...: print(col_name, dtype, dtype.string_repr())
...:
text Utf8 str
list List(Int64) list[i64]
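For context on why both tracebacks end at s.dtype.__name__: in Python a class has a __name__ attribute, but an instance of that class does not. The schema output above shows Utf8 as a bare class, while Datetime and List appear as parameterised instances. The sketch below reproduces that distinction with stand-in classes. DataType, Utf8, Datetime, naive_label and dtype_label are all hypothetical names used for illustration; they are not Polars' own classes and not the fix in #6091. It also shows one way a label helper could accept both forms.

```python
# Stand-in dtype classes (hypothetical; NOT the real Polars classes).
class DataType:
    """Base class for the toy dtypes below."""


class Utf8(DataType):
    """Unparameterised dtype: the class itself is used as the dtype."""


class Datetime(DataType):
    """Parameterised dtype: an instance carries the time unit."""

    def __init__(self, tu: str = "us") -> None:
        self.tu = tu


def naive_label(dtype: object) -> str:
    """Mimic the failing line in the traceback: assumes dtype is a class."""
    return "<" + dtype.__name__ + ">"


def dtype_label(dtype: object) -> str:
    """Return '<Name>' for a dtype given either as a class or as an instance."""
    # Classes have __name__; instances do not, so fall back to the instance's type.
    cls = dtype if isinstance(dtype, type) else type(dtype)
    return "<" + cls.__name__ + ">"


if __name__ == "__main__":
    print(naive_label(Utf8))              # works: Utf8 is a class
    print(dtype_label(Datetime("ns")))    # works: falls back to type(instance)
    try:
        naive_label(Datetime("ns"))       # fails the same way glimpse() did
    except AttributeError as exc:
        print("AttributeError:", exc)
```

With the toy classes, naive_label fails only for the parameterised dtype, which matches the pattern in the two tracebacks (Datetime and List fail, Utf8 does not).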
I have fixed the glimpse issue to use schema and string_repr, see #6091.
Can this be closed now @zundertj?