
Unreliable schema for datetime columns and error in .glimpse()

Open mkleinbort-ic opened this issue 3 years ago • 1 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

Polars does not seem to handle Python datetime and pandas Timestamp objects reliably.

Consider this code:

import pandas as pd
import polars as pl

pd.DataFrame({'Date':['2023-01-01']}).astype({'Date':'datetime64[ns]'}).pipe(pl.from_pandas).glimpse()

It results in this error:

File {...}\polars\internals\dataframe\frame.py:2615, in DataFrame.glimpse(self)
   2612     val_str = ", ".join(map(str, val))
   2613     return col_name, dtype_str, val_str
-> 2615 data = [_parse_column(s) for s in self.columns]
   2617 # we make the first column as small as possible by taking the longest
   2618 # column name
   2619 max_col_name = max((len(col_name) for col_name, _, _ in data))

File {...}\polars\internals\dataframe\frame.py:2615, in <listcomp>(.0)
   2612     val_str = ", ".join(map(str, val))
   2613     return col_name, dtype_str, val_str
-> 2615 data = [_parse_column(s) for s in self.columns]
   2617 # we make the first column as small as possible by taking the longest
   2618 # column name
   2619 max_col_name = max((len(col_name) for col_name, _, _ in data))

File {...}\polars\internals\dataframe\frame.py:2610, in DataFrame.glimpse.<locals>._parse_column(col_name)
   2608 def _parse_column(col_name: str) -> tuple[str, str, str]:
   2609     s = self[col_name]
-> 2610     dtype_str = "<" + s.dtype.__name__ + ">"
   2611     val = s[:max_num_values].to_list()
   2612     val_str = ", ".join(map(str, val))

AttributeError: 'Datetime' object has no attribute '__name__'
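
The traceback suggests that s.dtype is an instance of Datetime here rather than a class, and only classes carry __name__. A minimal sketch of that failure mode follows. Utf8 and Datetime below are hypothetical stand-in classes written for illustration, not the Polars implementations, and dtype_name is an assumed helper name.

```python
class Utf8:
    """Stand-in for a parameterless dtype that is passed around as the class itself."""


class Datetime:
    """Stand-in for a parameterized dtype that is passed around as an instance."""

    def __init__(self, tu="us", tz=None):
        self.tu = tu
        self.tz = tz


def dtype_name(dtype):
    """Return a type name for either a dtype class or a dtype instance."""
    if isinstance(dtype, type):
        # A class has __name__ directly.
        return dtype.__name__
    # An instance does not, so ask its class instead.
    return type(dtype).__name__


print(Utf8.__name__)  # attribute lookup on a class succeeds

try:
    Datetime("ns").__name__  # attribute lookup on an instance fails
except AttributeError as exc:
    print("AttributeError:", exc)

print(dtype_name(Utf8), dtype_name(Datetime("ns")))
```

This only illustrates why the attribute lookup fails; it says nothing about how the library should render the dtype.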

This seems to be rooted in the following: after .astype({'Date':'datetime64[ns]'}), the entry in the pandas DataFrame is of type pandas._libs.tslibs.timestamps.Timestamp.

The column converts into Polars just fine and is displayed as datetime[ns], but its dtype does not appear to behave like a regular one:

print(pd.DataFrame({'Date':['2023-01-01']}).astype({'Date':'datetime64[ns]'}).pipe(pl.from_pandas))

shape: (1, 1)
┌─────────────────────┐
│ Date                │
│ ---                 │
│ datetime[ns]        │
╞═════════════════════╡
│ 2023-01-01 00:00:00 │
└─────────────────────┘

Reproducible example

import pandas as pd
import polars as pl

pd.DataFrame({'Date':['2023-01-01']}).astype({'Date':'datetime64[ns]'}).pipe(pl.from_pandas).glimpse()

Expected behavior

These should work:

from datetime import datetime
import pandas as pd
import polars as pl

case1 = pd.DataFrame({'Date':['2023-01-01']}).astype({'Date':'datetime64[ns]'}).pipe(pl.from_pandas).glimpse()

case2 = pl.DataFrame({'Date':[datetime(2023,1,1)]}).glimpse()

Installed versions

---Version info---
Polars: 0.15.13
Index type: UInt32
Platform: Windows-10-10.0.22000-SP0
Python: 3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 8.0.0
pandas: 1.5.2
numpy: 1.22.4
fsspec: 2022.8.2
connectorx: 0.3.1
xlsx2csv: <not installed>
matplotlib: 3.6.2

mkleinbort-ic avatar Jan 06 '23 12:01 mkleinbort-ic

There are other datatypes that also don't work, like lists:

In [51]: df2 = pl.DataFrame({
    ...:     "text": ["sample1"],
    ...:     "list": [[1, 2]]
    ...: })

In [52]: df2.glimpse()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-52-419f16bdd39a> in <module>
----> 1 df2.glimpse()

~/software/polars/py-polars/polars/internals/dataframe/frame.py in glimpse(self)
   2613             return col_name, dtype_str, val_str
   2614 
-> 2615         data = [_parse_column(s) for s in self.columns]
   2616 
   2617         # we make the first column as small as possible by taking the longest

~/software/polars/py-polars/polars/internals/dataframe/frame.py in <listcomp>(.0)
   2613             return col_name, dtype_str, val_str
   2614 
-> 2615         data = [_parse_column(s) for s in self.columns]
   2616 
   2617         # we make the first column as small as possible by taking the longest

~/software/polars/py-polars/polars/internals/dataframe/frame.py in _parse_column(col_name)
   2608         def _parse_column(col_name: str) -> tuple[str, str, str]:
   2609             s = self[col_name]
-> 2610             dtype_str = "<" + s.dtype.__name__ + ">"
   2611             val = s[:max_num_values].to_list()
   2612             val_str = ", ".join(map(str, val))

AttributeError: 'List' object has no attribute '__name__'

Using schema in glimpse is probably a safer approach:

In [54]: for col_name, dtype in df.schema.items():
    ...:     print(col_name, dtype, dtype.string_repr())
    ...: 
Date Datetime(tu='ns', tz=None) datetime[ns]

In [53]: for col_name, dtype in df2.schema.items():
    ...:     print(col_name, dtype, dtype.string_repr())
    ...: 
text Utf8 str
list List(Int64) list[i64]
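
A rough sketch of how the per-column parsing in glimpse could be driven by the schema instead of dtype.__name__. This is plain Python with stand-in data, not the Polars source: glimpse_rows is a hypothetical helper, the schema is assumed to map column names to an already formatted dtype string, and the data is a plain dict of lists.

```python
def glimpse_rows(schema, data, max_num_values=10):
    """Build (column name, dtype string, values string) rows from a schema mapping."""
    rows = []
    for col_name, dtype in schema.items():
        # str() works for classes, instances and plain strings alike,
        # so no dtype needs to expose __name__.
        dtype_str = "<" + str(dtype) + ">"
        values = data[col_name][:max_num_values]
        val_str = ", ".join(map(str, values))
        rows.append((col_name, dtype_str, val_str))
    return rows


# Stand-in schema and data mirroring the df2 example above (assumed values).
schema = {"text": "str", "list": "list[i64]"}
data = {"text": ["sample1"], "list": [[1, 2]]}

rows = glimpse_rows(schema, data)

# Pad the first column to the longest column name, as the original code does.
max_col_name = max(len(col_name) for col_name, _, _ in rows)
for col_name, dtype_str, val_str in rows:
    print(col_name.ljust(max_col_name), dtype_str, val_str)
```

In a real DataFrame the dtype string would come from the dtype's own string representation, as in the loop above, rather than from a hand-written mapping.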

ghuls avatar Jan 06 '23 12:01 ghuls

I have fixed the glimpse issue to use schema and string_repr, see #6091.

zundertj avatar Jan 06 '23 23:01 zundertj

Can this be closed now @zundertj?

ritchie46 avatar Jan 07 '23 15:01 ritchie46