
cx reads 9999-12-31 as 1689-12-31

Open wonb168 opened this issue 2 years ago • 6 comments

What language are you using?

Python.

What version are you using?

0.3.1.

What database are you using?

PostgreSQL

What dataframe are you using?

Pandas and Arrow

Can you describe your bug?

My data warehouse has a timestamp column containing the value 9999-12-31, but cx can't read this year correctly.

What are the steps to reproduce the behavior?

If possible, please include a minimal simple example including:

Database setup if the error only happens on specific data or data type

Table schema and example data

Example query / code
sql="select '9999-12-31':timestamp as ts"
df=cx.read_sql(con,sql)
df.head()

1816-03-29 05:56:08.066277376

What is the error?

The year 9999 is returned as 1816.

wonb168 avatar Dec 08 '22 14:12 wonb168

Related to #415.

Essentially we produce an integer value for each timestamp (in nanoseconds), but that representation has a limited covered range.
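
For illustration, here is a minimal sketch (assuming numpy is available; not part of the original discussion) of why the year 9999 wraps around to 1816 when a timestamp is stored as nanoseconds in a signed 64-bit integer:

from datetime import datetime, timezone
import numpy as np

# 9999-12-31 expressed as nanoseconds since the Unix epoch does not fit in a
# signed 64-bit integer, which is what datetime64[ns] uses under the hood.
ns = int(datetime(9999, 12, 31, tzinfo=timezone.utc).timestamp()) * 10**9
print(ns)                      # 253402214400000000000 (~2.5e20)
print(np.iinfo(np.int64).max)  # 9223372036854775807   (~9.2e18)

# Wrapping the value into 64 bits (two's complement) lands in 1816, which
# matches the 1816-03-29 05:56:08.066277376 output reported above.
wrapped = (ns + 2**63) % 2**64 - 2**63
print(np.datetime64(wrapped, "ns"))  # 1816-03-29T05:56:08.066277376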

wangxiaoying avatar Dec 09 '22 03:12 wangxiaoying

Then how can I get the right timestamp? And why doesn't pandas.read_sql have this problem?

wonb168 avatar Dec 09 '22 03:12 wonb168

This seems to be more a problem of which return_type you specify; setting it to either arrow or polars produces the correct output:

cx.read_sql(con, "select '9999-12-31'::timestamp", return_type="polars")
shape: (1, 1)
┌─────────────────────┐
│ timestamp           │
│ ---                 │
│ datetime[ms]        │
╞═════════════════════╡
│ 9999-12-31 00:00:00 │
└─────────────────────┘
cx.read_sql(con, "select '9999-12-31'::timestamp", return_type="arrow")
pyarrow.Table
timestamp: date64[ms]
----
timestamp: [[9999-12-31]]

This is using Postgres 14, Polars 0.15.8, and pyarrow 10.0.0.

M-Salti avatar Dec 31 '22 15:12 M-Salti

Versions: Postgres 13, Polars 0.16.18, cx 0.3.2 alpha 2, pyarrow 11.0.0.

I am able to execute the query above with return_type="arrow" but not with return_type="polars", though I can call pl.from_arrow() successfully on the resulting Arrow table.
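
For reference, a sketch of that workaround, with a placeholder connection string (not from the original report): read with return_type="arrow" and convert the Arrow table with pl.from_arrow.

import connectorx as cx
import polars as pl

con = "postgresql://user:password@localhost:5432/db"  # placeholder DSN

# Reading as Arrow succeeds, and the Arrow table converts cleanly to Polars,
# even if return_type="polars" errors out directly.
tbl = cx.read_sql(con, "select '9999-12-31'::timestamp as ts", return_type="arrow")
df = pl.from_arrow(tbl)
print(df)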

rafmagns-skepa-dreag avatar Jul 05 '23 23:07 rafmagns-skepa-dreag

We're seeing this too with connectorx's Pandas return type.

The returned dataframe uses the default Pandas datetime precision, datetime64[ns], which cannot express a date in the year 9999.

Why doesn't pandas.read_sql have this problem?

Pandas does have the problem, but it doesn't error out: if dates come through that are beyond the bounds of datetime64[ns], the column/Series dtype falls back to object with embedded Python datetime values. That works, but you can't do any vectorized processing on those dates.

Even in Pandas 2, nanoseconds are used internally in many places, so this is difficult to avoid. We are looking at Pandas alternatives in part because of this issue.
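
For illustration, a small sketch of the nanosecond bound and the object-dtype fallback described above (exact constructor behavior may vary a little between pandas versions):

import datetime as dt
import pandas as pd

# datetime64[ns] tops out in April 2262, so a year-9999 value cannot be stored natively.
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# Out-of-range datetimes typically end up in an object-dtype column holding
# plain Python datetime objects, which rules out vectorized datetime operations.
s = pd.Series([dt.datetime(9999, 12, 31)])
print(s.dtype)    # object
print(s.iloc[0])  # datetime.datetime(9999, 12, 31, 0, 0)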

What does work is returning the output as PyArrow/Polars, then converting to Pandas with the use_pyarrow_extension_array=True flag. The resulting Pandas dataframe then has a PyArrow date or datetime datatype, which has no issue supporting these dates.
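
A sketch of that path, using a placeholder connection string and the query from the original report:

import connectorx as cx

con = "postgresql://user:password@localhost:5432/db"  # placeholder DSN
sql = "select '9999-12-31'::timestamp as ts"

# Read through Polars (Arrow-backed), then convert to pandas with
# pyarrow-backed extension arrays so the year-9999 value survives intact.
pl_df = cx.read_sql(con, sql, return_type="polars")
pd_df = pl_df.to_pandas(use_pyarrow_extension_array=True)
print(pd_df.dtypes)  # a pyarrow-backed timestamp dtype rather than datetime64[ns]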

klaerik avatar Sep 08 '23 22:09 klaerik

After some more research, the nanosecond-precision issue affects all timestamps in connectorx. We just weren't seeing it because Oracle DATE datatypes, which are really timestamps, were being converted into date32[day] format. With the latest release of connectorx, that conversion was fixed, so the nanosecond-precision issue now shows up for Oracle DATEs too.

More details are in #495, which is the same problem.

This is a fairly big problem for people who work with data in a data warehousing context, particularly on Oracle platforms where there is no true date-only datatype. Databases often have date columns for type 2 slowly changing dimensions whose end dates lie in far-future years.

klaerik avatar Sep 19 '23 16:09 klaerik