connector-x
                        cx read 9999-12-31 to 1689-12-31
What language are you using?
Python.
What version are you using?
0.3.1.
What database are you using?
PostgreSQL
What dataframe are you using?
Pandas and Arrow
Can you describe your bug?
My data warehouse has a timestamp column containing 9999-12-31, but cx can't read this year correctly.
What are the steps to reproduce the behavior?
If possible, please include a minimal simple example including:
Database setup if the error only happens on specific data or data type
Table schema and example data
Example query / code
sql="select '9999-12-31':timestamp as ts"
df=cx.read_sql(con,sql)
df.head()
1816-03-29 05:56:08.066277376
What is the error?
The year 9999 comes back as 1816.
Related to #415.
Essentially we produce an integer value for each timestamp (in nanoseconds), but that limits the range of dates that can be represented.
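For illustration only (this is not connectorx's actual code), here is a minimal sketch of how a far-future timestamp overflows a signed 64-bit nanosecond counter and wraps around to the 1816 date in the report:

import datetime as dt
import pandas as pd

# datetime64[ns] stores nanoseconds since 1970-01-01 in a signed 64-bit integer,
# so it can only represent dates between roughly 1677 and 2262.
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# 9999-12-31 expressed in nanoseconds since the epoch overflows int64 and wraps.
ns = int((dt.datetime(9999, 12, 31) - dt.datetime(1970, 1, 1)).total_seconds()) * 10**9
wrapped = (ns + 2**63) % 2**64 - 2**63  # simulate signed 64-bit wraparound
print(pd.Timestamp(wrapped))  # 1816-03-29 05:56:08.066277376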
Then how can I get the right timestamp? And why does pandas.read_sql not have this problem?
This seems to be more a problem of which return_type you specify; setting it to either arrow or polars produces the correct output:
cx.read_sql(con, "select '9999-12-31'::timestamp", return_type="polars")
shape: (1, 1)
┌─────────────────────┐
│ timestamp           │
│ ---                 │
│ datetime[ms]        │
╞═════════════════════╡
│ 9999-12-31 00:00:00 │
└─────────────────────┘
cx.read_sql(con, "select '9999-12-31'::timestamp", return_type="arrow")
pyarrow.Table
timestamp: date64[ms]
----
timestamp: [[9999-12-31]]
This is using Postgres 14, Polars 0.15.8 and pyarrow 10.0.0
Versions: Postgres 13, Polars 0.16.18, cx 0.3.2 alpha 2, pyarrow 11.0.0.
I am able to execute the query above with return_type="arrow" but not with return_type="polars", though I can call pl.from_arrow() successfully on the returned Arrow table.
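A minimal sketch of that workaround, assuming con is a placeholder Postgres connection string:

import connectorx as cx
import polars as pl

con = "postgresql://user:password@localhost:5432/db"  # placeholder; adjust for your database

# Read through Arrow, then hand the table to Polars.
tbl = cx.read_sql(con, "select '9999-12-31'::timestamp as ts", return_type="arrow")
df = pl.from_arrow(tbl)  # keeps the far-future timestamp intact
print(df)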
We're seeing this too with connectorx's Pandas return type.
The returned dataframe includes the default Pandas datetime precision: datetime64[ns], which is not capable of expressing a date in the year 9999.
Why does pandas.read_sql not have this problem?
Pandas does have the problem but doesn't error out: if dates come through that are beyond the bounds of datetime64[ns], the column/Series dtype falls back to object with embedded Python datetime values. That works, but you can't do any vectorized processing on those dates.
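A rough sketch of that fallback (behaviour can vary by pandas version):

import datetime as dt
import pandas as pd

# An out-of-bounds date typically falls back to an object-dtype column
# holding plain Python datetime objects instead of datetime64[ns].
s = pd.Series([dt.datetime(2020, 1, 1), dt.datetime(9999, 12, 31)])
print(s.dtype)  # object

# Vectorized datetime operations are unavailable on object dtype.
try:
    print(s.dt.year)
except AttributeError as exc:
    print("no vectorized .dt accessor:", exc)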
Even in Pandas 2, it uses nanoseconds internally in many places so it can be difficult to avoid. We are looking at Pandas alternatives in part because of this issue.
What does work is dropping output to Pyarrow/Polars, then converting to Pandas with the use_pyarrow_extension_array=True flag. Then the Pandas dataframe has a Pyarrow date or datetime datatype which has no issue supporting these dates.
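A sketch of that route, again assuming con is a placeholder Postgres connection string:

import connectorx as cx

con = "postgresql://user:password@localhost:5432/db"  # placeholder

# Read into Polars, then convert to pandas backed by pyarrow extension arrays,
# which have no trouble representing far-future dates.
pl_df = cx.read_sql(con, "select '9999-12-31'::timestamp as ts", return_type="polars")
pd_df = pl_df.to_pandas(use_pyarrow_extension_array=True)
print(pd_df.dtypes)  # a pyarrow-backed timestamp dtype rather than datetime64[ns]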
After some more research, the issue with nanosecond precision affects all timestamps in connectorx. We just weren't seeing it due to conversion of Oracle DATE datatypes, which are really timestamps, into date32[day] format. With the latest release of connectorx, that conversion was fixed so the issue around nanosecond precision is showing up for Oracle DATEs now too.
More details in #495, which is the same problem.
This is a fairly big problem for people that work with data in a data warehousing context, particularly on Oracle platforms where there is no true date-only datatype. There are often date columns in the database for type II slowly changing dimensions that have end dates in far-future years.