connector-x
connector-x copied to clipboard
Timezones for Arrow when loading from postgres
Info
- language: python
- connectorx version:
0.4.3 - tested on Postgres, but we use Connectorx for multiple backends.
- uses
pyarrow^=18
Description
Changes from tz="utc" to tz=+00:00 described in this issue and implemented here are making test fails (changes introduced in 0.4.2).
Investigation
- We are using
return_type=arrow2deprecated in0.4.2and are now upgrading to0.4.3. - I switched to
return_type=arrowand some tests started failing withCannot locate timezone '+00:00': +00:00 not found in timezone database.
- I revert to
0.4.1, and nowreturn_type=arrowworks without issue. - I pinpoint the difference to the returning value of
connectorx.read_sql(...). Thetimestamptimezone value differs between the two schema.# in 0.4.1 with `arrow` id: int64 email: string display_name: string created_at: timestamp[us, tz=UTC] updated_at: timestamp[us, tz=UTC] # in 0.4.3 with `arrow` id: int32 not null email: string not null display_name: string not null created_at: timestamp[us, tz=+00:00] not null updated_at: timestamp[us, tz=+00:00] not null - In our codebase, the errors is thrown when calling pyarrow.csv.CSVWrite().write().
- During manual inspection, we see that the data in the array looks the same between
0.4.1and0.4.3. In0.4.3, the data can't be accessed at all suggesting that the issue is not specific to pyarrow.csv.CSVWriter# in 0.4.1 (success) (Pdb) df.column("created_at") <pyarrow.lib.ChunkedArray object at 0x79d617d8ebc0> [ [ 2025-05-27 00:34:26.698854Z, 2025-05-27 00:35:42.698854Z, ... ] ] (Pdb) df.column("created_at")[0] <pyarrow.TimestampScalar: '2025-05-27T00:34:26.698854+0000'> # in 0.4.3 (failure) (Pdb) df.column("created_at") <pyarrow.lib.ChunkedArray object at 0x76a42204ad40> [ [ 2025-05-27 00:35:38.619809Z, 2025-05-27 00:37:42.619809Z, ... ] ] (Pdb) df.column("created_at")[0] *** pyarrow.lib.ArrowInvalid: Cannot locate timezone '+00:00': +00:00 not found in timezone database
Reading the issue and the context here. I'm suprised that someone has the opposite issue (UTC is unknown, but +00:00 is known). Maybe the arrow rust and Python tooling come with different default known timezones? If it's the case, maybe it should be raised to arrow
Repro
I'm on Ubuntu 24.04. I know Windows can have errors with the IANA timezone database.
postgres repro
The table is created via sqlalchemy in a test fixture
Table(
"app_user",
self.metadata,
Column("id", Integer(), primary_key=True, autoincrement=True),
Column("email", Text(), nullable=False, unique=True),
Column("display_name", Text(), nullable=False),
Column(
"created_at",
DateTime(timezone=True),
nullable=False,
server_default=func.now(),
),
Column(
"updated_at",
DateTime(timezone=True),
nullable=False,
server_default=func.now(),
),
)
pyarrow repro
I asked the pyarrow docs chatbot available for a repro. The code it generates also produces the same error.
import pyarrow as pa
from datetime import datetime
ts_type = pa.timestamp('s', tz='+07:30')
datetimes = [datetime(2020, 1, 1, 8, 0, 0), datetime(2020, 1, 2, 9, 30, 0)]
array = pa.array(datetimes, type=ts_type)
print(array)
>>> array[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/scalar.pxi", line 735, in pyarrow.lib.TimestampScalar.__repr__
File "/home/tjean/projects/dlthub/dlt/.venv/lib/python3.11/site-packages/pyarrow/compute.py", line 269, in wrapper
return func.call(args, options, memory_pool)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_compute.pyx", line 407, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot locate timezone '+07:30': +07:30 not found in timezone database
Upgrading to the latest pyarrow^=20 didn't solve the issue.
Solutions
- We pin dependencies to
<=0.4.1and setreturn_type=arrow. Though, we'd like to benefit from the improvements to decimal handling. - We manually
pyarrow.Table.cast()on the return values ofconnectorx.read_sql()if it contains timezones
Additional context
I maintain the Python library dlt and we offer ConnectorX as an optional dependency. It provides significant performance improvements and we promote it to OSS and enterprise users.