connector-x icon indicating copy to clipboard operation
connector-x copied to clipboard

Timezones for Arrow when loading from postgres

Open zilto opened this issue 5 months ago • 1 comments

Info

  • language: python
  • connectorx version: 0.4.3
  • tested on Postgres, but we use Connectorx for multiple backends.
  • uses pyarrow^=18

Description

Changes from tz="utc" to tz=+00:00 described in this issue and implemented here are making test fails (changes introduced in 0.4.2).

Investigation

  • We are using return_type=arrow2 deprecated in 0.4.2 and are now upgrading to 0.4.3.
  • I switched to return_type=arrow and some tests started failing with

    Cannot locate timezone '+00:00': +00:00 not found in timezone database.

  • I revert to 0.4.1, and now return_type=arrow works without issue.
  • I pinpoint the difference to the returning value of connectorx.read_sql(...). The timestamp timezone value differs between the two schema.
    # in 0.4.1 with `arrow`
    id: int64
    email: string
    display_name: string
    created_at: timestamp[us, tz=UTC]
    updated_at: timestamp[us, tz=UTC]
    
    # in 0.4.3 with `arrow`
    id: int32 not null
    email: string not null
    display_name: string not null
    created_at: timestamp[us, tz=+00:00] not null
    updated_at: timestamp[us, tz=+00:00] not null
    
  • In our codebase, the errors is thrown when calling pyarrow.csv.CSVWrite().write().
  • During manual inspection, we see that the data in the array looks the same between 0.4.1 and 0.4.3. In 0.4.3, the data can't be accessed at all suggesting that the issue is not specific to pyarrow.csv.CSVWriter
    # in 0.4.1 (success)
    (Pdb) df.column("created_at")
    <pyarrow.lib.ChunkedArray object at 0x79d617d8ebc0>
    [
      [
        2025-05-27 00:34:26.698854Z,
        2025-05-27 00:35:42.698854Z,
        ...
      ]
    ]
    (Pdb) df.column("created_at")[0]
    <pyarrow.TimestampScalar: '2025-05-27T00:34:26.698854+0000'>
    
    # in 0.4.3 (failure)
    (Pdb) df.column("created_at")
    <pyarrow.lib.ChunkedArray object at 0x76a42204ad40>
    [
      [
        2025-05-27 00:35:38.619809Z,
        2025-05-27 00:37:42.619809Z,
        ...
      ]
    ]
    (Pdb) df.column("created_at")[0]
    *** pyarrow.lib.ArrowInvalid: Cannot locate timezone '+00:00': +00:00 not found in timezone database
    

Reading the issue and the context here. I'm suprised that someone has the opposite issue (UTC is unknown, but +00:00 is known). Maybe the arrow rust and Python tooling come with different default known timezones? If it's the case, maybe it should be raised to arrow

Repro

I'm on Ubuntu 24.04. I know Windows can have errors with the IANA timezone database.

postgres repro

The table is created via sqlalchemy in a test fixture

        Table(
            "app_user",
            self.metadata,
            Column("id", Integer(), primary_key=True, autoincrement=True),
            Column("email", Text(), nullable=False, unique=True),
            Column("display_name", Text(), nullable=False),
            Column(
                "created_at",
                DateTime(timezone=True),
                nullable=False,
                server_default=func.now(),
            ),
            Column(
                "updated_at",
                DateTime(timezone=True),
                nullable=False,
                server_default=func.now(),
            ),
        )

pyarrow repro

I asked the pyarrow docs chatbot available for a repro. The code it generates also produces the same error.

import pyarrow as pa
from datetime import datetime

ts_type = pa.timestamp('s', tz='+07:30')
datetimes = [datetime(2020, 1, 1, 8, 0, 0), datetime(2020, 1, 2, 9, 30, 0)]
array = pa.array(datetimes, type=ts_type)
print(array)

>>> array[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/scalar.pxi", line 735, in pyarrow.lib.TimestampScalar.__repr__
  File "/home/tjean/projects/dlthub/dlt/.venv/lib/python3.11/site-packages/pyarrow/compute.py", line 269, in wrapper
    return func.call(args, options, memory_pool)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_compute.pyx", line 407, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot locate timezone '+07:30': +07:30 not found in timezone database

Upgrading to the latest pyarrow^=20 didn't solve the issue.

Solutions

  • We pin dependencies to <=0.4.1 and set return_type=arrow. Though, we'd like to benefit from the improvements to decimal handling.
  • We manually pyarrow.Table.cast() on the return values of connectorx.read_sql() if it contains timezones

Additional context

I maintain the Python library dlt and we offer ConnectorX as an optional dependency. It provides significant performance improvements and we promote it to OSS and enterprise users.

zilto avatar May 27 '25 01:05 zilto