ResultSet `fetchmany_arrow`/`fetchall_arrow` methods fail during `concat_tables`
Hi there,
I'm using this client library to fetch large amounts of data from our DBX environment. The version I'm using is 3.3.0.
The library keeps crashing when it attempts to concatenate the current and partial Arrow results. I cannot attach the full trace because it contains some of our internal schemas, but here is the gist of it:
creation_date
    First Schema:  creation_date: timestamp[us, tz=Etc/UTC]
    Second Schema: creation_date: timestamp[us, tz=Etc/UTC] not null

status
    First Schema:  status: string
    Second Schema: status: string not null

sender_id
    First Schema:  sender_id: string
    Second Schema: sender_id: string not null

...

and a few other fields with the exact same nullable vs. not null discrepancy.
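To make the failure mode concrete, here is a minimal standalone pyarrow sketch (the `status` column from the diff above, but with synthetic values, not our real data) showing that a nullability-only schema difference is enough to make `concat_tables` raise:

```python
import pyarrow as pa

# Two batches whose schemas differ only in the nullable flag, mirroring the
# "First Schema" / "Second Schema" diff above.
nullable_schema = pa.schema([pa.field("status", pa.string(), nullable=True)])
not_null_schema = pa.schema([pa.field("status", pa.string(), nullable=False)])

t1 = pa.table({"status": ["new", "shipped"]}, schema=nullable_schema)
t2 = pa.table({"status": ["delivered"]}, schema=not_null_schema)

# Default promotion requires identical schemas, so this raises
# pyarrow.lib.ArrowInvalid: Schema at index 1 was different: ...
pa.concat_tables([t1, t2])
```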
The exact stack trace is:
Traceback (most recent call last):
File "/Users/X/work/scripts/raw_order.py", line 34, in <module>
for r in tqdm(cursor, total=max_items):
File "/Users/X/work/.venv/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/Users/X/work//.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 422, in __iter__
for row in self.active_result_set:
File "/Users/X/work//.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1112, in __iter__
row = self.fetchone()
File "/Users/X/work/.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1217, in fetchone
res = self._convert_arrow_table(self.fetchmany_arrow(1))
File "/Users/X/work/.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1193, in fetchmany_arrow
results = pyarrow.concat_tables([results, partial_results])
File "pyarrow/table.pxi", line 5962, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
    [goes into the per-field schema differences shown above]
The data is coming through a cursor like this (the loop at the end is the one from the traceback above):

from databricks import sql
from tqdm import tqdm

connection = sql.connect(
    server_hostname="X",
    http_path="B",
    access_token=app_settings.dbx_access_token,
)
cursor = connection.cursor()

max_items = 100000
batch_size = 10000

cursor.execute(
    f"SELECT * from X where creation_date between '2024-06-01' and '2024-09-01' limit {max_items}"
)

# The crash happens while iterating over the cursor, most of the way through.
for r in tqdm(cursor, total=max_items):
    ...
The source table is created through a CTAS statement, so all fields are nullable by default. I have found two ways to resolve the issue:
- either patch `fetchmany_arrow` to pass `promote_options="permissive"`, i.e. `results = pyarrow.concat_tables([results, partial_results], promote_options="permissive")`, so pyarrow can reconcile the two schemas (see the sketch after this list), or
- downgrade to the latest previous major version, 2.9.6.
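For the first option, here is the same synthetic sketch as above showing what permissive promotion does (this needs a pyarrow version that supports `promote_options`, i.e. 14.0 or newer):

```python
import pyarrow as pa

nullable_schema = pa.schema([pa.field("status", pa.string(), nullable=True)])
not_null_schema = pa.schema([pa.field("status", pa.string(), nullable=False)])
t1 = pa.table({"status": ["new", "shipped"]}, schema=nullable_schema)
t2 = pa.table({"status": ["delivered"]}, schema=not_null_schema)

# Permissive promotion unifies the two schemas (the merged field ends up
# nullable) instead of raising ArrowInvalid.
merged = pa.concat_tables([t1, t2], promote_options="permissive")
print(merged.schema)    # status: string
print(merged.num_rows)  # 3
```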
I checked the 2.9.6 source code and it does not appear to use permissive schema casting either, so this looks like a regression.
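Until a fix lands in the library, a possible stop-gap (my own untested sketch, again assuming pyarrow 14.0+; not an official databricks-sql-connector API) is to wrap `pyarrow.concat_tables` before executing the query, so the connector's internal call picks up permissive promotion without editing the installed package:

```python
import pyarrow

_original_concat_tables = pyarrow.concat_tables

def _permissive_concat_tables(tables, *args, **kwargs):
    # Only add the default when the caller did not specify promotion behaviour;
    # the connector calls concat_tables with just the list of tables.
    if not args and "promote" not in kwargs and "promote_options" not in kwargs:
        kwargs["promote_options"] = "permissive"
    return _original_concat_tables(tables, *args, **kwargs)

# Note: this changes pyarrow behaviour process-wide, so treat it as temporary.
pyarrow.concat_tables = _permissive_concat_tables
```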
I'm not sure if I can add anything else beyond that, but do let me know.
And to be clear: I request about 100k records at a time, can iterate through roughly 95k of them, and then it fails, so I'm not sure there is a reliable way to reproduce this.
In case the cluster runtime matters: 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12).
Thanks!