ResultSet `fetchmany_arrow`/`fetchall_arrow` methods fail during `concat_tables`
Hi there,
I'm using this client library to fetch large amounts of data from our DBX environment. The version I'm using is 3.3.0.
The library keeps crashing when it attempts to concatenate the current and partial Arrow results. I cannot attach the full trace because it contains some of our internal schemas, but here is the gist of it:
creation_date
    First Schema:  creation_date: timestamp[us, tz=Etc/UTC]
    Second Schema: creation_date: timestamp[us, tz=Etc/UTC] not null

status
    First Schema:  status: string
    Second Schema: status: string not null

sender_id
    First Schema:  sender_id: string
    Second Schema: sender_id: string not null

...

and a few other fields with the exact same nullable vs. not null discrepancy.
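To make the failure mode concrete, here is a minimal standalone pyarrow sketch (the `status` column from the diff above, but with synthetic values, not our real data) showing that a nullability-only schema difference is enough to make `concat_tables` raise:

```python
import pyarrow as pa

# Two batches whose schemas differ only in the nullable flag, mirroring the
# "First Schema" / "Second Schema" diff above.
nullable_schema = pa.schema([pa.field("status", pa.string(), nullable=True)])
not_null_schema = pa.schema([pa.field("status", pa.string(), nullable=False)])

t1 = pa.table({"status": ["new", "shipped"]}, schema=nullable_schema)
t2 = pa.table({"status": ["delivered"]}, schema=not_null_schema)

# Default promotion requires identical schemas, so this raises
# pyarrow.lib.ArrowInvalid: Schema at index 1 was different: ...
pa.concat_tables([t1, t2])
```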
The exact stack trace is:
Traceback (most recent call last):
File "/Users/X/work/scripts/raw_order.py", line 34, in <module>
for r in tqdm(cursor, total=max_items):
File "/Users/X/work/.venv/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/Users/X/work//.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 422, in __iter__
for row in self.active_result_set:
File "/Users/X/work//.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1112, in __iter__
row = self.fetchone()
File "/Users/X/work/.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1217, in fetchone
res = self._convert_arrow_table(self.fetchmany_arrow(1))
File "/Users/X/work/.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1193, in fetchmany_arrow
results = pyarrow.concat_tables([results, partial_results])
File "pyarrow/table.pxi", line 5962, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
    [goes into the per-field schema differences shown above]
The data is coming through a cursor like this (the loop at the end is the one from the traceback above):

from databricks import sql
from tqdm import tqdm

connection = sql.connect(
    server_hostname="X",
    http_path="B",
    access_token=app_settings.dbx_access_token,
)
cursor = connection.cursor()

max_items = 100000
batch_size = 10000

cursor.execute(
    f"SELECT * from X where creation_date between '2024-06-01' and '2024-09-01' limit {max_items}"
)

# The crash happens while iterating over the cursor, most of the way through.
for r in tqdm(cursor, total=max_items):
    ...
The source table is created through a CTAS statement, so all fields are nullable by default. I have found two ways to resolve the issue:
- either patch `fetchmany_arrow` to pass `promote_options="permissive"`, i.e. `results = pyarrow.concat_tables([results, partial_results], promote_options="permissive")`, so pyarrow can reconcile the two schemas (see the sketch after this list), or
- downgrade to the latest previous major version, 2.9.6.
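For the first option, here is the same synthetic sketch as above showing what permissive promotion does (this needs a pyarrow version that supports `promote_options`, i.e. 14.0 or newer):

```python
import pyarrow as pa

nullable_schema = pa.schema([pa.field("status", pa.string(), nullable=True)])
not_null_schema = pa.schema([pa.field("status", pa.string(), nullable=False)])
t1 = pa.table({"status": ["new", "shipped"]}, schema=nullable_schema)
t2 = pa.table({"status": ["delivered"]}, schema=not_null_schema)

# Permissive promotion unifies the two schemas (the merged field ends up
# nullable) instead of raising ArrowInvalid.
merged = pa.concat_tables([t1, t2], promote_options="permissive")
print(merged.schema)    # status: string
print(merged.num_rows)  # 3
```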
I checked the 2.9.6 source code and it does not appear to use permissive schema casting either, so this looks like a regression.
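Until a fix lands in the library, a possible stop-gap (my own untested sketch, again assuming pyarrow 14.0+; not an official databricks-sql-connector API) is to wrap `pyarrow.concat_tables` before executing the query, so the connector's internal call picks up permissive promotion without editing the installed package:

```python
import pyarrow

_original_concat_tables = pyarrow.concat_tables

def _permissive_concat_tables(tables, *args, **kwargs):
    # Only add the default when the caller did not specify promotion behaviour;
    # the connector calls concat_tables with just the list of tables.
    if not args and "promote" not in kwargs and "promote_options" not in kwargs:
        kwargs["promote_options"] = "permissive"
    return _original_concat_tables(tables, *args, **kwargs)

# Note: this changes pyarrow behaviour process-wide, so treat it as temporary.
pyarrow.concat_tables = _permissive_concat_tables
```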
I'm not sure if I can add anything else beyond that, but do let me know.
And to be clear: I request about 100k records at a time, can iterate through roughly 95k of them, and then it fails, so I'm not sure there is a reliable way to reproduce this.
In case the cluster runtime matters: 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12).
Thanks!