
`databricks.sql.exc.RequestError` when inserting more than 255 cells via `pandas.to_sql`.

mbelak-dtml opened this issue on Dec 11, 2023 · 12 comments

The following example works with databricks-sql-connector version 2.9.3, but fails with version 3.0.1:

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# access_token, host, http_path, catalog, and schema hold the workspace credentials
sqlalchemy_connection_string = f"databricks://token:{access_token}@{host}?http_path={http_path}&catalog={catalog}&schema={schema}"
engine = create_engine(sqlalchemy_connection_string)

shape = (50, 16)  # 800 cells in total
random_df = pd.DataFrame(
    data=np.random.standard_normal(size=shape),
    columns=[f"col{i}" for i in range(shape[1])],
)
table_name = "pandas_sqlalchemy_insert"
random_df.to_sql(table_name, engine, if_exists="replace", index=False, method="multi")

The error is: OperationalError: (databricks.sql.exc.RequestError) Error during request to server

With chunksize in random_df.to_sql set to 15 or lower (so that each INSERT covers fewer than 256 table cells: 15 rows × 16 columns = 240), the insert runs without issue.

What I have tried

  • Instead of random_df.to_sql(table_name, engine, if_exists="replace", index=False, method="multi"), I used an explicit connection:

with engine.connect() as conn:
    random_df.to_sql(table_name, conn, if_exists="replace", index=False, method="multi")

The observed behavior is the same in all cases.

  • Various other DataFrames and chunksizes. Anything under 256 cells works; anything with 256 or more cells fails with the exception described above.

Versions

Not working: databricks-sql-connector==3.0.1, sqlalchemy==2.0.23, pandas==2.1.4
Working: databricks-sql-connector==2.9.3, sqlalchemy==1.4.50, pandas==2.1.4

mbelak-dtml · Dec 11, 2023

Possibly related to how the Databricks SQLAlchemy dialect constructs queries: databricks-sql-connector ~=3 uses bound parameters instead of directly injecting values into the INSERT statement, as version 2.9.3 did?

mbelak-dtml · Dec 11, 2023

Thanks for your question. As of databricks-sql-connector==3.0.0 there is a major difference in how parameters are handled. This change (and the limit of 255 parameters per query) is documented extensively here.

Possibly related to how the Databricks SQLAlchemy dialect constructs queries: databricks-sql-connector ~=3 uses bound parameters instead of directly injecting values into the INSERT statement, as version 2.9.3 did?

This is pretty close. SQLAlchemy hasn't changed its approach, but the connector has. There are now two ways the connector can send parameters: native and inline. Native is enabled by default, and the server currently enforces a limit of 255 parameters per query. The older inline approach doesn't have this limitation, but, as documented here, the SQLAlchemy dialect in connector version 3 and above only works with the native approach.
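
For anyone who can drop down to the raw connector (outside SQLAlchemy), here is a minimal sketch of the two modes, assuming the use_inline_params flag from the connector's parameter documentation and reusing the credential variables from the example above; my_table and threshold are hypothetical names:

from databricks import sql

# Native parameters (the 3.x default): values travel separately from the SQL
# text, so the server's 255-parameters-per-query limit applies.
with sql.connect(server_hostname=host, http_path=http_path, access_token=access_token) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM my_table WHERE col0 > :threshold", {"threshold": 0.5})

# Inline parameters (the pre-3.0 behavior): values are escaped and rendered
# directly into the SQL text, sidestepping the limit, but the SQLAlchemy
# dialect cannot use this mode.
with sql.connect(server_hostname=host, http_path=http_path,
                 access_token=access_token, use_inline_params=True) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM my_table WHERE col0 > %(threshold)s", {"threshold": 0.5})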

Unfortunately, the only workaround is to pass a chunksize to your to_sql call that guarantees no individual request exceeds 255 parameters. We are working with the server team to raise the 255-parameters-per-request limit.
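
Since each inserted cell consumes one parameter, a safe chunksize can be derived from the DataFrame's column count; a minimal sketch built on the example at the top of this thread:

# 255 parameters per query / 16 columns = 15 rows per chunk for the example above
safe_chunksize = max(1, 255 // len(random_df.columns))

random_df.to_sql(
    table_name,
    engine,
    if_exists="replace",
    index=False,
    method="multi",
    chunksize=safe_chunksize,
)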

susodapop · Dec 12, 2023

Thank you for the answer.

We are working with the server team to increase the limit of 255 parameters per request.

Is the limit on the number of parameters a limitation on the Databricks SQL endpoint side? Do you happen to know of any public information, or anything you can share (e.g. an ETA), on this?

mbelak-dtml · Dec 13, 2023

Yes, the limitation is at the SQL warehouse. I'm still waiting to hear internally when it will be increased.

susodapop · Dec 14, 2023

Hi team. I work at Procore Technologies and am running into the same issue. I found that I had to set my particular query to a chunksize of 28 rows, which now makes sense given the 255-parameter limit.

This dramatically slows down the process of writing to Databricks, to the point where our intended use case may not work.

I also haven't seen anywhere in the docs examples of creating a SQLAlchemy engine and using pd.to_sql / read_sql. Our code looks just like the code at the beginning of this thread, and it would be great if using pd.to_sql / read_sql were clearly documented somewhere.

Alternatively, it would be great if Databricks provided optimized versions of these functions with the connector. As an example, Snowflake ships the following functions with its connector: fetch_pandas_all, fetch_pandas_batches, write_pandas.
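
In the meantime, the batched read side can be approximated with plain pandas, which accepts a chunksize in read_sql and then yields DataFrames one batch at a time. A rough sketch against the engine and table from the top of this thread; process is a hypothetical placeholder for per-batch work:

import pandas as pd

# Rough stand-in for Snowflake's fetch_pandas_batches: with a chunksize,
# read_sql returns an iterator of DataFrames rather than one large frame.
for batch in pd.read_sql("SELECT * FROM pandas_sqlalchemy_insert", engine, chunksize=10_000):
    process(batch)  # hypothetical placeholder for per-batch work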

We'd like to use SQLAlchemy 2.0+, which I believe requires databricks-sql-connector 3.0+, so here's a +1 for sorting this out quickly. Thanks!

newlandj · Jan 3, 2024

I also haven't seen anywhere in the docs examples of creating a SQLAlchemy engine and using pd.to_sql / read_sql.

It's part of the sqlalchemy documentation in this repository: https://github.com/databricks/databricks-sql-python/blob/main/src/databricks/sqlalchemy/README.sqlalchemy.md#usage-with-pandas

Alternately, it would be great if Databricks provided an optimized version of these functions with the connector.

Thanks for registering your support for this. Can you send an email to [email protected] so we can follow up with you?

susodapop · Jan 3, 2024

Thanks for the link to the docs! I would recommend linking to that page from the main docs page here, as I assumed the main docs would include everything I needed to know.

I'll shoot an email to that address.

newlandj · Jan 3, 2024

Those docs are being updated as we speak :)

susodapop · Jan 3, 2024

Update for you all: we're re-classifying this as a regression in the latest version of databricks-sql-connector and working to implement a fix that doesn't rely on the server supporting > 255 params. More details soon.

susodapop · Jan 15, 2024

Hi team, any progress on this?

newlandj · Mar 11, 2024

Hi, we're also facing the same issue. Could someone please provide an update on this? @susodapop

akshay-s-ciq · May 2, 2024

Hi @akshay-s-ciq, since I'm no longer maintaining this repository I'd recommend contacting Databricks support directly to get the current status.

susodapop · May 2, 2024