Unpin `pandas`
I would like to be able to use this library with the latest pandas version. Currently pandas is pinned to <2.2.0:
https://github.com/databricks/databricks-sql-python/blob/05529900858d40add7bc9b7e4a8864921680cfa2/pyproject.toml#L14-L16
It would be good to remove this restriction.
The pin was added in:
- https://github.com/databricks/databricks-sql-python/pull/330
To fix the issue described in:
- https://github.com/databricks/databricks-sql-python/issues/326
...but that just trades one problem for another: this library can't be used with the latest pandas :/
I'm opening this issue to track any progress towards compatibility with the latest pandas version.
Bump! I would like to upgrade to the latest version but am stuck on 3.0.1 because of this pin 😔
Does 3.0.1 work with latest pandas? That would be an interesting data point.
I've been using 3.0.1 in combination with pandas 2.2.2 with no issues:
❯ pip list | rg 'pandas|databricks'
databricks-connect 14.3.1
databricks-sdk 0.20.0
databricks-sql-connector 3.0.1
pandas 2.2.2
...but that's apparently only because my queries hadn't been returning integer columns.
Running:
with engine.connect() as conn:
    res = conn.execute(sa.text("select 1")).scalar_one()
gives:
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
It seems like it doesn't like assigning a None into an integer array:
> /opt/python/envs/dev310/lib/python3.10/site-packages/pandas/core/internals/managers.py(1703)as_array()
1701 pass
1702 else:
-> 1703 arr[isna(arr)] = na_value
1704
1705 return arr.transpose()
ipdb> arr
array([[1]], dtype=int32)
ipdb> isna(arr)
array([[False]])
ipdb> na_value
ipdb> na_value is None
True
If we go up the stack we can see we get type errors if we try to assign anything other than an integer:
> /opt/python/envs/dev310/lib/python3.10/site-packages/databricks/sql/client.py(1149)_convert_arrow_table()
1147 )
1148
-> 1149 res = df.to_numpy(na_value=None)
1150 return [ResultRow(*v) for v in res]
1151
ipdb> df.to_numpy(na_value=None)
*** TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
ipdb> df.to_numpy(na_value=float('NaN'))
*** ValueError: cannot convert float NaN to integer
ipdb> df.to_numpy(na_value=-99)
array([[1]], dtype=int32)
Casting to object before assigning does seem to work:
ipdb> df.astype(object).to_numpy(na_value=None)
array([[1]], dtype=object)
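The underlying behaviour can be reproduced with numpy alone, independent of databricks-sql-connector: assigning None through a boolean mask into an integer array fails even when the mask selects no elements, whereas an object array accepts it. A minimal sketch:

```python
import numpy as np

arr = np.array([[1]], dtype=np.int32)
mask = np.array([[False]])  # nothing is actually missing

# numpy validates the fill value against the dtype even though the
# mask selects zero elements, so this raises TypeError -- the same
# error pandas hits in as_array() at `arr[isna(arr)] = na_value`:
try:
    arr[mask] = None
except TypeError as e:
    print(f"int32 array: {e}")

# an object array happily stores None:
obj = arr.astype(object)
obj[mask] = None
print(obj)  # [[1]]
```

This is why the failure shows up even for a query like `select 1` that returns no NULLs at all.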
The problematic function: https://github.com/databricks/databricks-sql-python/blob/a6e9b11131871de8b673e3072c5b64498df68217/src/databricks/sql/client.py#L1130-L1166
I can work around the issue by disabling pandas:
with engine.connect() as conn:
    cursor = conn.connection.cursor()
    cursor.connection.disable_pandas = True
    res = cursor.execute("select 1").fetchall()
>>> res
[Row(1=1)]
...but obviously the conversion to numpy needs to be fixed.
Probably casting to object before assigning a None value is the right fix.
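Applied to `_convert_arrow_table`, that fix would look something like the sketch below. This is a hypothetical standalone version, not the actual patch; `dataframe_to_rows` is a made-up name, and it returns plain tuples where the library builds `ResultRow` objects:

```python
import pandas as pd

def dataframe_to_rows(df: pd.DataFrame):
    """Sketch of the proposed fix: cast every column to object before
    calling to_numpy, so None can be assigned into any dtype
    (integer columns included)."""
    res = df.astype(object).to_numpy(na_value=None)
    return [tuple(v) for v in res]

# nullable-int frame: conversion succeeds, missing values come back as None
df = pd.DataFrame({"a": pd.array([1, None], dtype="Int64")})
print(dataframe_to_rows(df))
```

The `astype(object)` costs an extra copy, but it makes the `na_value=None` substitution safe for every column dtype instead of only the ones numpy can hold a None in.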
I second this. I cannot use pd.read_sql_query() because of this version constraint.
Also, it would be good to drop the distutils dependency.
@dhirschfeld any idea when this is going to make it into a release? It looks like it didn't go into 3.2.0, as I am still unable to install databricks-sql-connector with Poetry in a project that includes pandas 2.2.2.
I'm not a maintainer here so I couldn't say.
I was hoping to do some more testing at some point, but haven't found the time.