databricks-sql-python

Unpin `pandas`

Open dhirschfeld opened this issue 1 year ago • 4 comments

I would like to be able to use this library with the latest pandas version. Currently pandas is pinned to <2.2.0: https://github.com/databricks/databricks-sql-python/blob/05529900858d40add7bc9b7e4a8864921680cfa2/pyproject.toml#L14-L16

It would be good to remove this restriction.

dhirschfeld avatar Feb 01 '24 23:02 dhirschfeld

The pin was added in:

  • https://github.com/databricks/databricks-sql-python/pull/330

To fix the issue described in:

  • https://github.com/databricks/databricks-sql-python/issues/326

...but that just avoids the problem whilst causing another problem; this library can't be used with the latest pandas :/

dhirschfeld avatar Feb 01 '24 23:02 dhirschfeld

I'm opening this issue to track any progress towards compatibility with the latest pandas version.

dhirschfeld avatar Feb 01 '24 23:02 dhirschfeld

Bump! I would like to upgrade to the latest version but am stuck on 3.0.1 because of this pin 😔

dhirschfeld avatar Feb 23 '24 04:02 dhirschfeld

Does 3.0.1 work with latest pandas? That would be an interesting data point.

benc-db avatar Mar 27 '24 18:03 benc-db

> Does 3.0.1 work with latest pandas? That would be an interesting data point.

I've been using 3.0.1 in combination with pandas 2.2.2 with no issues:

❯ pip list | rg 'pandas|databricks'
databricks-connect              14.3.1
databricks-sdk                  0.20.0
databricks-sql-connector        3.0.1
pandas                          2.2.2

...but that's apparently only because none of my queries return all-integer results. Running:

with engine.connect() as conn:
    res = conn.execute(sa.text("select 1")).scalar_one()

gives:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

dhirschfeld avatar May 29 '24 06:05 dhirschfeld

It seems like it doesn't like assigning a None into an integer array:

> /opt/python/envs/dev310/lib/python3.10/site-packages/pandas/core/internals/managers.py(1703)as_array()
   1701             pass
   1702         else:
-> 1703             arr[isna(arr)] = na_value
   1704 
   1705         return arr.transpose()

ipdb>  arr
array([[1]], dtype=int32)

ipdb>  isna(arr)
array([[False]])

ipdb>  na_value

ipdb>  na_value is None
True
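The failing line can be reproduced with NumPy alone, which suggests the error comes from NumPy converting the assigned value to the array's dtype even when the mask selects nothing. This is a minimal sketch under that assumption; `try_masked_assign` is a hypothetical helper name, not part of pandas or this library:

```python
import numpy as np


def try_masked_assign(arr, value):
    """Attempt arr[mask] = value with an all-False mask.

    Returns the name of the exception raised, or None on success.
    """
    mask = np.zeros(arr.shape, dtype=bool)  # selects no elements at all
    try:
        arr[mask] = value
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


int_arr = np.array([[1]], dtype=np.int32)

# The value is converted to int32 before assignment, so None and NaN fail.
print(try_masked_assign(int_arr.copy(), None))
print(try_masked_assign(int_arr.copy(), float("nan")))

# An integer value, or an object-dtype array, is accepted.
print(try_masked_assign(int_arr.copy(), -99))
print(try_masked_assign(int_arr.astype(object), None))
```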

If we go up the stack, we can see that `to_numpy` raises an error for any `na_value` that isn't an integer:

> /opt/python/envs/dev310/lib/python3.10/site-packages/databricks/sql/client.py(1149)_convert_arrow_table()
   1147         )
   1148 
-> 1149         res = df.to_numpy(na_value=None)
   1150         return [ResultRow(*v) for v in res]
   1151 

ipdb>  df.to_numpy(na_value=None)
*** TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

ipdb>  df.to_numpy(na_value=float('NaN'))
*** ValueError: cannot convert float NaN to integer

ipdb>  df.to_numpy(na_value=-99)
array([[1]], dtype=int32)

Casting to object before assigning does seem to work:

ipdb>  df.astype(object).to_numpy(na_value=None)
array([[1]], dtype=object)

dhirschfeld avatar May 29 '24 06:05 dhirschfeld

The problematic function: https://github.com/databricks/databricks-sql-python/blob/a6e9b11131871de8b673e3072c5b64498df68217/src/databricks/sql/client.py#L1130-L1166

dhirschfeld avatar May 29 '24 06:05 dhirschfeld

I can work around the issue by disabling pandas:

with engine.connect() as conn:
    cursor = conn.connection.cursor()
    cursor.connection.disable_pandas = True
    res = cursor.execute("select 1").fetchall()
>>> res
[Row(1=1)]

...but obviously the casting to numpy needs to be fixed.

dhirschfeld avatar May 29 '24 06:05 dhirschfeld

Probably casting to object before assigning a None value is the right fix.
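As a rough sketch of that idea (not the library's actual code), the conversion could cast to object first so that `None` can be stored in any column. `rows_with_none` is a hypothetical helper name, and plain tuples stand in for the library's `ResultRow`:

```python
import numpy as np
import pandas as pd


def rows_with_none(df):
    """Hypothetical helper: convert a DataFrame to row tuples, nulls as None."""
    # Casting to object first means None can be assigned into every column,
    # including columns that were originally integer typed.
    values = df.astype(object).to_numpy(na_value=None)
    return [tuple(row) for row in values]


# An int32 column (the failing case above) next to a float column with a null.
df = pd.DataFrame(
    {
        "n": np.array([1, 2], dtype="int32"),
        "x": [0.5, float("nan")],
    }
)
print(rows_with_none(df))
```

The trade-off is that every column becomes an object array, which is slower than a typed array, but the values are about to be unpacked into Python row objects anyway.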

dhirschfeld avatar May 29 '24 07:05 dhirschfeld

I second this. I cannot use pd.read_sql_query() because of this requirement.

Also, it would be good to remove the distutils dependency.

diego-jd avatar May 29 '24 18:05 diego-jd

@dhirschfeld any idea when this is going to make it into a release? It looks like it didn't go into 3.2.0, as I am unable to `poetry install` databricks-sql-connector in a project that includes pandas 2.2.2.

Aryik avatar Jul 15 '24 23:07 Aryik

I'm not a maintainer here so I couldn't say.

I was hoping to do some more testing at some point, but haven't found the time.

dhirschfeld avatar Jul 16 '24 00:07 dhirschfeld