DB engine for pandas: sql.connect or sqlalchemy

Open: rth opened this issue 1 year ago • 1 comment

Hello,

I was wondering what's the best practice for using this package with pandas.

It's possible to create a connection with databricks.sql.connect and pass it to pandas.read_sql. This works, but it raises the following warning:

UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
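For anyone who stays on the DBAPI path, here is a minimal sketch of silencing only this warning around the call. The helper name `read_sql_without_dbapi_warning` and its `read_sql` parameter are hypothetical (the parameter exists only so the helper can be exercised without pandas). This only hides the message and says nothing about whether the path is tested or fast.

```python
import warnings


def read_sql_without_dbapi_warning(query, connection, read_sql=None):
    """Run read_sql(query, connection), ignoring pandas' DBAPI2 UserWarning.

    `read_sql` defaults to pandas.read_sql; it is injectable only for testing.
    """
    if read_sql is None:
        import pandas as pd  # imported lazily so the helper loads without pandas

        read_sql = pd.read_sql

    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the warning text.
        warnings.filterwarnings(
            "ignore",
            message="pandas only supports SQLAlchemy",
            category=UserWarning,
        )
        return read_sql(query, connection)


# Usage (assumes databricks-sql-connector is installed; values are placeholders):
#   from databricks import sql
#   with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
#       df = read_sql_without_dbapi_warning("SELECT 1", conn)
```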

Alternatively, it's possible to create a SQLAlchemy engine from a databricks:// URL and pass that to pandas. Doesn't that add an extra serialization step, though, and hurt performance?
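To make the SQLAlchemy route concrete, here is a rough sketch. The URL layout (`token:<access token>@<host>?http_path=...&catalog=...&schema=...`) is my assumption based on the connector's SQLAlchemy documentation and may differ between versions. Both helper names are hypothetical.

```python
from urllib.parse import quote, urlencode


def build_databricks_url(access_token, server_hostname, http_path, catalog, schema):
    """Build a databricks:// URL (format assumed, see note above)."""
    # Keep "/" unescaped so http_path stays readable; other special characters are escaped.
    query = urlencode(
        {"http_path": http_path, "catalog": catalog, "schema": schema},
        safe="/",
    )
    return f"databricks://token:{quote(access_token, safe='')}@{server_hostname}?{query}"


def read_sql_via_sqlalchemy(query, url):
    """Run a query through a SQLAlchemy engine and return a pandas DataFrame."""
    # Imported lazily so the URL helper works without these packages installed.
    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine(url)
    with engine.connect() as connection:
        # text() wraps the raw SQL string in a SQLAlchemy clause object.
        return pd.read_sql(text(query), connection)
```

Passing an engine connection like this avoids the UserWarning, but I don't know whether it goes through the same fetch path as the plain connector.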

What's the recommended way, in particular regarding performance? Would both use CloudFetch for larger queries? I see that some pandas-related fixes and improvements have been made in PRs, so which API should be used to benefit from those?

Thanks!

cc @kravets-levko

Nov 27 '24 13:11 rth

Unless one is supposed to use fetchall_arrow and convert the resulting PyArrow table to pandas? An example would be good (this was also requested in https://github.com/databricks/databricks-sql-python/issues/21)
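Something along these lines is what I have in mind for the Arrow route. It is only a sketch: `query_to_dataframe` is a hypothetical helper, not part of the package, and it assumes the cursor can be used as a context manager and that `fetchall_arrow()` returns a `pyarrow.Table`.

```python
def query_to_dataframe(connection, query):
    """Execute `query` and convert the Arrow result to a pandas DataFrame."""
    with connection.cursor() as cursor:
        cursor.execute(query)
        # Fetch the full result set as a PyArrow table.
        arrow_table = cursor.fetchall_arrow()
    # pyarrow.Table.to_pandas() builds the DataFrame from the Arrow columns.
    return arrow_table.to_pandas()


# Usage (assumes databricks-sql-connector and pyarrow are installed; values are placeholders):
#   from databricks import sql
#   with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
#       df = query_to_dataframe(conn, "SELECT * FROM some_table LIMIT 10")
```

This bypasses pandas.read_sql entirely, so the UserWarning does not come up.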

Edit: Or actually, a utility function would be even better, as proposed in https://github.com/databricks/databricks-sql-python/pull/134

Nov 29 '24 10:11 rth