databricks-sql-python
DB engine for pandas: sql.connect or sqlalchemy
Hello,
I was wondering what the best practice is for using this package with pandas.
- It's possible to create a connection with `databricks.sql.connect` and pass it to `pandas.read_sql`. This works, but it raises `UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.`
- Alternatively, it's possible to use SQLAlchemy with a `databricks://` URL and pass that to pandas. Doesn't that mean an extra serialization step, performance-wise?
What's the recommended way, in particular regarding performance? Would both use CloudFetch for larger queries? I see that some pandas-related fixes and improvements have landed in PRs, so which API should be used to benefit from those?
Thanks!
cc @kravets-levko
Unless one is supposed to use `fetchall_arrow` and convert the resulting PyArrow table to pandas? An example would be good (also in https://github.com/databricks/databricks-sql-python/issues/21)
Edit: Or actually some util function would be even better as proposed in https://github.com/databricks/databricks-sql-python/pull/134
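To make the `fetchall_arrow` idea concrete, here is a minimal sketch of the kind of helper I have in mind. The name `read_sql_arrow` is hypothetical and not part of the package; it assumes `connection` comes from `databricks.sql.connect` and that pandas and PyArrow are installed. I have not measured how it compares to the other two options.

```python
def read_sql_arrow(connection, query):
    """Run `query` and return the result as a pandas DataFrame via Arrow.

    Hypothetical helper, not part of databricks-sql-python. `connection`
    is assumed to be the object returned by databricks.sql.connect.
    """
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        # fetchall_arrow() returns a pyarrow.Table; to_pandas() converts it.
        return cursor.fetchall_arrow().to_pandas()
    finally:
        # Release the cursor even if the query or conversion fails.
        cursor.close()


# Usage (the "..." values are placeholders to fill in):
#
# from databricks import sql
#
# with sql.connect(server_hostname="...", http_path="...",
#                  access_token="...") as connection:
#     df = read_sql_arrow(connection, "SELECT 1 AS x")
```

This skips `pandas.read_sql` entirely, so the `UserWarning` above does not come up, but whether it is the intended path is exactly what I'm asking.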