DEPRECATION WARNING: This package is no longer maintained. Databricks now officially maintains a DBAPI package called `databricks-sql-connector <https://pypi.org/project/databricks-sql-connector/>`__ that is compatible with Workspace and SQL Analytics clusters. There is also the newer `sqlalchemy-databricks <https://pypi.org/project/sqlalchemy-databricks/>`__ package, which uses databricks-sql-connector as a driver.
databricks-dbapi
================
|pypi| |pyversions|
.. |pypi| image:: https://img.shields.io/pypi/v/databricks-dbapi.svg
   :target: https://pypi.python.org/pypi/databricks-dbapi

.. |pyversions| image:: https://img.shields.io/pypi/pyversions/databricks-dbapi.svg
   :target: https://pypi.python.org/pypi/databricks-dbapi
A thin wrapper around `pyhive <https://github.com/dropbox/PyHive>`__ and `pyodbc <https://github.com/mkleehammer/pyodbc>`__ for creating a `DBAPI <https://www.python.org/dev/peps/pep-0249/>`__ connection to Databricks Workspace and SQL Analytics clusters. SQL Analytics clusters require the `Simba ODBC driver <https://databricks.com/spark/odbc-driver-download>`__.
Also provides SQLAlchemy Dialects using ``pyhive`` and ``pyodbc`` for Databricks clusters. Databricks SQL Analytics clusters only support the ``pyodbc``-driven dialect.
Installation
------------
Install using pip. You must specify at least one of the extras {``hive`` or ``odbc``}. For ``odbc``, the `Simba driver <https://databricks.com/spark/odbc-driver-download>`__ is required:
.. code-block:: bash
    pip install databricks-dbapi[hive,odbc]
For SQLAlchemy support, install with the ``sqlalchemy`` extra:
.. code-block:: bash
    pip install databricks-dbapi[hive,odbc,sqlalchemy]
Usage
-----

PyHive
~~~~~~
The ``connect()`` function returns a ``pyhive`` Hive connection object, which internally wraps a ``thrift`` connection.
Connecting with ``http_path``, ``host``, and a ``token``:
.. code-block:: python
    import os

    from databricks_dbapi import hive


    token = os.environ["DATABRICKS_TOKEN"]
    host = os.environ["DATABRICKS_HOST"]
    http_path = os.environ["DATABRICKS_HTTP_PATH"]


    connection = hive.connect(
        host=host,
        http_path=http_path,
        token=token,
    )
    cursor = connection.cursor()

    cursor.execute("SELECT * FROM some_table LIMIT 100")

    print(cursor.fetchone())
    print(cursor.fetchall())
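
Because the returned object is a standard ``pyhive`` connection, query parameters can be bound with ``pyhive``'s ``pyformat`` paramstyle rather than interpolated into the SQL string. A minimal sketch reusing the connection from above (``some_table`` and its ``id`` column are hypothetical):

.. code-block:: python

    # pyhive declares the DBAPI "pyformat" paramstyle, so named
    # parameters use %(name)s placeholders.
    cursor = connection.cursor()
    cursor.execute(
        "SELECT * FROM some_table WHERE id = %(id)s LIMIT 100",
        {"id": 42},
    )
    print(cursor.fetchall())
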
The ``pyhive`` connection also provides async functionality:
.. code-block:: python
    import os

    from databricks_dbapi import hive
    from TCLIService.ttypes import TOperationState


    token = os.environ["DATABRICKS_TOKEN"]
    host = os.environ["DATABRICKS_HOST"]
    cluster = os.environ["DATABRICKS_CLUSTER"]


    connection = hive.connect(
        host=host,
        cluster=cluster,
        token=token,
    )
    cursor = connection.cursor()

    cursor.execute("SELECT * FROM some_table LIMIT 100", async_=True)

    status = cursor.poll().operationState
    while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
        logs = cursor.fetch_logs()
        for message in logs:
            print(message)

        # If needed, an asynchronous query can be cancelled at any time with:
        # cursor.cancel()

        status = cursor.poll().operationState

    print(cursor.fetchall())
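
As with any DBAPI connection, it is good practice to close the cursor and connection when finished. For example:

.. code-block:: python

    # Standard DBAPI (PEP 249) cleanup; both pyhive cursors and
    # connections expose close().
    cursor.close()
    connection.close()
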
ODBC
~~~~
The ODBC DBAPI requires the Simba ODBC driver.
Connecting with ``http_path``, ``host``, and a ``token``:
.. code-block:: python
    import os

    from databricks_dbapi import odbc


    token = os.environ["DATABRICKS_TOKEN"]
    host = os.environ["DATABRICKS_HOST"]
    http_path = os.environ["DATABRICKS_HTTP_PATH"]


    connection = odbc.connect(
        host=host,
        http_path=http_path,
        token=token,
        driver_path="/path/to/simba/driver",
    )
    cursor = connection.cursor()

    cursor.execute("SELECT * FROM some_table LIMIT 100")

    print(cursor.fetchone())
    print(cursor.fetchall())
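
The returned object is a regular ``pyodbc`` connection, so parameters are bound positionally with ``pyodbc``'s ``qmark`` style. A minimal sketch reusing the connection from above (the table and column are hypothetical):

.. code-block:: python

    # pyodbc uses the DBAPI "qmark" paramstyle: positional "?" placeholders.
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM some_table WHERE id = ?", 42)
    print(cursor.fetchall())
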
SQLAlchemy Dialects
-------------------
databricks+pyhive
~~~~~~~~~~~~~~~~~

Installing registers the ``databricks+pyhive`` dialect/driver with SQLAlchemy. Fill in the required information when passing the engine URL.
.. code-block:: python
    from sqlalchemy import *
    from sqlalchemy.engine import create_engine
    from sqlalchemy.schema import *


    engine = create_engine(
        "databricks+pyhive://token:<databricks_token>@<host>:<port>/<database>",
        connect_args={"http_path": "<cluster_http_path>"}
    )

    logs = Table("my_table", MetaData(bind=engine), autoload=True)
    print(select([func.count("*")], from_obj=logs).scalar())
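
The engine can also run raw SQL directly. A minimal sketch using the SQLAlchemy 1.x engine API that the example above assumes (``my_table`` is hypothetical):

.. code-block:: python

    # Engine-level execution (SQLAlchemy 1.x style, matching the
    # bound-metadata usage shown above).
    result = engine.execute("SELECT COUNT(*) FROM my_table")
    print(result.scalar())
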
databricks+pyodbc
~~~~~~~~~~~~~~~~~
Installing registers the ``databricks+pyodbc`` dialect/driver with SQLAlchemy. Fill in the required information when passing the engine URL.
.. code-block:: python
    from sqlalchemy import *
    from sqlalchemy.engine import create_engine
    from sqlalchemy.schema import *


    engine = create_engine(
        "databricks+pyodbc://token:<databricks_token>@<host>:<port>/<database>",
        connect_args={"http_path": "<cluster_http_path>", "driver_path": "/path/to/simba/driver"}
    )

    logs = Table("my_table", MetaData(bind=engine), autoload=True)
    print(select([func.count("*")], from_obj=logs).scalar())
Refer to the following documentation for more details on hostname, cluster name, and http path:
* `Databricks <https://docs.databricks.com/user-guide/bi/jdbc-odbc-bi.html>`__
* `Azure Databricks <https://docs.azuredatabricks.net/user-guide/bi/jdbc-odbc-bi.html>`__
Related
-------
* `pyhive <https://github.com/dropbox/PyHive>`__
* `thrift <https://github.com/apache/thrift/tree/master/lib/py>`__
* `pyodbc <https://github.com/mkleehammer/pyodbc>`__