databricks-sql-python
High memory use / process hangs indefinitely when token is invalid
See https://github.com/databricks/dbt-databricks/issues/388
It looks like the high memory use is due to this library.
Reproduction
Setup env:
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install memory_profiler matplotlib databricks-sql-connector
Use the sample script exactly as in the quickstart (https://github.com/databricks/databricks-sql-python#quickstart). Importantly, make sure the access token is not valid, so export something random like export DATABRICKS_TOKEN=whatever.
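For reference, the quickstart script boils down to roughly the following (a minimal sketch; the env-var names match the export above and the README, but check the current quickstart for the exact code):

import os
from databricks import sql

# Credentials come from the environment; with DATABRICKS_TOKEN=whatever the token is invalid.
connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN"))

cursor = connection.cursor()
cursor.execute("SELECT * FROM RANGE(10)")
for row in cursor.fetchall():
    print(row)

cursor.close()
connection.close()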
$ mprof run python quickstart.py
$ mprof plot
Using @jeremyyeo's repro steps, I found a large memory allocation (2065851762 bytes, roughly 2 GB) at this call site:
/root/memray-env/lib/python3.9/site-packages/dbt/adapters/base/connections.py(241)retry_connection()
-> connection.handle = connect()
/root/memray-env/lib/python3.9/site-packages/dbt/adapters/databricks/connections.py(740)connect()
-> conn: DatabricksSQLConnection = dbsql.connect(
/root/memray-env/lib/python3.9/site-packages/databricks/sql/__init__.py(50)connect()
-> return Connection(server_hostname, http_path, access_token, **kwargs)
/root/memray-env/lib/python3.9/site-packages/databricks/sql/client.py(189)__init__()
-> self._session_handle = self.thrift_backend.open_session(
/root/memray-env/lib/python3.9/site-packages/databricks/sql/thrift_backend.py(506)open_session()
-> response = self.make_request(self._client.OpenSession, open_session_req)
/root/memray-env/lib/python3.9/site-packages/databricks/sql/thrift_backend.py(423)make_request()
-> response_or_error_info = attempt_request(attempt)
/root/memray-env/lib/python3.9/site-packages/databricks/sql/thrift_backend.py(341)attempt_request()
-> response = method(request)
/root/memray-env/lib/python3.9/site-packages/databricks/sql/thrift_api/TCLIService/TCLIService.py(213)OpenSession()
-> return self.recv_OpenSession()
/root/memray-env/lib/python3.9/site-packages/databricks/sql/thrift_api/TCLIService/TCLIService.py(225)recv_OpenSession()
-> (fname, mtype, rseqid) = iprot.readMessageBegin()
/root/memray-env/lib/python3.9/site-packages/thrift/protocol/TBinaryProtocol.py(148)readMessageBegin()
-> name = self.trans.readAll(sz)
/root/memray-env/lib/python3.9/site-packages/thrift/transport/TTransport.py(62)readAll()
-> chunk = self.read(sz - have)
/root/memray-env/lib/python3.9/site-packages/databricks/sql/auth/thrift_http_client.py(123)read()
-> return self.__resp.read(sz)
/root/memray-env/lib/python3.9/site-packages/urllib3/response.py(567)read()
-> data = self._fp_read(amt) if not fp_closed else b""
/root/memray-env/lib/python3.9/site-packages/urllib3/response.py(533)_fp_read()
-> return self._fp.read(amt) if amt is not None else self._fp.read()
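The venv path above suggests this trace came from memray. For anyone who wants to capture a similar trace, memray's standard CLI can be run against the quickstart script, roughly like this (the output file name is arbitrary):

$ pip install memray
$ memray run -o quickstart.bin quickstart.py
$ memray flamegraph quickstart.bin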
Unable to replicate this on either databricks-sql-python 2.5.2 (what was current when this issue was created) or the latest version (2.9.3). Running the quickstart code with an invalid token errors out quickly as you'd expect and uses a normal amount of memory. It's possible the issue is with another library or with Databricks itself.
Here's my pip freeze output, followed by a screenshot of memory usage and the modified quickstart code I'm running. Note that the server hostname and HTTP path were correct but the access token was deliberately wrong. Running on macOS Big Sur 11.6.4, Intel chip.
alembic==1.12.0
certifi==2023.7.22
charset-normalizer==3.2.0
contourpy==1.1.0
cycler==0.11.0
databricks-sql-connector==2.5.2
et-xmlfile==1.1.0
fonttools==4.42.1
greenlet==2.0.2
idna==3.4
kiwisolver==1.4.5
lz4==4.3.2
Mako==1.2.4
MarkupSafe==2.1.3
matplotlib==3.7.2
memory-profiler==0.61.0
numpy==1.25.2
oauthlib==3.2.2
openpyxl==3.1.2
packaging==23.1
pandas==1.5.3
Pillow==10.0.0
psutil==5.9.5
pyarrow==13.0.0
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2023.3
requests==2.31.0
six==1.16.0
SQLAlchemy==1.4.49
thrift==0.16.0
typing_extensions==4.7.1
tzdata==2023.3
urllib3==2.0.4
import os
from databricks import sql
DATABRICKS_SERVER_HOSTNAME="dbc-<my-host>.cloud.databricks.com"
DATABRICKS_HTTP_PATH="/sql/1.0/endpoints/<my-path>"
DATABRICKS_ACCESS_TOKEN="dapiWRONGTOKEN" # wrong
host = DATABRICKS_SERVER_HOSTNAME
http_path = DATABRICKS_HTTP_PATH
access_token = DATABRICKS_ACCESS_TOKEN
connection = sql.connect(
    server_hostname=host,
    http_path=http_path,
    access_token=access_token)
cursor = connection.cursor()
cursor.execute('SELECT * FROM RANGE(10)')
result = cursor.fetchall()
for row in result:
    print(row)
cursor.close()
connection.close()
I'm seeing similar behavior with an invalid token, but my connection just hangs indefinitely. The memory does not continue increasing, though. I tested with @WilliamGentry's script. I'm on macOS Ventura 13.6 (Intel), but I have also seen this happen to a colleague on an M2 Mac.
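For anyone who wants the test to fail fast instead of killing the process by hand, a plain-stdlib guard can bound the connect call (a Unix-only sketch; the 30-second deadline, handler, and placeholder credentials are illustrative, not connector options):

import signal
from databricks import sql

def _on_timeout(signum, frame):
    raise TimeoutError("sql.connect() still blocked after 30s")

signal.signal(signal.SIGALRM, _on_timeout)
signal.alarm(30)  # raise in the main thread if connect() has not returned by then
try:
    connection = sql.connect(
        server_hostname="dbc-<my-host>.cloud.databricks.com",
        http_path="/sql/1.0/endpoints/<my-path>",
        access_token="dapiWRONGTOKEN")  # deliberately invalid token
finally:
    signal.alarm(0)  # cancel the alarm once connect() returns or raises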
certifi==2023.11.17
charset-normalizer==3.3.2
contourpy==1.2.0
cycler==0.12.1
databricks-sql-connector==3.0.1
et-xmlfile==1.1.0
fonttools==4.47.0
idna==3.6
importlib-resources==6.1.1
kiwisolver==1.4.5
lz4==4.3.2
matplotlib==3.8.2
memory-profiler==0.61.0
numpy==1.26.2
oauthlib==3.2.2
openpyxl==3.1.2
packaging==23.2
pandas==2.1.4
Pillow==10.1.0
psutil==5.9.7
pyarrow==14.0.2
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
requests==2.31.0
six==1.16.0
thrift==0.16.0
tzdata==2023.3
urllib3==2.1.0
zipp==3.17.0
Just retested this and am indeed experiencing what @dbkegley reported: an invalid token now simply stalls indefinitely.
I had to CTRL+C to kill the process.
On Intel Mac (macOS 13.4.1):
$ python --version
Python 3.10.10
$ pip freeze
certifi==2023.11.17
charset-normalizer==3.3.2
contourpy==1.2.0
cycler==0.12.1
databricks-sql-connector==3.0.1
et-xmlfile==1.1.0
fonttools==4.47.2
idna==3.6
kiwisolver==1.4.5
lz4==4.3.3
matplotlib==3.8.2
memory-profiler==0.61.0
numpy==1.26.3
oauthlib==3.2.2
openpyxl==3.1.2
packaging==23.2
pandas==2.1.4
pillow==10.2.0
psutil==5.9.7
pyarrow==14.0.2
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
requests==2.31.0
six==1.16.0
thrift==0.16.0
tzdata==2023.4
urllib3==2.1.0
Additionally, I tested without installing memory_profiler or matplotlib (so just pip install databricks-sql-connector) in a new virtual env to rule out any issues with those two libraries (or their dependencies). The indefinite hang remains with just these:
$ pip freeze
certifi==2023.11.17
charset-normalizer==3.3.2
databricks-sql-connector==3.0.1
et-xmlfile==1.1.0
idna==3.6
lz4==4.3.3
numpy==1.26.3
oauthlib==3.2.2
openpyxl==3.1.2
pandas==2.1.4
pyarrow==14.0.2
python-dateutil==2.8.2
pytz==2023.3.post1
requests==2.31.0
six==1.16.0
thrift==0.16.0
tzdata==2023.4
urllib3==2.1.0