kedro
kedro copied to clipboard
Database Connection Failure on AML clusters using kedro `ThreadRunner`
Description
I have an issue running Kedro with ThreadRunner
to execute the following pipeline:
The primary layer shown in the Kedro Viz above is a series of 21 SQLScriptDataset
objects (a pandas.sql_dataset.SQLQueryDataset
subclass which formats input queries in a special way using parameters in the catalog and then calls super().__init__
).
This Kedro pipeline is triggered as part of a CommandJob
in Azure Machine Learning (AML), using a command_job.py
which runs a Kedro session with something like this:
if __name__ == "__main__":
...
with KedroSession.create() as session:
session.run(...,runner = runner)
Problem/Error
After most or all of the datasets in the primary layer are loaded, SQLAlchemy produces the following error:
sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)
Context
In AML, these jobs can be run on two types of compute: a Compute Instance, which is an Ubuntu VM used for development, and Clusters, which are managed infrastructures that allow for the creation of single/multi-node computes for deployment.
When executing the CommandJob
, essentially running kedro run
with ThreadRunner
on a Cluster, the job fails. However, this issue does not occur when running the same job on a Compute Instance, or when run locally from source using kedro run
.
These command jobs run with the same environment image in both cases.
Steps to Reproduce
- Set up an AML Cluster with the specified environment.
- Execute the
CommandJob
to run the Kedro pipeline withThreadRunner
. - Observe the error.
Expected Result
I would expect the job to run successfully on the cluster, as it does on other compute instances with the same configuration.
Actual Result
The job fails with the following error:
sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)
Attempts to Resolve
- Using a small
max_workers
in the runner configuration. - Used connection parameters for the engine
- Tried different engine parameters, including:
{
"pool_size": 5,
"max_overflow": 10,
"pool_timeout": 120,
"pool_recycle": 1800,
"pool_reset_on_return": null,
"isolation_level": "AUTOCOMMIT",
"pool_pre_ping": true,
"echo_pool": "debug"
}
- Different Oracle (ugh, I know) drivers
- Different versions of
oracledb
andcx-Oracle
no luck.
Logs
Here is a log file of a run with 'echo_pool': 'debug'
and a similar setup, with 5 SQLScriptDataset
as input.
Running in AzureML.log
Your Environment
- Python version: 3.11.9
-
kedro
version: 0.19.6 -
kedro-datasets
version: 3.0.0 -
cx-Oracle
version: 8.3.0 - OS: mcr.microsoft.com/azureml/inference-base-2204:20240531.v1
- Oracle client: oracle/instantclient_21_8
- Azure ML compute type:
Standard_D16_v3
- Oracle DB version: 19