kedro Database Connection Failure on AML clusters using kedro `ThreadRunner`

Database Connection Failure on AML clusters using kedro `ThreadRunner`

Open gitgud5000 opened this issue 8 months ago • 4 comments

Description

I have an issue running Kedro with ThreadRunner to execute the following pipeline: Pasted image 20240613004553

The primary layer shown in the Kedro Viz above is a series of 21 SQLScriptDataset objects (a pandas.sql_dataset.SQLQueryDataset subclass which formats input queries in a special way using parameters in the catalog and then calls super().__init__).

This Kedro pipeline is triggered as part of a CommandJob in Azure Machine Learning (AML), using a command_job.py which runs a Kedro session with something like this:

if __name__ == "__main__":
...
with KedroSession.create() as session:
	session.run(...,runner = runner)

Problem/Error

After most or all of the datasets in the primary layer are loaded, SQLAlchemy produces the following error:

sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)

Context

In AML, these jobs can be run on two types of compute: a Compute Instance, which is an Ubuntu VM used for development, and Clusters, which are managed infrastructures that allow for the creation of single/multi-node computes for deployment.

When executing the CommandJob, essentially running kedro run with ThreadRunner on a Cluster, the job fails. However, this issue does not occur when running the same job on a Compute Instance, or when run locally from source using kedro run.

These command jobs run with the same environment image in both cases.

Steps to Reproduce

Set up an AML Cluster with the specified environment.
Execute the CommandJob to run the Kedro pipeline with ThreadRunner.
Observe the error.

Expected Result

I would expect the job to run successfully on the cluster, as it does on other compute instances with the same configuration.

Actual Result

The job fails with the following error:

sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)

Attempts to Resolve

Using a small max_workers in the runner configuration.
Used connection parameters for the engine
Tried different engine parameters, including:

{
    "pool_size": 5,
    "max_overflow": 10,
    "pool_timeout": 120,
    "pool_recycle": 1800,
    "pool_reset_on_return": null,
    "isolation_level": "AUTOCOMMIT",
    "pool_pre_ping": true,
    "echo_pool": "debug"
}

Different Oracle (ugh, I know) drivers
Different versions of oracledb and cx-Oracle no luck.

Logs

Here is a log file of a run with 'echo_pool': 'debug' and a similar setup, with 5 SQLScriptDataset as input. Running in AzureML.log Pasted image 20240613014731

Your Environment

Python version: 3.11.9
kedro version: 0.19.6
kedro-datasets version: 3.0.0
cx-Oracle version: 8.3.0
OS: mcr.microsoft.com/azureml/inference-base-2204:20240531.v1
Oracle client: oracle/instantclient_21_8
Azure ML compute type: Standard_D16_v3
Oracle DB version: 19

Jun 13 '24 01:06 gitgud5000

kedro kedro copied to clipboard

Database Connection Failure on AML clusters using kedro `ThreadRunner`

Description

Problem/Error

Context

Steps to Reproduce

Expected Result

Actual Result

Attempts to Resolve

Logs

Your Environment

kedro
kedro copied to clipboard