kedro icon indicating copy to clipboard operation
kedro copied to clipboard

Database Connection Failure on AML clusters using kedro `ThreadRunner`

Open gitgud5000 opened this issue 8 months ago • 4 comments

Description

I have an issue running Kedro with ThreadRunner to execute the following pipeline: Pasted image 20240613004553

The primary layer shown in the Kedro Viz above is a series of 21 SQLScriptDataset objects (a pandas.sql_dataset.SQLQueryDataset subclass which formats input queries in a special way using parameters in the catalog and then calls super().__init__).

This Kedro pipeline is triggered as part of a CommandJob in Azure Machine Learning (AML), using a command_job.py which runs a Kedro session with something like this:

if __name__ == "__main__":
...
with KedroSession.create() as session:
	session.run(...,runner = runner)

Problem/Error

After most or all of the datasets in the primary layer are loaded, SQLAlchemy produces the following error:

sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)

Context

In AML, these jobs can be run on two types of compute: a Compute Instance, which is an Ubuntu VM used for development, and Clusters, which are managed infrastructures that allow for the creation of single/multi-node computes for deployment.

When executing the CommandJob, essentially running kedro run with ThreadRunner on a Cluster, the job fails. However, this issue does not occur when running the same job on a Compute Instance, or when run locally from source using kedro run.

These command jobs run with the same environment image in both cases.

Steps to Reproduce

  • Set up an AML Cluster with the specified environment.
  • Execute the CommandJob to run the Kedro pipeline with ThreadRunner.
  • Observe the error.

Expected Result

I would expect the job to run successfully on the cluster, as it does on other compute instances with the same configuration.

Actual Result

The job fails with the following error:

sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)

Attempts to Resolve

  • Using a small max_workers in the runner configuration.
  • Used connection parameters for the engine
  • Tried different engine parameters, including:
{
    "pool_size": 5,
    "max_overflow": 10,
    "pool_timeout": 120,
    "pool_recycle": 1800,
    "pool_reset_on_return": null,
    "isolation_level": "AUTOCOMMIT",
    "pool_pre_ping": true,
    "echo_pool": "debug"
}
  • Different Oracle (ugh, I know) drivers
  • Different versions of oracledb and cx-Oracle no luck.

Logs

Here is a log file of a run with 'echo_pool': 'debug' and a similar setup, with 5 SQLScriptDataset as input. Running in AzureML.log Pasted image 20240613014731

Your Environment

  • Python version: 3.11.9
  • kedro version: 0.19.6
  • kedro-datasets version: 3.0.0
  • cx-Oracle version: 8.3.0
  • OS: mcr.microsoft.com/azureml/inference-base-2204:20240531.v1
  • Oracle client: oracle/instantclient_21_8
  • Azure ML compute type: Standard_D16_v3
  • Oracle DB version: 19

gitgud5000 avatar Jun 13 '24 01:06 gitgud5000