
Support for remote spark sessions and databricks-connect

MigQ2 opened this issue 11 months ago · 11 comments

Description

Since Spark 3.4, Spark Connect (and the equivalent databricks-connect v2) has been available for seamless development against remote Spark sessions.

This is extremely useful for interactive debugging of kedro pipelines from an IDE.

However, remote Spark sessions are created with a different builder syntax, and kedro's current Spark datasets hard-code the legacy session creation: SparkSession.builder.getOrCreate().

Therefore kedro currently does not support remote sessions.
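
For illustration, here is roughly how the alternative session-creation paths differ (a sketch; pick one per environment, and note that the sc:// connection string is a placeholder):

from pyspark.sql import SparkSession

# Legacy path, currently hard-coded in kedro's Spark datasets:
spark = SparkSession.builder.getOrCreate()

# Spark Connect path (pyspark>=3.4): the session is built against a
# remote connection string ("sc://localhost:15002" is a placeholder)
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# databricks-connect v2 path: the session comes from a different builder
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()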

I can't speak for on-premises Spark setups, but with databricks-connect>=13.1 the current kedro Spark datasets can't be instantiated in a remote session; trying to do so raises the following error:

RuntimeError: Only remote Spark sessions using Databricks Connect are supported. Could not find connection parameters to start a Spark remote session.
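
For reference, a minimal reproduction (a sketch; the dbfs:/ path is a placeholder):

from kedro_datasets.spark import SparkDataSet

# With databricks-connect>=13.1 installed, this raises the RuntimeError
# above, because the dataset internally calls the legacy
# SparkSession.builder.getOrCreate()
dataset = SparkDataSet(filepath="dbfs:/tmp/example", file_format="parquet")
dataset.load()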

Context

I suggest extending all Spark-related code in kedro to support remote sessions.

Possible Implementation

I work in a Databricks environment, and the following code correctly instantiates a spark object both in a remote databricks-connect>=13.0 session and in a Databricks notebook run directly from the web UI:

from typing import Any

import pyspark
from pyspark.sql import SparkSession


def get_spark() -> Any:
    """
    Returns the SparkSession. We need this wrapper because the SparkSession
    is retrieved differently in databricks-connect vs. a notebook in the web UI.
    """
    # Under databricks-connect, pyspark.__version__ reports the DBR
    # version (e.g. 13.x) instead of the actual PySpark version
    pyspark_major_version = int(pyspark.__version__.split(".")[0])
    if pyspark_major_version >= 13:
        # We are in a databricks-connect>=13.0.0 (a.k.a. databricks-connect v2)
        # remote session, so Spark is initialized differently
        from databricks.connect import DatabricksSession

        spark = DatabricksSession.builder.getOrCreate()
    else:
        # For sessions in a notebook in the web UI, or for earlier versions
        # of databricks-connect, we get spark the usual way
        spark = SparkSession.builder.getOrCreate()

    return spark

This could probably be extended or improved to support all the different environments (a remote session to an on-premises Spark cluster, databricks-connect<=12.2, etc.).
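
One possible generalization (a sketch, only tested on Databricks): prefer the Databricks builder when databricks.connect is importable, and otherwise fall back to the classic builder, which, if I understand correctly, already honours the SPARK_REMOTE environment variable for plain Spark Connect servers in pyspark>=3.4:

from typing import Any

from pyspark.sql import SparkSession


def get_spark() -> Any:
    """Sketch of a more general helper (untested outside Databricks)."""
    try:
        # Only importable when databricks-connect>=13.0 is installed
        from databricks.connect import DatabricksSession

        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        # Plain pyspark: local sessions, classic clusters, or (in
        # pyspark>=3.4) a Spark Connect server configured via the
        # SPARK_REMOTE environment variable
        return SparkSession.builder.getOrCreate()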

A quick fix would be to replace kedro's get_spark() function with this one (or a more flexible implementation), and perhaps to move it into an independent module instead of each dataset carrying its own copy of get_spark(), as sketched below.
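
For example, each dataset could then import the shared helper instead of defining it (the module path below is purely illustrative, not kedro's actual layout):

# e.g. kedro_datasets/spark/_spark_utils.py would define get_spark() once,
# and every Spark dataset would import it (hypothetical module path):
from kedro_datasets.spark._spark_utils import get_spark

spark = get_spark()
df = spark.read.parquet("dbfs:/tmp/example")  # placeholder path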

Possible Alternatives

I would love to hear the kedro developers' opinions on this and to help design a robust solution that fully supports remote Spark Connect and databricks-connect sessions.

MigQ2 · Sep 27 '23 21:09