
Support for remote spark sessions and databricks-connect

Open MigQ2 opened this issue 1 year ago • 11 comments

Description

In Spark 3.4, spark-connect (and the equivalent databricks-connect v2) was introduced for seamless development with remote Spark sessions.

This is extremely useful for interactively debugging Kedro pipelines from an IDE.

However, the syntax for creating remote Spark sessions was changed, and Kedro's current Spark datasets hard-code the legacy session creation: SparkSession.builder.getOrCreate().

Kedro therefore does not currently support remote sessions.

I don't know about on-premise Spark setups, but with databricks-connect>=13.1 the current Kedro Spark datasets can't be instantiated on a remote session; attempting to do so raises the following error:

RuntimeError: Only remote Spark sessions using Databricks Connect are supported. Could not find connection parameters to start a Spark remote session.

Context

I suggest extending all Spark-related code in Kedro to support remote sessions.

Possible Implementation

I work in a Databricks environment, and this code works fine to instantiate a spark object either in a remote databricks-connect>=13.0 session or directly in a Databricks notebook in the web UI:

from typing import Any

import pyspark
from pyspark.sql import SparkSession


def get_spark() -> Any:
    """
    Returns the SparkSession. We need this wrapper because the SparkSession
    is retrieved differently in databricks-connect vs a notebook in the web UI.
    """
    # When using databricks-connect, pyspark.__version__ reports the DBR
    # (Databricks Runtime) version instead of the actual pyspark version
    pyspark_major_version = int(pyspark.__version__.split(".")[0])
    if pyspark_major_version >= 13:
        # In this case we are in a databricks-connect >= 13.0.0
        # (a.k.a. databricks-connect-v2) remote session, and therefore
        # spark is initialized differently
        from databricks.connect import DatabricksSession

        spark = DatabricksSession.builder.getOrCreate()
    else:
        # For sessions in the notebook web UI or earlier versions of
        # databricks-connect, we get spark the usual way
        spark = SparkSession.builder.getOrCreate()

    return spark

This could probably be extended or improved to support other environments (remote sessions to an on-premise Spark cluster, databricks-connect<=12.2, etc.)

A quick fix would be to replace Kedro's get_spark() function with this one (or a more flexible and better implementation), and perhaps to move it to a shared module instead of each dataset having its own implementation of get_spark().

Possible Alternatives

I would love to hear the Kedro developers' opinions on this and to help design a robust solution that fully supports remote spark-connect and databricks-connect sessions.

MigQ2 avatar Sep 27 '23 21:09 MigQ2

Realistically, does everything stay the same except the session creation step? I am fine with adding support; the problem, though, is that we don't have a Databricks environment that we can put into automated testing, so it's hard to know when things break. The latest pyspark version is 3.4.x, so is your snippet the correct way to get the Databricks Runtime version?

Questions:

  • Does this change in different version of Databricks Runtime?
  • I believe your snippet only covers Databricks Connect, so does Spark Connect work out of the box?

and maybe consider moving it to an independent module instead of each dataset having its own implementation of get_spark()

This is a good idea; we also share some path-parsing logic for Spark. You should probably test and look at kedro-datasets instead of kedro.extras.datasets, which will be deprecated soon and has not been updated for a year. kedro-datasets is the new repository where we keep dataset changes. https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/kedro_datasets/spark/spark_dataset.py

noklam avatar Sep 27 '23 22:09 noklam

Let me try to cover all your questions:

  1. Yes, the only difference between a remote and a normal Spark session is getting the spark object. After that, everything stays the same.

  2. The version-check trick works because, when you install databricks-connect, the pyspark Python module reports the databricks-connect version. So with databricks-connect==13.0.0 you get pyspark.__version__ == "13.0.0" in your remote Python session, even though the Databricks cluster runs Spark 3.4.x under the hood.

  3. In Databricks Runtime < 13.0 there was a legacy databricks-connect module, but its syntax was also the good old spark = SparkSession.builder.getOrCreate(), so this should work fine on any Databricks Runtime version.

  4. I can't test vanilla Spark Connect because I don't have access to a standalone Spark cluster with Spark > 3.4. The current implementation wouldn't work there anyway, so this change should not make things worse for non-Databricks users.

  5. Should we migrate the issue to the other repo or keep discussing here?

MigQ2 avatar Sep 27 '23 22:09 MigQ2

We can keep the discussion here; I can transfer the issue to the correct repository.

noklam avatar Sep 27 '23 22:09 noklam

Basically I don't have a problem with this, because it won't affect existing users but adds new support for Databricks Connect.

I am slightly uncomfortable with the version check; it feels a bit hacky. Is there an environment variable that tells us the session was initialised via Databricks Connect? Otherwise, go ahead and create the PR.

If I understand correctly, it is all handled by databricks-connect, so no new dependencies are added. We can keep this change in the existing Spark datasets.

noklam avatar Sep 27 '23 22:09 noklam

I know the version check feels hacky, but I didn't find a more elegant way to do it.

At least it should work until spark reaches v13, which should take a lil while 😄

Any suggestion on where to put the generic get_spark function to not have the duplicated code in every dataset?

Databricks users will need to install the databricks-connect package, but since the import is inside the function, no new Kedro dependencies are needed for non-databricks-connect users.

MigQ2 avatar Sep 27 '23 22:09 MigQ2

Maybe somewhere within the spark/ modules, since datasets can be installed independently.

noklam avatar Sep 27 '23 23:09 noklam

How about trying to import DatabricksSession and falling back to SparkSession if the import fails? DatabricksSession is available only in databricks-connect-v2.
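A minimal sketch of that import-fallback pattern (my own illustration, assuming only that DatabricksSession is importable exclusively from databricks-connect v2):

```python
def get_spark():
    """Return a remote session via databricks-connect v2 if it is
    installed; otherwise fall back to a regular SparkSession."""
    try:
        # This import succeeds only with databricks-connect >= 13.0 (v2)
        from databricks.connect import DatabricksSession
        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        from pyspark.sql import SparkSession
        return SparkSession.builder.getOrCreate()
```

This avoids inspecting version strings entirely: the environment that has databricks-connect v2 installed is exactly the environment that should use it.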

Anyway, kedro currently pins its Spark dependency to pyspark>=2.2, <3.4. Isn't that kind of a blocker?

KrzysztofDoboszInpost avatar Sep 28 '23 08:09 KrzysztofDoboszInpost

@KrzysztofDoboszInpost it's not; our CI runs on pyspark 3.4:

SPARK = "pyspark>=2.2, <4.0"

noklam avatar Sep 28 '23 08:09 noklam

Hi @noklam and @KrzysztofDoboszInpost , I have updated the PR with your suggestions. I appreciate your review, let me know what you think

MigQ2 avatar Oct 03 '23 16:10 MigQ2

DatabricksSession.builder.getOrCreate() with a fallback to SparkSession.builder.getOrCreate() on ImportError should be a good implementation.

DatabricksSession supports configuration via SPARK_REMOTE exactly as SparkSession does, while also supporting additional configuration mechanisms, and it detects notebooks as well.
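For illustration, a sketch of the two equivalent ways to point pyspark >= 3.4 at a Spark Connect endpoint (sc://localhost:15002 is a placeholder address, not a real server):

```python
import os


def connect_remote():
    """Build a SparkSession against a Spark Connect endpoint.

    If SPARK_REMOTE is set (e.g. "sc://localhost:15002"), a plain
    SparkSession.builder.getOrCreate() would pick it up automatically;
    the explicit .remote(...) call below does the same thing by hand.
    """
    from pyspark.sql import SparkSession

    endpoint = os.environ.get("SPARK_REMOTE", "sc://localhost:15002")
    return SparkSession.builder.remote(endpoint).getOrCreate()
```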

cdkrot avatar Oct 04 '23 19:10 cdkrot

Re-opening, because as @DimedS said #352 "enables the use of remote Spark sessions with Databricks. However, we should keep the initial issue open, as this PR doesn't facilitate remote Spark sessions with solutions other than Databricks."

merelcht avatar Nov 01 '23 14:11 merelcht

Hi, I appreciate this is a fairly old issue, but I'm encountering a related problem when trying to define both kedro-datasets and databricks-connect as dependencies for my project.

I'm using uv to manage these and here is a minimal reproducible pyproject.toml

[project]
name = "uv-spark-test"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "databricks-connect==14.3.2",
    "kedro-datasets[databricks]>=4.2.0",
]

[tool.uv.sources]
kedro-datasets = { git = "https://github.com/kedro-org/kedro-plugins.git", subdirectory = "kedro-datasets" }

The issue seems to stem from the fact that both libraries depend on pyspark 3.5, but on conflicting versions of it.

Is there a better way to handle this conflict or is this out of scope for kedro-datasets?
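One workaround I have been considering (a sketch only; I haven't confirmed it resolves cleanly with these exact pins) is uv's dependency overrides, which force a single pyspark resolution regardless of what each package declares:

```toml
[tool.uv]
# Force one pyspark version for both databricks-connect and kedro-datasets;
# this bypasses their declared pins, so it may break whichever package
# genuinely requires a different version.
override-dependencies = ["pyspark==3.5.0"]
```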

jstammers avatar Oct 09 '24 13:10 jstammers