Support for remote spark sessions and databricks-connect
Description
Since Spark 3.4, Spark Connect (and the equivalent databricks-connect v2) has been available for seamless development against remote Spark sessions.
This is extremely useful for interactive debugging of kedro pipelines from an IDE.
However, the syntax for creating remote Spark sessions changed, and the current kedro Spark implementations hard-code the legacy session creation: SparkSession.builder.getOrCreate().
Therefore kedro currently does not support remote sessions.
I don't know about on-premise Spark setups, but with databricks-connect>=13.1 the current kedro Spark datasets can't be instantiated on a remote session; trying to do so raises the following error:
RuntimeError: Only remote Spark sessions using Databricks Connect are supported. Could not find connection parameters to start a Spark remote session.
Context
I suggest extending all Spark-related code in kedro to support remote sessions.
Possible Implementation
I work in a Databricks environment, and the following code instantiates a spark object correctly both in a remote databricks-connect>=13.0 session and in a Databricks notebook run directly from the web UI:
    from typing import Any

    import pyspark
    from pyspark.sql import SparkSession


    def get_spark() -> Any:
        """
        Returns the SparkSession; we need this wrapper because the SparkSession
        is retrieved differently in databricks-connect vs a notebook in the web UI.
        """
        # When using databricks-connect, the reported pyspark version is the DBR
        # version (e.g. 13.x) instead of the actual pyspark version (e.g. 3.4.x).
        pyspark_major_version = int(pyspark.__version__.split(".")[0])
        if pyspark_major_version >= 13:
            # We are in a databricks-connect >= 13.0.0 (a.k.a. databricks-connect-v2)
            # remote session, so Spark is initialised differently.
            from databricks.connect import DatabricksSession

            spark = DatabricksSession.builder.getOrCreate()
        else:
            # For sessions in the notebook web UI or earlier versions of
            # databricks-connect, we get the session the usual way.
            spark = SparkSession.builder.getOrCreate()
        return spark
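For context, a dataset implementation would then go through this helper whenever it needs the session; a minimal usage sketch (the path below is made up):

    # Hypothetical usage inside a dataset's load logic; the path is illustrative only.
    df = get_spark().read.parquet("dbfs:/tmp/example.parquet")
    df.show()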
This could probably be extended or improved to support other environments (a remote session against an on-premise Spark cluster, databricks-connect<=12.2, etc.).
A quick fix would be to update kedro's get_spark() function to this one (or a more flexible and better implementation), and maybe to move it into an independent module instead of each dataset having its own implementation of get_spark().
Possible Alternatives
I would love to hear the kedro developers' opinions on this and to design a robust solution that fully supports remote spark-connect and databricks-connect sessions.
Realistically, does everything stay the same except the session creation step? I am fine with adding support; the problem, though, is that we don't have a Databricks environment that we can put into automated testing, so it's hard to know when things break. The latest pyspark version is 3.4.x, so is your snippet a correct way to get the Databricks Runtime version?
Questions:
- Does this change across different versions of Databricks Runtime?
- I believe your snippet only shows Databricks Connect, so does Spark Connect work out of the box?
> and maybe consider moving it to an independent module instead of each dataset having its own implementation of get_spark()

This is a good idea; we also share some logic for parsing paths for Spark. You should probably test against and look at kedro-datasets instead of kedro.extras.dataset, which will be deprecated soon and has not been updated for a year. kedro-datasets is the new repository where we keep dataset changes: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/kedro_datasets/spark/spark_dataset.py
Let me try to cover all your questions:
- Yes, the only difference between a remote and a normal Spark session is how the spark object is obtained. After that, everything stays the same.
- The version-check trick works because, when you install databricks-connect, the pyspark Python module reports the databricks-connect version. So with databricks-connect==13.0.0 you get pyspark.__version__ == "13.0.0" in your remote Python session, even though the Databricks cluster runs Spark 3.4.x under the hood (a short check illustrating this is sketched right after these answers).
- In Databricks Runtime < 13.0 there was a legacy databricks-connect module, but its syntax was also the good old spark = SparkSession.builder.getOrCreate(), so it should work fine on any Databricks Runtime version.
- I can't test vanilla Spark Connect because I don't have access to a standalone Spark cluster with Spark > 3.4. The current implementation wouldn't work there anyway, so this change should not make things worse for non-Databricks users.
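As a small illustration of the second point (a sketch, assuming a virtual environment where databricks-connect>=13 is installed instead of plain pyspark):

    import pyspark

    # With databricks-connect==13.x installed, this reports the DBR version,
    # e.g. "13.0.0", rather than the underlying Spark version "3.4.x".
    print(pyspark.__version__)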
One more question: should we migrate the issue to the other repo or keep discussing here?
We can keep the discussion here; I can transfer the issue to the correct repository.
Basically I don't have a problem with this, because it won't affect existing users and only adds new support for Databricks Connect.
I am slightly uncomfortable with the way the version is checked. It feels a bit hacky; is there any environment variable that tells us the session was initialised via databricks-connect? Otherwise, go ahead and create the PR.
If I understand correctly it is all handled by databricks-connect, so no new dependencies are added. We can keep this change in the existing Spark datasets.
I know the version check feels hacky, but I didn't find a more elegant way to do it.
At least it should work until Spark reaches v13, which should take a little while 😄
Any suggestion on where to put the generic get_spark function so the code isn't duplicated in every dataset?
Databricks users will need to install the databricks-connect package, but since the import is inside the function, no new kedro dependencies are needed for non-databricks-connect users.
Maybe somewhere within the spark/ modules, since datasets can be installed independently.
How about trying to import DatabricksSession and falling back to SparkSession if the import fails? DatabricksSession is available only in databricks-connect-v2.
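A minimal sketch of that idea; the shared helper module name and location here are only illustrative, not an agreed design:

    # e.g. a hypothetical kedro_datasets/spark/_spark_utils.py shared by the Spark datasets
    from typing import Any

    from pyspark.sql import SparkSession


    def get_spark() -> Any:
        """Prefer a Databricks Connect session when available, else plain Spark."""
        try:
            # DatabricksSession exists only in databricks-connect-v2, so a successful
            # import means we should build the session through it.
            from databricks.connect import DatabricksSession

            return DatabricksSession.builder.getOrCreate()
        except ImportError:
            # Plain pyspark (or legacy databricks-connect): classic session builder.
            return SparkSession.builder.getOrCreate()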
Anyway, kedro currently pins its Spark dependency to pyspark>=2.2, <3.4. Isn't that kind of a blocker?
@KrzysztofDoboszInpost it's not; our CI is running on pyspark 3.4:

    SPARK = "pyspark>=2.2, <4.0"
Hi @noklam and @KrzysztofDoboszInpost, I have updated the PR with your suggestions. I appreciate your review; let me know what you think.
DatabricksSession.builder.getOrCreate() with a fallback to SparkSession.builder.getOrCreate() in case of an import error should be a good implementation.
DatabricksSession supports configuration via SPARK_REMOTE exactly as SparkSession does, while also supporting additional configuration mechanisms, and it also detects notebooks.
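Not something I have verified end-to-end, but my understanding is that both builders can also be pointed at a remote cluster explicitly; a rough sketch with placeholder connection details:

    from pyspark.sql import SparkSession

    # Vanilla Spark Connect (pyspark>=3.4): target a Spark Connect server directly.
    # The sc:// URL is a placeholder.
    spark = SparkSession.builder.remote("sc://my-spark-connect-host:15002").getOrCreate()

    # Databricks Connect v2: besides environment-based configuration such as
    # SPARK_REMOTE, the builder accepts explicit settings (all values are placeholders).
    from databricks.connect import DatabricksSession

    spark = DatabricksSession.builder.remote(
        host="https://my-workspace.cloud.databricks.com",
        token="dapiXXXXXXXXXXXX",
        cluster_id="0000-000000-abcdefgh",
    ).getOrCreate()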
Re-opening because, as @DimedS said, #352 "enables the use of remote Spark sessions with Databricks. However, we should keep the initial issue open, as this PR doesn't facilitate remote Spark sessions with solutions other than Databricks."
Hi, I appreciate this is a fairly old issue, but I'm encountering a related problem when trying to define both kedro-datasets and databricks-connect as dependencies for my project.
I'm using uv to manage these, and here is a minimal reproducible pyproject.toml:
    [project]
    name = "uv-spark-test"
    version = "0.1.0"
    description = "Add your description here"
    readme = "README.md"
    requires-python = ">=3.11"
    dependencies = [
        "databricks-connect==14.3.2",
        "kedro-datasets[databricks]>=4.2.0",
    ]

    [tool.uv.sources]
    kedro-datasets = { git = "https://github.com/kedro-org/kedro-plugins.git", subdirectory = "kedro-datasets" }
The issue seems to stem from the fact that both libraries depend on pyspark 3.5, but on conflicting versions of it.
Is there a better way to handle this conflict, or is this out of scope for kedro-datasets?
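For what it's worth, one workaround that might be worth trying (untested here, and the pinned version is only an example) is uv's dependency overrides, which force a single pyspark requirement during resolution:

    # pyproject.toml sketch: override the conflicting transitive pyspark pins
    [tool.uv]
    override-dependencies = [
        "pyspark==3.5.0",
    ]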