
SparkDataset failing with databricks-connect serverless cluster

Open star-yar opened this issue 8 months ago • 5 comments

Description

SparkDataset isn't working with a serverless cluster obtained via databricks-connect. Is there a way to override how the Spark session is created in the dataset class?

Context

I'm trying to run a simple Kedro pipeline locally. It uses a catalog item of type SparkDataset. In this project I create the Spark session via databricks-connect as:

DatabricksSession.builder.profile("profile_name").serverless(enabled=True).getOrCreate()

But the get_spark function inside the dataset does:

DatabricksSession.builder.getOrCreate()

And my pipeline fails with: `Cluster id or serverless are required but were not specified`

Steps to Reproduce

  1. Install Databricks-connect
  2. Authenticate in the workspace
  3. Create serverless executor in the Databricks workspace
  4. Add a catalog item "dataset" of type SparkDataset to the data catalog
  5. Define a simple pipeline: `pipeline([node(lambda x: x, inputs="dataset", outputs="dataset")])`
  6. Try to run the Kedro pipeline
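For step 4, the catalog entry might look like the fragment below (the filepath and file format are placeholders, not taken from the issue):

```yaml
dataset:
  type: spark.SparkDataset
  filepath: /path/to/data.parquet   # placeholder location
  file_format: parquet
```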

Expected Result

The pipeline runs without issues

Actual Result

The pipeline fails with `Cluster id or serverless are required but were not specified`

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

Operating system and version: Win 10

python==3.10.10
Kedro==0.19.11
kedro-datasets==6.0.0
databricks-connect==16.1.0
databricks-sdk==0.40.0

Related to #700

star-yar avatar Mar 11 '25 19:03 star-yar

Hi @star-yar, thanks for flagging this issue. It sounds like the best solution is to allow passing a session to the dataset, or you could monkey patch it. We'd be more than happy to accept a PR to fix the issue.
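The monkey-patch route could look roughly like this. Since the real `get_spark` lives inside kedro-datasets and building a `DatabricksSession` needs live workspace credentials, the sketch below demonstrates the pattern against a stand-in module; in a real project you would patch the `get_spark` function in `kedro_datasets.spark.spark_dataset` (function name taken from the issue) before the pipeline runs, e.g. in a `before_pipeline_run` hook.

```python
import types

# Stand-in for kedro_datasets.spark.spark_dataset: its get_spark()
# builds a default session, which fails on serverless databricks-connect.
spark_dataset = types.SimpleNamespace(
    get_spark=lambda: "default-session"  # stand-in for DatabricksSession.builder.getOrCreate()
)

def make_serverless_session():
    # Stand-in for:
    # DatabricksSession.builder.profile("profile_name").serverless(enabled=True).getOrCreate()
    return "serverless-session"

# The monkey patch: replace the module-level session factory so every
# SparkDataset load/save picks up the serverless session instead.
spark_dataset.get_spark = make_serverless_session

print(spark_dataset.get_spark())  # → serverless-session
```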

merelcht avatar Mar 12 '25 16:03 merelcht

After some experimenting, I was able to make it work. Users can make the session builder behave as a singleton by setting the environment variables DATABRICKS_CLUSTER_ID or DATABRICKS_SERVERLESS_COMPUTE_ID.
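The env-var workaround can be wired up before any session is built, e.g. in `settings.py` or a Kedro hook. The exact variable names and values below are assumptions based on databricks-connect's environment-based configuration (the issue repeats `DATABRICKS_CLUSTER_ID` twice), so verify them against your databricks-connect version:

```python
import os

# Set before any DatabricksSession is created, so the default
# DatabricksSession.builder.getOrCreate() inside the dataset can
# resolve the compute target without extra builder arguments.
# Variable names/values are assumptions -- check your databricks-connect docs.
os.environ.setdefault("DATABRICKS_SERVERLESS_COMPUTE_ID", "auto")
os.environ.setdefault("DATABRICKS_CONFIG_PROFILE", "profile_name")
```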

Still, I think an optional session arg would make sense. How do you see the implementation from the data catalog usage point of view? Would users provide the session like this?

dataset:
  type: spark.SparkDataset
  spark_session_builder: DatabricksSession.builder.profile("profile_name").serverless(enabled=True)

star-yar avatar Mar 13 '25 12:03 star-yar

Glad you managed to get it working @star-yar! I was thinking of maybe nesting it, so something like:

dataset:
  type: spark.SparkDataset
  spark_session_builder: 
     profile: profile_name
     serverless: True
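If this nested `spark_session_builder` config were adopted, the dataset could translate the keys into chained builder calls roughly as below. This is a sketch against a stub builder (a real `DatabricksSession.builder` needs workspace credentials); the mapping of catalog keys to builder methods is an assumption, not an agreed design:

```python
class StubBuilder:
    """Stand-in for DatabricksSession.builder that records the calls made."""

    def __init__(self):
        self.calls = []

    def profile(self, name):
        self.calls.append(("profile", name))
        return self  # builder methods chain by returning the builder

    def serverless(self, enabled=True):
        self.calls.append(("serverless", enabled))
        return self


def apply_session_config(builder, config):
    """Turn a nested catalog mapping into chained builder calls,
    e.g. {"profile": "profile_name", "serverless": True}."""
    for key, value in config.items():
        builder = getattr(builder, key)(value)
    return builder


builder = apply_session_config(
    StubBuilder(), {"profile": "profile_name", "serverless": True}
)
print(builder.calls)  # [('profile', 'profile_name'), ('serverless', True)]
```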

merelcht avatar Mar 14 '25 11:03 merelcht

So after some thinking, I'm leaning towards keeping the current interface. The downside of propagating builder args or any additional setup information through the catalog is that you'd have to replicate it across all the configs and then somehow manage it. Yes, a global variable would be an option, but that's still a lot of boilerplate code even with variable propagation.

My suggestion would be to catch an error like the one I'm experiencing and extend it with a suggestion that the user set the appropriate env variables (typically in Kedro hooks), so that the Databricks session builder stays a singleton. WDYT?
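Extending the error could be as simple as wrapping session creation and re-raising with a hint. The error text below is taken from the issue; the wrapper name and hint wording are only suggestions, and the simulated builder stands in for the real databricks-connect call:

```python
def get_spark_with_hint(build_session):
    """Wrap session creation so the databricks-connect error points users
    at the env-var workaround instead of failing opaquely."""
    try:
        return build_session()
    except Exception as exc:
        if "Cluster id or serverless are required" in str(exc):
            raise RuntimeError(
                f"{exc} Hint: set DATABRICKS_CLUSTER_ID or the serverless "
                "compute environment variable (e.g. in a Kedro hook) so the "
                "default DatabricksSession.builder.getOrCreate() can resolve "
                "the compute target."
            ) from exc
        raise


# Simulate the failing builder call described in the issue:
def failing_builder():
    raise ValueError("Cluster id or serverless are required but were not specified")


try:
    get_spark_with_hint(failing_builder)
except RuntimeError as err:
    print("Hint" in str(err))  # True
```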

star-yar avatar Mar 14 '25 12:03 star-yar

Hmm, yes, that's a good point. It's not great to bloat the catalog, and it's technically not really dataset-level config but something higher level. It makes sense to expand the error message. Are you interested in opening a PR for that?

merelcht avatar Mar 14 '25 13:03 merelcht