
feat(datasets): Enrich databricks connect error message

Open · star-yar opened this issue 8 months ago · 6 comments

Description

Resolves #1038

Development notes

Catches error raised by databricks-connect and reraises with suggestions on resolution.
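Roughly, the change looks like this (a minimal sketch: `get_spark` matches the utility name in kedro-datasets, but the exception handling and message wording here are illustrative, not the exact merged code):

```python
from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    """Create or retrieve a SparkSession, enriching databricks-connect errors."""
    try:
        return SparkSession.builder.getOrCreate()
    except Exception as exc:
        # databricks-connect raises when it cannot establish a remote session,
        # e.g. missing DATABRICKS_* env vars or a ~/.databrickscfg profile.
        raise RuntimeError(
            "Failed to create a Spark session. If you are using "
            "databricks-connect, make sure your Databricks credentials are "
            "configured (DATABRICKS_* environment variables or a profile in "
            "~/.databrickscfg), or create the session yourself before "
            "running the pipeline."
        ) from exc
```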

Checklist

- [x] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
- [x] Updated the documentation to reflect the code changes
- [ ] Updated jsonschema/kedro-catalog-X.XX.json if necessary
- [x] Added a description of this change in the relevant RELEASE.md file
- [x] Added tests to cover my changes (no tests exist for get_spark, so I'm marking this as done)
- [ ] Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

star-yar avatar Mar 14 '25 20:03 star-yar

Any ideas why this might fail? Not sure this is caused by my PR, wdyt?

@merelcht

star-yar avatar Mar 14 '25 22:03 star-yar

> Any ideas why this might fail? Not sure this is caused by my PR, wdyt?
>
> @merelcht

No this isn't related to your PR. This failure started showing up on our main branch too. We'll have a look at fixing it.

merelcht avatar Mar 17 '25 09:03 merelcht

@star-yar, the test failures are resolved, but now there's a coverage issue which is related to the changes you made. Can you add tests to cover the behaviour?

merelcht avatar Mar 20 '25 10:03 merelcht

Before implementing such a change, I wanted to share some additional learnings.

We should not invoke Databricks Connect here. I think the newer Databricks Connect is intended to establish the connection once, after which the created session is retrieved via `pyspark.sql.SparkSession`.

So, in the context of a kedro project, the user should create a hook that establishes the connection via databricks-connect. The catalog code should then always *get* the Spark session, never create it (meaning we remove the databricks-connect invocation completely), relying on the user having established the connection first.
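For illustration, such a hook could look roughly like this (a sketch, assuming databricks-connect v13+ with its `DatabricksSession` builder and credentials configured via `DATABRICKS_*` environment variables or a `~/.databrickscfg` profile; the class name is hypothetical):

```python
from databricks.connect import DatabricksSession
from kedro.framework.hooks import hook_impl


class DatabricksConnectHooks:
    """Establish the Databricks Connect session once, before any dataset runs."""

    @hook_impl
    def after_context_created(self, context) -> None:
        # getOrCreate() picks up credentials from DATABRICKS_* environment
        # variables or the default profile in ~/.databrickscfg.
        DatabricksSession.builder.getOrCreate()
```

The hook would then be registered in the project's `settings.py`, e.g. `HOOKS = (DatabricksConnectHooks(),)`.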

So maybe we should mention this approach somewhere in the docs, and rely on the Spark session already being created rather than handling the creation case ourselves. This will change the error message we output.

workflow now; no spark session pre-init

```mermaid
flowchart LR
    n1["kedro session creates"]
    n2["dataset invoked"]
    n3["dataset creates session<br/>through databricks-connect/pyspark"]
    n1 --> n2 --> n3
```

suggested workflow; no spark session pre-init

```mermaid
flowchart LR
    n1["kedro session creates"]
    n2["dataset invoked"]
    n3["dataset gets session<br/>through pyspark*"]
    n1 --> n2 --> n3
```

\* if databricks-connect is installed, it will complain that you're trying to create a session through pyspark; we handle this by re-raising the error and telling the user to first create a hook that establishes the session. @merelcht
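In code, "get, don't create" could look like this (a sketch: `SparkSession.getActiveSession()` is standard pyspark API, but the function name and error wording are illustrative):

```python
from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    """Retrieve the active SparkSession; never create one here."""
    spark = SparkSession.getActiveSession()
    if spark is None:
        raise RuntimeError(
            "No active Spark session found. Create one before the pipeline "
            "runs, e.g. in an after_context_created hook (via pyspark or "
            "databricks-connect)."
        )
    return spark
```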

star-yar avatar Mar 28 '25 14:03 star-yar

Any reflections @merelcht ?

star-yar avatar Apr 16 '25 14:04 star-yar

Hi @star-yar , I'm really sorry for the late response.

You raise a very good point. In fact, this is what we recommend in the docs when using Spark (without Databricks): https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#initialise-a-sparksession-using-a-hook
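For reference, a condensed sketch of the pattern from those docs (assumes a `spark.yml` config file that the project's `config_loader` is set up to resolve):

```python
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Load Spark settings from conf/base/spark.yml and build the
        # session once, before any node or dataset needs it.
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())
        (
            SparkSession.builder.appName(context.project_path.name)
            .config(conf=spark_conf)
            .getOrCreate()
        )
```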

We've had an open issue to rewrite our Spark datasets forever (https://github.com/kedro-org/kedro-plugins/issues/135), but we never got round to it, so the Spark-related datasets have just evolved through contributions rather than with a proper architecture in mind.

We're currently working on a major release, but afterwards we can put this back on our priority list. You're also more than welcome to make a contribution if you have time.

merelcht avatar May 30 '25 11:05 merelcht