kedro-plugins
feat(datasets): Enrich databricks connect error message
Description
Resolves #1038
Development notes
Catches the error raised by databricks-connect and re-raises it with suggestions for resolution.
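The catch-and-reraise pattern could look like the following minimal sketch (the `DatasetError` name, the `builder` parameter, and the message text are illustrative assumptions, not the actual implementation in this PR):

```python
class DatasetError(Exception):
    """Stand-in for the dataset error type (hypothetical name for illustration)."""


def create_spark_session(builder):
    """Call a zero-argument session factory, enriching any failure it raises."""
    try:
        return builder()
    except Exception as exc:
        # Re-raise with actionable hints while keeping the original traceback
        # attached via exception chaining (`raise ... from exc`).
        raise DatasetError(
            "Failed to create a Spark session via databricks-connect. "
            "Check your Databricks connection configuration, or initialise "
            "the session in a project hook before running the pipeline."
        ) from exc
```

Chaining with `from exc` keeps the original databricks-connect traceback visible below the enriched message, so the user loses no debugging information.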
Checklist
- [x] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
- [x] Updated the documentation to reflect the code changes
- [ ] Updated `jsonschema/kedro-catalog-X.XX.json` if necessary
- [x] Added a description of this change in the relevant `RELEASE.md` file
- [x] Added tests to cover my changes (no tests exist for `get_spark` so I'm marking this as done)
- [ ] Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)
Any ideas why this might fail? Not sure this is caused by my PR, wdyt?
@merelcht
No this isn't related to your PR. This failure started showing up on our main branch too. We'll have a look at fixing it.
@star-yar, the test failures are resolved, but now there's a coverage issue which is related to the changes you made. Can you add tests to cover the behaviour?
Before implementing such a change, I wanted to share some additional learning.
We should not invoke Databricks Connect here. I think the newer Databricks Connect is intended to establish the connection once, after which the created session is retrieved via `pyspark.sql.SparkSession`.
So, in the context of a Kedro project, the user should create a hook that establishes the connection via Databricks Connect. The catalog code should then always get the Spark session, never make it (meaning we remove the Databricks Connect invocation completely), relying on the user having created the connection first.
So maybe we should document this approach somewhere, and rely on the Spark session already existing rather than handling the creation case. This will change the error message we output.
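The "always get, never make" rule could look like this minimal sketch (the function name and error message are hypothetical; the real dataset code lives in kedro-datasets):

```python
def get_active_spark_session():
    """Return an already-initialised SparkSession, or fail with guidance.

    Sketch only: the dataset never builds a session itself; a project hook
    (e.g. one connecting via Databricks Connect) is expected to have
    created it beforehand.
    """
    try:
        # pyspark is an optional dependency, so import lazily.
        from pyspark.sql import SparkSession
        session = SparkSession.getActiveSession()
    except ImportError:
        session = None
    if session is None:
        raise RuntimeError(
            "No active Spark session found. Create one first, e.g. in a "
            "Kedro hook that connects via Databricks Connect."
        )
    return session
```

`SparkSession.getActiveSession()` returns `None` when nothing has been initialised, which is exactly the case the enriched error message should cover.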
Current workflow (no Spark session pre-initialised):
flowchart LR
n1["kedro session creates"]
n2["dataset invoked"]
n3["dataset creates session
through databricks-connect/pyspark"]
n1 --> n2 --> n3
Suggested workflow (no Spark session pre-initialised):
flowchart LR
n1["kedro session creates"]
n2["dataset invoked"]
n3["dataset gets session
through pyspark*"]
n1 --> n2 --> n3
* If databricks-connect is installed, it will complain that you're trying to create a session through pyspark. We handle this by re-raising the error and notifying the user that they first need to create a hook that establishes the session. @merelcht
Any reflections @merelcht ?
Hi @star-yar, I'm really sorry for the late response.
You raise a very good point. In fact, this is what we recommend in the docs when using Spark (without Databricks): https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#initialise-a-sparksession-using-a-hook
We've had an open issue to rewrite our Spark datasets for a long time (https://github.com/kedro-org/kedro-plugins/issues/135), but we never got round to it, so the Spark-related datasets have evolved through contributions rather than with a proper architecture in mind.
We're currently working on a major release, but afterwards we can put this back on our priority list. You're also more than welcome to make a contribution if you have time.