
Update Databricks docs

Open AhdraMeraliQB opened this issue 1 year ago • 23 comments

Description

It looks like Databricks has deprecated its CLI tools, which has had the knock-on effect of breaking our docs. A quick fix in #3358 adds the necessary /archive/ to the broken links, but maybe we should rethink the section as a whole?

CC: @stichbury

docs: https://docs.kedro.org/en/stable/deployment/databricks/databricks_ide_development_workflow.html

AhdraMeraliQB avatar Nov 29 '23 13:11 AhdraMeraliQB

And also dbx:

Databricks recommends that you use Databricks asset bundles instead of dbx by Databricks Labs. See What are Databricks asset bundles? and Migrate from dbx to bundles.

https://docs.databricks.com/en/archive/dev-tools/dbx/index.html

astrojuanlu avatar Nov 29 '23 13:11 astrojuanlu

I'll take a look at this in an upcoming sprint -- we did some updates for the asset bundles recently as suggested by Harmony.

stichbury avatar Nov 30 '23 09:11 stichbury

Another couple of things I found in our Databricks workspace guide databricks_notebooks_development_workflow.md:

  • The first half of the guide, which tells users to set up a GitHub repository and a personal access token, push the code, and then create a Databricks Repo, is not strictly needed: kedro new works fine in Databricks notebooks.
    • One has to be careful that the files are created in the Workspace, and not in the ephemeral driver storage. Depending on the runtime version, the default working directory differs, so a cd /Workspace/... might be needed.
  • "On Databricks, Kedro cannot access data stored directly in your project’s directory." This is not correct. From the docs:
    • "Spark cannot directly interact with workspace files on compute configured with shared access mode." however, clusters configured with Single User access mode should, and can, access workspace files.
    • However: “You can use Spark to read data files. You must provide Spark with the fully qualified path." (https://learn.microsoft.com/en-us/azure/databricks/files/workspace-interact#read-data-workspace-files) This means that spark.read.load("/Workspace/...") won't work (because Spark will assume dbfs:/), but spark.read.load("file:/Workspace/...") will.
      • Now, whether or not this can be incorporated into actual Kedro catalogs (in other words: whether our fsspec mangling will work on paths like these) is a different story. One can't simply add file:/ in front of the dataset filepath, because then it will be taken as an absolute path and not a relative one.
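To make the path scheme point above concrete, here is a small hypothetical helper (the function name and example paths are mine, not part of any Databricks or Kedro API): it fully qualifies a workspace path so Spark doesn't resolve it against dbfs:/.

```python
def workspace_spark_path(path: str) -> str:
    """Fully qualify a Databricks workspace file path for Spark.

    A bare "/Workspace/..." path passed to spark.read.load() is resolved
    against dbfs:/, so workspace files need the "file:" scheme prefix.
    Any other path (e.g. an explicit dbfs:/ URI) is returned unchanged.
    """
    if path.startswith("/Workspace/"):
        return "file:" + path
    return path


# spark.read.load(workspace_spark_path("/Workspace/Users/someone/data.parquet"))
# is then the same as spark.read.load("file:/Workspace/Users/someone/data.parquet")
```

This sketch only addresses the Spark side; as noted above, whether such a prefix can be threaded through a Kedro catalog's fsspec handling is a separate question.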

It's true that creating a Databricks Repo synced with a GH repo gives some nice advantages, like being able to edit the code in an actual IDE (whether a local editor or a cloud development environment like Gitpod or GitHub Codespaces). And it's also true that Databricks recommends in different places that data should live in the DBFS root.

However, it would be nice to consider what's the shortest and simplest guide we can write for users to get started with Kedro on Databricks, and then build from there.

astrojuanlu avatar Dec 01 '23 18:12 astrojuanlu

To clarify on the initial comment:

  • The DBFS CLI (legacy) becomes https://docs.databricks.com/en/dev-tools/cli/fs-commands.html. The commands are still databricks fs; at first glance nothing has changed in the CLI itself, only in the structure of the documentation. For example, compare old and new.
  • The Jobs CLI (legacy) stops being documented because "CLI command groups that are not documented in the REST API reference have their own separate reference articles", and Jobs is not one of them. Documentation of the REST API can be found at https://docs.databricks.com/api/workspace/jobs, and the list of jobs subcommands is, for the moment, on GitHub: https://github.com/databricks/cli/blob/main/docs/commands.md#databricks-jobs---manage-databricks-workflows. Less than ideal, but that's the current status AFAIU.

Both of the above require Databricks CLI version 0.205 or above. Apart from that, the commands haven't changed, so all we need to do in this regard is make sure we're not sending users to the legacy docs, and that's it.

astrojuanlu avatar Dec 02 '23 23:12 astrojuanlu

To summarise:

  • [ ] Replace references to DBFS and Jobs legacy CLI docs
  • [ ] Rewrite docs from dbx to Asset Bundles (migration guide)
  • [ ] Simplify Databricks notebooks guide to better serve as starting point for Kedro on Databricks

astrojuanlu avatar Dec 02 '23 23:12 astrojuanlu

@astrojuanlu Is this something that the team can pick up or do we need to ask for time from Jannic or another databricks expert (maybe @deepyaman could ultimately review)?

How are we prioritising this? I'm guessing it's relatively high importance to keep the Databricks docs tracking with their tech.

stichbury avatar Dec 19 '23 12:12 stichbury

We need to build Databricks expertise in the team, so I hope we don't need to ask external experts to do it (it's OK if they give assistance, but we need to own this).

astrojuanlu avatar Dec 19 '23 16:12 astrojuanlu

Added this to the Inbox so that we prioritise.

astrojuanlu avatar Dec 19 '23 16:12 astrojuanlu

@astrojuanlu Is this something that the team can pick up or do we need to ask for time from Jannic or another databricks expert (maybe @deepyaman could ultimately review)?

At this point, it's been almost 4 years since I've used Databricks (and don't currently have any interest in getting back into it), so I'd defer to somebody else. 🙂

deepyaman avatar Dec 19 '23 17:12 deepyaman

More than fair enough @deepyaman! Good to confirm though.

stichbury avatar Dec 19 '23 17:12 stichbury

I'm adding one more item:

  • [ ] Document integration with Databricks Unity Catalog

Every time I give a talk or workshop, invariably somebody from the audience asks "how does the Kedro Catalog play along with Databricks Unity Catalog?".

Our reference docs for kedro-datasets mention it exactly once, in the API docs of pandas.DeltaTableDataset.

And there's one subtle mention of it in databricks.ManagedTableDataset ("the name of the catalog in Unity").
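For reference, a hypothetical catalog entry using that dataset might look like the fragment below. All names are placeholders, and I'm going from memory on the ManagedTableDataset parameter names, so treat this as a sketch rather than documented usage:

```yaml
# Hypothetical Kedro catalog entry targeting a Unity Catalog table
my_table:
  type: databricks.ManagedTableDataset
  catalog: my_unity_catalog   # Unity Catalog catalog name
  database: my_schema         # schema the table lives in
  table: my_table
  write_mode: overwrite
```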

The broader question of Delta datasets is a topic for https://github.com/kedro-org/kedro-plugins/issues/542.

astrojuanlu avatar Feb 07 '24 09:02 astrojuanlu

Relevant: @dannyrfar 's https://github.com/dannyrfar/databricks-kedro-starter

astrojuanlu avatar Feb 21 '24 16:02 astrojuanlu

Maybe this could help: https://github.com/JenspederM/databricks-kedro-bundle

felipemonroy avatar Jun 01 '24 03:06 felipemonroy

This looks really cool. @JenspederM, do you want to share a bit more insight on how far you intend to go with your project?

astrojuanlu avatar Jun 03 '24 10:06 astrojuanlu

Hey @astrojuanlu

Actually, I don't really know if there's more to do. I almost want the project to be as barebones as possible.

The way I've left it now is with a very simple datasets implementation for Unity so that people can customise as required.

As for the DAB resource generator, I'm considering whether I could find a better way for users to set defaults such as job clusters, instance pools, etc.

One thing that is generally lacking is documentation, so that will definitely receive some attention once I have the time.

Do you have any suggestions?

JenspederM avatar Jun 03 '24 11:06 JenspederM

However, it would be nice to consider what's the shortest and simplest guide we can write for users to get started with Kedro on Databricks, and then build from there.

I gave two Kedro on Databricks demos yesterday, so I'm sharing that very simple notebook here: https://github.com/astrojuanlu/kedro-databricks-demo. Hopefully it can be the basis of what I proposed in https://github.com/kedro-org/kedro/issues/3360#issuecomment-1836553608 (still no Kedro Framework there).

astrojuanlu avatar Jul 05 '24 07:07 astrojuanlu

Hey @astrojuanlu

Actually, I don't really know if there's more to do. I almost want the project to be as barebones as possible.

The way I've left it now is with a very simple datasets implementation for Unity so that people can customise as required.

As for the DAB resource generator, I'm considering whether I could find a better way for users to set defaults such as job clusters, instance pools, etc.

One thing that is generally lacking is documentation, so that will definitely receive some attention once I have the time.

Do you have any suggestions?

@JenspederM I gave your kedro-databricks a quick try yesterday and it didn't work out of the box, so if you're open to me opening issues, I'll gladly start doing so 😄

astrojuanlu avatar Jul 05 '24 07:07 astrojuanlu

@astrojuanlu Go for it!

I've been a bit busy these last few days and haven't had the chance to make any progress.

But it's always nice to have some concrete issues to address. 😉

JenspederM avatar Jul 05 '24 08:07 JenspederM

@astrojuanlu Just FYI, I'll merge quite a big PR soon, so hopefully that will address most of the issues you found.

The substitution algorithm was a bit more cumbersome than first anticipated..

JenspederM avatar Jul 07 '24 11:07 JenspederM