
Packaged kedro pipeline does not work well on Databricks

Open noklam opened this issue 2 years ago • 20 comments

Originated from internal users https://quantumblack.slack.com/archives/C7JTFSFUY/p1661236540255789

  • Packaged Kedro projects are run as a CLI -> Click generates a sys.exit(), which causes problems on Databricks.

noklam avatar Aug 24 '22 09:08 noklam

This is one of the things I have fixed in https://github.com/kedro-org/kedro/pull/1423. I really need to write it up so we can fix it properly (I have a better fix than that PR). Let me check the slack thread now anyway.

antonymilne avatar Aug 24 '22 09:08 antonymilne

Adding some more comments to document what happened. In summary, there are 2 issues:

  1. The CLI mode generates a sys.exit(), which some platforms (e.g. Databricks) don't handle well, so users must fall back to using KedroSession for a packaged Kedro project:
from kedro.framework.session import KedroSession
from kedro.framework.project import configure_project
package_name = "<your_package_name>"  # replace with your actual package name
configure_project(package_name)

with KedroSession.create(package_name) as session:
    session.run()
  2. Related to #1728: rich overrides ipython._showtraceback, and Databricks ignores any Python exception, treating a "databricks job" as successful even if the Python program terminates with an exception.

noklam avatar Aug 26 '22 10:08 noklam

A few more notes on this. The sys.exit problem is already known. This is actually a problem completely independent of databricks and happens just in a normal IPython session (which makes it much easier to try out). It's discussed further in #1423. There are ways around it that don't need you to invoke KedroSession explicitly. It basically boils down to invoking click in such a way that it doesn't do the sys.exit:

Adventures with click

I opened an issue on the click repo (which didn't go down well... 😅) describing the original source of problems 1, 2, 3. In short, when you run a click command it always exits afterwards with an Exception, even if execution was successful. This means it's impossible to run any code after that main in a Python script and also massively confuses ipython, which blows up.

For the record, some of the things that don't quite fix this are:

  • run(standalone_mode=False), as suggested on the click repo. This solves problems 2 and 3 but not 1
  • run.callback. This nearly gets you there but doesn't fill in the default arguments for you so you need to explicitly give all the arguments in the function call, which is horrible
  • run.main is equivalent to what we were doing before (run.__call__) so doesn't help
  • run.invoke is really close to doing it, but unlike run.forward it doesn't fill in default arguments, and it won't notice CLI arguments when called with python -m spaceflights --pipeline=ds
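To illustrate the escape hatch that does work, here is a minimal sketch with a toy command (not Kedro's actual run command): passing standalone_mode=False to a command's main makes click return the callback's value instead of calling sys.exit(), so code after the call keeps running.

```python
import click


@click.command()
@click.option("--pipeline", default="__default__")
def run(pipeline):
    """Toy stand-in for a packaged project's run command."""
    click.echo(f"Running pipeline {pipeline}")
    return pipeline


# With standalone_mode=False, click returns the callback's return value
# instead of calling sys.exit(), so the script continues normally.
result = run.main(["--pipeline", "ds"], standalone_mode=False)
```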

Problem 2 is fixed by https://github.com/kedro-org/kedro/pull/1769:

A fix for the issue raised by @pstanisl in https://github.com/kedro-org/kedro/issues/1728. Hopefully this will be fixed on the rich side in https://github.com/Textualize/rich/issues/2455 in due course, but for now I'm fixing it here because it's proving problematic, e.g. for @PedroAbreuQB. Jobs on Databricks that should be detected as Failed are currently marked as Succeeded.

The solution is just to disable rich.traceback.install on databricks.
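One way to implement that conditional disabling is a simple environment check; a sketch of the detection idea (not Kedro's exact code), assuming Databricks runtimes set the DATABRICKS_RUNTIME_VERSION environment variable:

```python
import os


def running_on_databricks() -> bool:
    # Databricks runtimes set DATABRICKS_RUNTIME_VERSION in the environment;
    # use it to decide whether rich.traceback.install() is safe to call, so
    # that on Databricks exceptions propagate to the job scheduler untouched.
    return "DATABRICKS_RUNTIME_VERSION" in os.environ
```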

antonymilne avatar Sep 08 '22 13:09 antonymilne

I opened an issue on the click repo (which didn't go down well... 😅)

That's... disappointing. See a previous report of a similar issue https://github.com/pallets/click/issues/1054 (2018, locked in 2020)

I would like to better understand the context of this issue. First, I tried creating a barebones Click CLI:

$ cat click_adventures.py
"""
Adventures with Click.
"""

__version__ = "0.1.0"

import click


@click.command()
def cli():
    print("Hello, world!")


if __name__ == "__main__":
    cli()
$ cat pyproject.toml
[build-system]
requires = ["flit_core >=3.2,<4"]
build-backend = "flit_core.buildapi"

[project]
name = "click_adventure"
authors = [{name = "Juan Luis", email = "[email protected]"}]
dependencies = [
  "click",
]
dynamic = ["version", "description"]

[project.scripts]
click_adventure = "click_adventure:cli"

And it looks like running it using %sh in Databricks works just fine:

[screenshot]

Next, I packaged the default spaceflights starter with Kedro 0.18.9 (which, by the way, includes things like kedro-viz and jupyterlab as dependencies... hopefully fixed when we address #2519) and installed it on Databricks. Then I uploaded the config and the data files, and everything seemed to work fine too:

[screenshot]

What didn't work was running the project from Python, as advised in our docs https://docs.kedro.org/en/stable/tutorial/package_a_project.html#run-a-packaged-project

[screenshots]

Could this issue be addressed by

  1. Changing the official recommendation and
  2. Making the CLI functions very thin wrappers over non-Click functions that do the actual work? Essentially:
def run(...):
    ...

@click.command()
def run_command(args):
    click.echo(run(args))

(from https://github.com/pallets/click/issues/1054#issuecomment-400347771)
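A runnable version of that thin-wrapper pattern, with toy names (this is not Kedro's actual API), might look like:

```python
import click


def run(pipeline: str) -> str:
    # Plain Python function with the actual work; callable from scripts,
    # notebooks, and tests without triggering any sys.exit().
    return f"ran {pipeline}"


@click.command()
@click.option("--pipeline", default="__default__")
def run_command(pipeline):
    # Thin click wrapper: only argument parsing and output live here.
    click.echo(run(pipeline))
```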

I just started using Kedro on Databricks, so please excuse my innocence and possibly candid calls for a solution - I'm sure everyone has tried everything in the book, and I'm also sure I'm about to discover more weirdnesses.

astrojuanlu avatar Jun 01 '23 06:06 astrojuanlu

This is an ongoing issue https://www.linen.dev/s/kedro/t/13153013/exit-traceback-most-recent-call-last-file-databricks-python-#14d50244-c91a-4738-b0de-11b822eb181b

Looking at the traceback, this is the code the user was trying to run:

params_string = ','.join(list_params)
main(
    [
        "--env", conf,
        "--pipeline", pipeline,
        f"--params={params_string}"
    ]
)

And filling the gaps, I think the main function does something like this:

run = _find_run_command(package_name)
run(*args, **kwargs)

which resembles https://github.com/kedro-org/kedro/blob/0.18.11/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/main.py

So, most likely the user is doing

from kedro_project.__main__ import main

main()

This touches on https://github.com/kedro-org/kedro/issues/2384 and https://github.com/kedro-org/kedro/issues/2682. I think we should address this soon.

astrojuanlu avatar Jul 12 '23 09:07 astrojuanlu

In line with what I said in https://github.com/kedro-org/kedro/issues/2682#issuecomment-1593075949,

  • I don't think we should rely on CLI functions to programmatically run Kedro (and so I'm in favor of refactoring rather than using standalone=True as suggested in #2682)
  • In that refactor I think we should hide CLI functions from the public API (otherwise users will be tempted to use them in Python, which is conceptually wrong, aside from click's decision to raise an exception)

astrojuanlu avatar Jul 12 '23 09:07 astrojuanlu

This bug report is more than 1 year old. Can we get an updated view of what's happening here?

We have 3 official workflows for Databricks:

  • Using Databricks notebooks https://docs.kedro.org/en/stable/deployment/databricks/databricks_notebooks_development_workflow.html we are recommending the user that they do session.run(). As far as I understand, this is correct.
  • Using dbx[^1] https://docs.kedro.org/en/latest/deployment/databricks/databricks_ide_development_workflow.html we are also recommending session.run(), so same as above.
  • Using Databricks jobs https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html and here we are telling users to write a custom entry point, supposedly because of this issue. But isn't that workaround enough?

On the other hand, in this page https://docs.kedro.org/en/stable/tutorial/package_a_project.html we're telling users to do this:

from <package_name>.__main__ import main

main(
    ["--pipeline", "__default__"]
)  # or simply main() if you don't want to provide any arguments

which in my opinion is conceptually broken, because we are abusing a CLI function to run it programmatically.

@noklam called this above a "downgrade", but what's the problem with running

with KedroSession.create(package_name) as session:
    session.run()

? Should we update our docs instead?

@idanov's #3191 seems to make that CLI function use click standalone mode conditionally, but in my opinion (1) it's a hack and (2) doesn't really help us pinpoint why our users are running code that is broken instead of following the instructions from our docs.

What am I missing?

[^1]: Note that dbx is now deprecated in favour of Databricks Asset Bundles, this change hasn't made it to our docs yet https://github.com/kedro-org/kedro/pull/3212/

astrojuanlu avatar Nov 08 '23 11:11 astrojuanlu

@astrojuanlu The important thing here is "Packaged project" and "Kedro Project"

In a simplified world, python -m <package> or the main function is basically bootstrap_project + KedroSession.run. There are some minor differences (e.g. cli.py). Basically, KedroSession is used for non-packaged projects and main is used for packaged projects.

The benefit of having an entrypoint is that most deployment systems support it, because it is just python -m. To use session.run you need a custom Python script for execution.

@noklam called this https://github.com/kedro-org/kedro/issues/1807#issuecomment-1228342162 a "downgrade", but what's the problem with running

with KedroSession.create(package_name) as session:
    session.run()

See https://github.com/kedro-org/kedro/pull/1423#issuecomment-1228944699; originally I even proposed kedro run --package_name=xxx

Using Databricks jobs https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html and here we are telling users to write a custom entry point, supposedly because of this issue. But isn't that workaround enough?

For this, I think it is a bad workaround, as it shouldn't need a special entrypoint; https://github.com/kedro-org/kedro/pull/3191 should replace it. Assuming you didn't start with databricks-iris, you will need to do that copy & paste. This step can be removed because any Kedro project can share ONE entrypoint.

Honestly, I don't hate KedroSession, because users are already familiar with it and can stick with it for both a normal Kedro project and a packaged one. That said, I see some extra benefit in the entrypoint and in keeping things consistent.

I think we need to be pragmatic here; whether or not the CLI is the best way to do this is arguable, and that will require more discussion and design. Unless we plan to change this in the short term, I think we should at least move forward with #3191 and continue the CLI vs click wrapper discussion.

noklam avatar Nov 08 '23 12:11 noklam

The important thing here is "Packaged project" and "Kedro Project"

In a simplified world, python -m <package> or the main function is basically bootstrap_project + KedroSession.run. There are some minor differences (e.g. cli.py). Basically, KedroSession is used for non-packaged projects and main is used for packaged projects.

Thanks, that's helpful. So this issue concerns the Databricks workflows that use packaged projects only, hence "Databricks jobs" at the moment. Please correct me if I'm wrong.

For this, I think this is a bad workaround as it shouldn't need a special entrypoint

I agree 💯 I understand then that the problem is that currently users are forgetting to do that step, and they expect the default entrypoint to just work.

astrojuanlu avatar Nov 08 '23 13:11 astrojuanlu

Thanks, that's helpful. So this issue concerns the Databricks workflows that use packaged projects only, hence "Databricks jobs" at the moment. Please correct me if I'm wrong.

It is more generic: from my_package import main wouldn't work nicely in any interactive environment (IPython, Jupyter) because of the sys.exit. Of course, you can still use the workaround of manually doing bootstrap_project.

https://github.com/kedro-org/kedro/issues/3237 has more context, but #3191 goes beyond just Databricks; it also affects integrating Kedro with other downstream tasks (the CLI way doesn't work currently, so only KedroSession.run works).

noklam avatar Nov 08 '23 13:11 noklam

Of course you can still use the workaround by manually doing bootstrap_project

This has been my point all along: that this shouldn't be a "workaround", it should be our recommended way of doing it.

I'm not blocking #3191 on this discussion, but I'm hoping we can have it some day. Reason being: running a Kedro project programmatically requires either abusing these CLI functions, or adding some magic (bootstrap_project? configure_project? what do these functions do?) that obscures what's happening.

astrojuanlu avatar Nov 08 '23 13:11 astrojuanlu

In the past we had a few nice .puml diagrams, showing the whole list of possible execution flows in different modes of execution, but it seems that they have been deleted from the repo. But in general, the idea is that we have two distinct regimes:

  • running during development, i.e. off the source code
  • running during production, i.e. off of a packaged project

bootstrap_project is only used in the first regime, because it adds src/ or the custom directory you've decided to use to the Python import path. configure_project is the one allowing you to import things from your own project through kedro.framework.project and is essential for having a uniform way for Kedro and Kedro plugins to access project specific things like settings and registered pipelines without knowing the project package name (which is obviously always different for different Kedro projects). configure_project is what gives Kedro the django-like experience for settings.
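The two regimes can be sketched like this (hedged: the exact KedroSession.create signature has varied across Kedro versions; the names below follow the snippets earlier in this thread):

```python
from pathlib import Path


def run_from_source(project_path: Path):
    # Development regime: bootstrap_project adds src/ (or the configured
    # source directory) to the Python import path before creating the session.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    bootstrap_project(project_path)
    with KedroSession.create() as session:
        session.run()


def run_from_package(package_name: str):
    # Production regime: configure_project wires up settings and registered
    # pipelines for an installed package; no source tree is needed.
    from kedro.framework.project import configure_project
    from kedro.framework.session import KedroSession

    configure_project(package_name)
    with KedroSession.create(package_name) as session:
        session.run()
```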

It's probably worth documenting that for advanced users who would like to know how Kedro works internally and maybe retrieve the old .puml diagrams we've had.

idanov avatar Nov 09 '23 10:11 idanov

.puml diagrams for reference https://github.com/kedro-org/kedro/tree/0.17.7/docs/draft

astrojuanlu avatar Nov 10 '23 08:11 astrojuanlu

I generated a reproducer that doesn't depend on Kedro.

Source code: https://github.com/astrojuanlu/simple-click-entrypoint

Job definition:

[screenshot]

Result:

[screenshots]

astrojuanlu avatar Nov 10 '23 08:11 astrojuanlu

Reported this upstream https://github.com/MicrosoftDocs/azure-docs/issues/116964

astrojuanlu avatar Nov 10 '23 08:11 astrojuanlu

Did a bit more debugging https://github.com/kedro-org/kedro/pull/3191#pullrequestreview-1764364409 the sys.ps1 trick proposed there is not working. There's not enough clarity on how Databricks launches the entry point.

It's clear that passing standalone_mode=False should work - the problem is how to detect when to pass it.
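For reference, the detection attempts boiled down to heuristics like the following sketch (which, as noted, does not fire inside Databricks jobs):

```python
import sys


def likely_interactive() -> bool:
    # sys.ps1 exists only in interactive interpreters, and sys.flags.interactive
    # reflects python -i; Databricks job launches apparently set neither,
    # which is why this style of detection failed in #3191.
    return hasattr(sys, "ps1") or bool(sys.flags.interactive)
```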

astrojuanlu avatar Dec 05 '23 08:12 astrojuanlu

The sorry state of this issue is as follows:

  • Microsoft created an internal work item to improve the docs, no way to track progress in public
  • Databricks (in private communication) expressed that they'd rather not document the low-level details of how Databricks invokes the code
  • Our conditional activation of the standalone mode in #3191 didn't work, so I went ahead and closed the PR
  • Unconditionally using standalone mode is probably unwise, since the CLI would likely misbehave

We need to keep brainstorming. I honestly think the best way forward would be to simplify the default __main__ and entry point (https://github.com/kedro-org/kedro/issues/3051), ideally making them independent from our Click wrappers and avoiding the 5 different context-dependent ways of running a Kedro project (#3237), but maybe there are other possible solutions.

astrojuanlu avatar Feb 19 '24 09:02 astrojuanlu

Left a comment here: https://github.com/kedro-org/kedro/issues/3237#issuecomment-1969278134. This goes beyond Databricks, so I want to keep the conversation on the broader topic.

noklam avatar Feb 28 '24 15:02 noklam

Useful bit of information I found in uv: https://github.com/astral-sh/uv/pull/3890 the JPY_SESSION_NAME environment variable is set for notebooks.

See:

  • jupyter-server>=2.0.0 setting the JPY_SESSION_NAME environment variable https://github.com/jupyter-server/jupyter_server/pull/679/
  • ipykernel>=6.21.2 setting the global __session__ variable with the value of JPY_SESSION_NAME if set https://github.com/ipython/ipykernel/pull/1095/

(From https://stackoverflow.com/a/77904549 [^1])

For whatever reason this doesn't work on VSCode notebooks though, they set a variable called __vsc_ipynb_file__ https://github.com/microsoft/vscode-jupyter/pull/8531
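Put together, a notebook-detection helper based on these two signals might look like this sketch (though, per the thread, these signals are absent on Databricks):

```python
import os


def looks_like_notebook(user_ns=None) -> bool:
    # jupyter-server >= 2.0 exports JPY_SESSION_NAME to the kernel process,
    # while VS Code notebooks instead inject __vsc_ipynb_file__ into the
    # user namespace, so both signals need checking.
    if "JPY_SESSION_NAME" in os.environ:
        return True
    return user_ns is not None and "__vsc_ipynb_file__" in user_ns
```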

Dropping this here in case it helps.

[^1]: Unlike ChatGPT, I can cite my sources 😉

astrojuanlu avatar Jun 04 '24 06:06 astrojuanlu

Nah, none of this works on Databricks 👎🏼

astrojuanlu avatar Jun 04 '24 06:06 astrojuanlu