Packaged kedro pipeline does not work well on Databricks
Originated from internal users: https://quantumblack.slack.com/archives/C7JTFSFUY/p1661236540255789
- Packaged Kedro projects are run as a CLI -> Click generates a `sys.exit()`, which causes problems on Databricks.
This is one of the things I have fixed in https://github.com/kedro-org/kedro/pull/1423. I really need to write it up so we can fix it properly (I have a better fix than that PR). Let me check the Slack thread now anyway.
Adding some more comments to document what happened. In summary, there are two issues:
- The CLI mode generates a `sys.exit()`, which some platforms don't welcome (e.g. Databricks), and users must revert to using `KedroSession` for a packaged Kedro library:

```python
from kedro.framework.session import KedroSession
from kedro.framework.project import configure_project

package_name = "<your_package_name>"
configure_project(package_name)
with KedroSession.create(package_name) as session:
    session.run()
```
- Related to #1728: `rich` overrides `ipython._showtraceback`, so Databricks ignores any Python exception and treats a Databricks job as successful even if the Python program terminated with an exception.
A few more notes on this. The `sys.exit` problem is already known. It is actually completely independent of Databricks and happens in a plain IPython session too (which makes it much easier to try out). It's discussed further in #1423. There are ways around it that don't require you to invoke `KedroSession` explicitly; it basically boils down to invoking Click in such a way that it doesn't do the `sys.exit`:
**Adventures with click**

I opened an issue on the click repo (which didn't go down well... 😅) describing the original source of problems 1, 2, 3. In short, when you run a Click command it always exits afterwards with an exception, even if execution was successful. This means it's impossible to run any code after `main` in a Python script, and it also massively confuses IPython, which blows up.

For the record, some of the things that don't quite fix this are:

- `run(standalone_mode=False)`, as suggested on the click repo. This solves problems 2 and 3 but not 1.
- `run.callback`. This nearly gets you there, but it doesn't fill in the default arguments for you, so you need to give all the arguments explicitly in the function call, which is horrible.
- `run.main` is equivalent to what we were doing before (`run.__call__`), so it doesn't help.
- `run.invoke` is really close to doing it but, unlike `run.forward`, it doesn't fill in default arguments and won't notice CLI arguments when called with `python -m spaceflights --pipeline=ds`.
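For illustration only, here is a minimal, self-contained sketch (not Kedro's actual fix; the command and option are invented) of calling a Click command without the implicit `sys.exit`, using Click's documented `standalone_mode=False`:

```python
import click

@click.command()
@click.option("--pipeline", default="__default__")
def run(pipeline):
    """Toy stand-in for a Kedro-style `run` command."""
    print(f"Running pipeline {pipeline}")
    return pipeline

# standalone_mode=False tells Click to skip its sys.exit() and exception
# swallowing, so this is safe to call from IPython or a notebook, and it
# returns the callback's return value.
result = run(["--pipeline", "ds"], standalone_mode=False)
```

As the list above notes, this addresses the exiting behaviour but not every problem listed there.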
Problem 2 is fixed by https://github.com/kedro-org/kedro/pull/1769:

> A fix for the issue raised by @pstanisl in https://github.com/kedro-org/kedro/issues/1728. Hopefully this will be fixed on the rich side in https://github.com/Textualize/rich/issues/2455 in due course, but for now I'm fixing it here because it's proving problematic, e.g. for @PedroAbreuQB. Jobs on Databricks that should be detected as Failed are currently marked as Succeeded.
>
> The solution is just to disable `rich.traceback.install` on Databricks.
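A minimal sketch of that idea, assuming detection via the `DATABRICKS_RUNTIME_VERSION` environment variable (which Databricks clusters set); the exact check in #1769 may differ:

```python
import os

import rich.traceback

# Only install rich's traceback handler when not running on Databricks,
# so exceptions still reach Databricks' own job-status machinery.
if "DATABRICKS_RUNTIME_VERSION" not in os.environ:
    rich.traceback.install()
```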
> I opened an issue on the click repo (which didn't go down well... 😅)

That's... disappointing. See a previous report of a similar issue: https://github.com/pallets/click/issues/1054 (2018, locked in 2020).
I would like to better understand the context of this issue. First, I tried creating a barebones Click CLI:
`$ cat click_adventures.py`

```python
"""
Adventures with Click.
"""
__version__ = "0.1.0"

import click


@click.command()
def cli():
    print("Hello, world!")


if __name__ == "__main__":
    cli()
```
`$ cat pyproject.toml`

```toml
[build-system]
requires = ["flit_core >=3.2,<4"]
build-backend = "flit_core.buildapi"

[project]
name = "click_adventure"
authors = [{name = "Juan Luis", email = "[email protected]"}]
dependencies = [
    "click",
]
dynamic = ["version", "description"]

[project.scripts]
click_adventure = "click_adventure:cli"
```
And it looks like running it using `%sh` in Databricks works just fine:
Next, I packaged the default `spaceflights` starter with Kedro 0.18.9 (which, by the way, includes things like `kedro-viz` and `jupyterlab` as dependencies... hopefully fixed when we address #2519) and installed it on Databricks. Then I uploaded the config and the data files, and everything seemed to work fine too:
What didn't work was running the project from Python, as advised in our docs https://docs.kedro.org/en/stable/tutorial/package_a_project.html#run-a-packaged-project
...
Could this issue be addressed by

- Changing the official recommendation, and
- Making the CLI functions very thin wrappers over non-Click functions that do the actual work? Essentially:

```python
def run(...):
    ...

@click.command()
def run_command(args):
    click.echo(run(args))
```

(from https://github.com/pallets/click/issues/1054#issuecomment-400347771)
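To make the pattern concrete, here is a hypothetical fleshed-out version (the `--pipeline` option and function bodies are invented for illustration):

```python
import click


def run(pipeline: str = "__default__") -> str:
    """Plain Python function that does the actual work; safe to import and call."""
    return f"ran {pipeline}"


@click.command()
@click.option("--pipeline", default="__default__")
def run_command(pipeline):
    """Thin Click wrapper; only this layer deals with CLI concerns (and sys.exit)."""
    click.echo(run(pipeline))
```

Programmatic users would import `run`, while the console script entry point targets `run_command`.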
I just started using Kedro on Databricks, so please excuse my innocence and possibly candid calls for a solution - I'm sure everyone has tried everything in the book, and I'm also sure I'm about to discover more weirdnesses.
This is an ongoing issue https://www.linen.dev/s/kedro/t/13153013/exit-traceback-most-recent-call-last-file-databricks-python-#14d50244-c91a-4738-b0de-11b822eb181b
Looking at the traceback, this is the code the user was trying to run:
```python
params_string = ','.join(list_params)
main(
    [
        "--env", conf,
        "--pipeline", pipeline,
        f"--params={params_string}"
    ]
)
```
And filling in the gaps, I think the `main` function does something like this:

```python
run = _find_run_command(package_name)
run(*args, **kwargs)
```
which resembles https://github.com/kedro-org/kedro/blob/0.18.11/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/main.py
So, most likely the user is doing:

```python
from kedro_project.__main__ import main

main()
```
This touches on https://github.com/kedro-org/kedro/issues/2384 and https://github.com/kedro-org/kedro/issues/2682. I think we should address this soon.
In line with what I said in https://github.com/kedro-org/kedro/issues/2682#issuecomment-1593075949,

- I don't think we should rely on CLI functions to programmatically run Kedro (and so I'm in favour of refactoring rather than using `standalone_mode` as suggested in #2682).
- In that refactor I think we should hide CLI functions from the public API (otherwise users will be tempted to use them in Python, which is conceptually wrong, aside from Click's decision to raise an exception).
This bug report is more than 1 year old. Can we get an updated view of what's happening here?

We have 3 official workflows for Databricks:

- Using Databricks notebooks https://docs.kedro.org/en/stable/deployment/databricks/databricks_notebooks_development_workflow.html, we recommend that users do `session.run()`. As far as I understand, this is correct.
- Using dbx[^1] https://docs.kedro.org/en/latest/deployment/databricks/databricks_ide_development_workflow.html, we also recommend `session.run()`, so same as above.
- Using Databricks jobs https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html, we tell users to write a custom entry point, supposedly because of this issue. But isn't that workaround enough?
On the other hand, in this page https://docs.kedro.org/en/stable/tutorial/package_a_project.html we're telling users to do this:

```python
from <package_name>.__main__ import main

main(["--pipeline", "__default__"])  # or simply main() if you don't want to provide any arguments
```

which in my opinion is conceptually broken, because we are abusing a CLI function to run it programmatically.
@noklam called this above a "downgrade", but what's the problem with running

```python
with KedroSession.create(package_name) as session:
    session.run()
```

? Should we update our docs instead?
@idanov's #3191 seems to make that CLI function use Click's standalone mode conditionally, but in my opinion (1) it's a hack, and (2) it doesn't really help us pinpoint why our users are running code that is broken instead of following the instructions from our docs.
What am I missing?
[^1]: Note that dbx is now deprecated in favour of Databricks Asset Bundles; this change hasn't made it to our docs yet https://github.com/kedro-org/kedro/pull/3212/
@astrojuanlu The important thing here is the distinction between a "packaged project" and a "Kedro project".

In a simplified world, `python -m <package>` or the `main` function is basically `bootstrap_project` + `KedroSession.run`. There are some minor differences (e.g. `cli.py`). Basically, `session` is used for a non-packaged project and `main` is used for a packaged project.

The benefit of having an entry point is that most deployment systems support it, because it is just `python -m`. To use `session.run` you need a custom Python script for execution.
> @noklam called this https://github.com/kedro-org/kedro/issues/1807#issuecomment-1228342162 a "downgrade", but what's the problem with running
>
> ```python
> with KedroSession.create(package_name) as session:
>     session.run()
> ```

See https://github.com/kedro-org/kedro/pull/1423#issuecomment-1228944699; originally I even proposed `kedro run --package_name=xxx`.
> Using Databricks jobs https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html, we tell users to write a custom entry point, supposedly because of this issue. But isn't that workaround enough?

For this, I think it is a bad workaround, as users shouldn't need a special entry point; https://github.com/kedro-org/kedro/pull/3191 should replace it.
Assuming you didn't start with `databricks-iris`, you will need to do that copy & paste. This step can be removed, because any Kedro project can share ONE entry point.
Honestly, I don't hate `KedroSession`, because users are already familiar with it, and they can stick with it for both a normal Kedro project and a packaged one. That said, I see some extra benefits to having the entry point and keeping things consistent.

I think we need to be pragmatic here; whether or not the CLI is the best way to do this is arguable and will require more discussion and design. Unless we plan to change this in the short term, I think we should at least move forward with #3191 and continue the CLI vs Click wrapper discussion afterwards.
> The important thing here is the distinction between a "packaged project" and a "Kedro project".
>
> In a simplified world, `python -m <package>` or the `main` function is basically `bootstrap_project` + `KedroSession.run`. There are some minor differences (e.g. `cli.py`). Basically, `session` is used for a non-packaged project and `main` is used for a packaged project.
Thanks, that's helpful. So this issue concerns the Databricks workflows that use packaged projects only, hence "Databricks jobs" at the moment. Please correct me if I'm wrong.
> For this, I think it is a bad workaround, as users shouldn't need a special entry point
I agree 💯 I understand then that the problem is that currently users are forgetting to do that step, and they expect the default entrypoint to just work.
> Thanks, that's helpful. So this issue concerns the Databricks workflows that use packaged projects only, hence "Databricks jobs" at the moment. Please correct me if I'm wrong.

It is more generic: `from my_package import main` wouldn't work nicely in any interactive environment (IPython, Jupyter) because of the `sys.exit`. Of course, you can still use the workaround of manually doing `bootstrap_project`.

https://github.com/kedro-org/kedro/issues/3237 has more context, but #3191 goes beyond just Databricks; it also affects integrating Kedro with other downstream tasks (the CLI way doesn't work currently, so only `KedroSession.run` works).
> Of course, you can still use the workaround of manually doing `bootstrap_project`

This has been my point all along: this shouldn't be a "workaround", it should be our recommended way of doing it.

I'm not blocking #3191 on this discussion, but I'm hoping we can have it some day. Reason being: running a Kedro project programmatically requires either abusing these CLI functions, or adding some magic (`bootstrap_project`? `configure_project`? what do these functions do?) that obscures what's happening.
In the past we had a few nice `.puml` diagrams showing the whole list of possible execution flows in different modes of execution, but it seems that they have been deleted from the repo. In general, the idea is that we have two distinct regimes:

- running during development, i.e. off the source code
- running in production, i.e. off a packaged project

`bootstrap_project` is only used in the first regime, because it adds `src/` (or the custom directory you've decided to use) to the Python import path. `configure_project` is the one allowing you to import things from your own project through `kedro.framework.project`; it is essential for having a uniform way for Kedro and Kedro plugins to access project-specific things like settings and registered pipelines without knowing the project package name (which is obviously different for different Kedro projects). `configure_project` is what gives Kedro the Django-like experience for settings.
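A rough sketch of the two regimes, assuming the 0.18-era API used elsewhere in this thread (`bootstrap_project`, `configure_project`, `KedroSession.create`) and a hypothetical package name; treat it as illustrative, not canonical:

```python
from pathlib import Path

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Development regime: run off the source tree. bootstrap_project adds src/
# to the import path and calls configure_project under the hood.
metadata = bootstrap_project(Path.cwd())
with KedroSession.create(project_path=Path.cwd()) as session:
    session.run()

# Production regime: the package is already installed and importable, so
# only configure_project is needed before creating a session.
configure_project("my_package")  # hypothetical package name
with KedroSession.create("my_package") as session:
    session.run()
```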
It's probably worth documenting that for advanced users who would like to know how Kedro works internally, and maybe retrieving the old `.puml` diagrams we've had.
`.puml` diagrams for reference: https://github.com/kedro-org/kedro/tree/0.17.7/docs/draft
I generated a reproducer that doesn't depend on Kedro.

Source code: https://github.com/astrojuanlu/simple-click-entrypoint

Job definition: (screenshot)

Result: (screenshot)
Reported this upstream https://github.com/MicrosoftDocs/azure-docs/issues/116964
Did a bit more debugging https://github.com/kedro-org/kedro/pull/3191#pullrequestreview-1764364409; the `sys.ps1` trick proposed there is not working. There's not enough clarity on how Databricks launches the entry point.

It's clear that passing `standalone_mode=False` should work; the problem is how to detect when to pass it.
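For context, a hypothetical sketch of what such a conditional entry point could look like; the command is a stand-in, and the `sys.ps1` check is the interactivity heuristic that, per the review comment above, fell short on Databricks:

```python
import sys

import click

@click.command()
@click.option("--pipeline", default="__default__")
def run(pipeline):
    """Stand-in for the project's `run` command."""
    print(f"Running {pipeline}")

def main(*args, **kwargs):
    # Interactive interpreters set sys.ps1; the idea was to skip Click's
    # sys.exit() there and keep it for real CLI invocations. Databricks
    # doesn't behave like either case cleanly, which is why this failed.
    interactive = hasattr(sys, "ps1")
    run(*args, standalone_mode=not interactive, **kwargs)

if __name__ == "__main__":
    main()
```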
The sorry state of this issue is as follows:

- Microsoft created an internal work item to improve the docs; there is no way to track progress in public.
- Databricks (in private communication) expressed that they'd rather not document the low-level details of how Databricks invokes the code.
- Our conditional activation of the standalone mode in #3191 didn't work, so I went ahead and closed the PR.
- Unconditionally disabling Click's standalone mode (always passing `standalone_mode=False`) is probably unwise, since the real CLI would likely misbehave.

We need to keep brainstorming. I honestly think the best way forward would be to simplify the default `__main__` and entry point https://github.com/kedro-org/kedro/issues/3051, ideally making them independent from our Click wrappers and avoiding having 5 different context-dependent ways of running a Kedro project #3237, but maybe there are other possible solutions.
Left a comment here: https://github.com/kedro-org/kedro/issues/3237#issuecomment-1969278134. This goes beyond Databricks, so I want to keep the conversation on the broader topic.
Useful bit of information I found in `uv` https://github.com/astral-sh/uv/pull/3890: the `JPY_SESSION_NAME` environment variable is set for notebooks.

See:

- `jupyter-server>=2.0.0` setting the `JPY_SESSION_NAME` environment variable https://github.com/jupyter-server/jupyter_server/pull/679/
- `ipykernel>=6.21.2` setting the global `__session__` variable to the value of `JPY_SESSION_NAME`, if set https://github.com/ipython/ipykernel/pull/1095/
(From https://stackoverflow.com/a/77904549 [^1])
For whatever reason this doesn't work in VS Code notebooks though; they set a variable called `__vsc_ipynb_file__` instead https://github.com/microsoft/vscode-jupyter/pull/8531
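A hypothetical detection helper based on those findings (heuristics only; as noted below, they don't hold on Databricks):

```python
import os

def running_in_notebook(user_ns: dict) -> bool:
    """Best-effort notebook detection. Pass the kernel's user namespace
    (e.g. get_ipython().user_ns) so the VS Code global can be seen."""
    # jupyter-server>=2.0.0 exports JPY_SESSION_NAME to kernels it starts.
    if os.environ.get("JPY_SESSION_NAME"):
        return True
    # VS Code notebooks don't set it, but inject __vsc_ipynb_file__ instead.
    return "__vsc_ipynb_file__" in user_ns
```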
Dropping this here in case it helps.
[^1]: Unlike ChatGPT, I can cite my sources 😉
Nah, none of this works on Databricks 👎🏼