[Parent] - Notebook/IPython Debug line magic `%load_node` feature discussion thread
Description
This ticket is created to track and discuss the development of this feature.
What does this feature do?
In short, it tries to recreate the context of a pipeline error by eliminating manual steps as much as possible to provide a better UX. It loads the data, imports, and function body into notebook cells, allowing users to quickly start debugging, plotting the data, etc.
Tracked Tickets:
- [ ] #3510 is the initial PR for the implementation
- [x] #3585
- [x] #2011
- [x] #3536
- [ ] #3528 - Add e2e test, blocked by an issue where running the notebook programmatically yields different results.
Comments
I have collected some comments from a few users and these are some quotes categorised by themes.
IPython
> But I agree, for the IPython version, since you're likely in your IDE, there's no need to bring in the function body. You can open the right file and dev there.

> Hi, very nice! Is it something that could work in `kedro ipython` directly as well?
Support more platform
- Supported platforms:
  - Mainly Notebook & Lab
  - VS Code Notebook (not yet)
  - Databricks (not tested)
Two-way sync
> Would be handy to debug, but you should do this in the IDE instead of a notebook. If you change the code, you still have to manually sync the changes to VCS.

> Looks awesome @Nok. Would there be a path back from notebook into the project's files (as it was with `kedro jupyter convert`)?
Recursive definition of function body
Also how would this work if your node function imports other functions which may import other functions? You only show where depth is 1?
Edge Case improvement
- Handle imports that are not at the top of the script, i.e. functions defined in the same file as the node function, which live in `locals()`.
- Intermediate MemoryDataset
> As a stretch goal, a runner like `--runner=SequentialDebug` where the last failed node spins up a notebook with the function body and data available to work out why it failed 🚀
This is actually a very cool idea - alternatively, instead of a notebook it could be a (pdb) session
`ipdb`!
Context for Tech Design - 31/1/24
Feature motivation - #1832
To summarise, debugging Kedro projects in notebooks is a painful process. One of the suggestions to make this easier for users was the proposal of a `%load_node` line magic, which would load a node from a user's Kedro project pipeline into notebook cells for easier debugging. The current MVP for this feature lives in #3510.
Steps to test the MVP
- Clone the PR and install your local Kedro version
- Navigate to an existing test project, or to a directory where you wish to create one and run `kedro new`. Make sure to select example code if not using a starter.
- Navigate into your project root
- Run `kedro jupyter lab`
- In the notebook, test the line magic with `%load_node <node-name>`
The current implementation
The line magic currently will load any imports in the source file of the node, the node's inputs (assumed to be declared in the catalog), and the body of the node function.
This serves as an excellent starting point; the implementation successfully resolves:
- Nested functions
- One line functions
- Messy/scattered import statements
However, it does not account/allow for:
- References to objects out of node function scope but within the source file
- MemoryDataset inputs
- `kedro ipython` and other platforms
- Unnamed nodes (a node name is not required when defining a node)
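The loading described above can be sketched with the standard `ast` module. This is an illustrative sketch only, not Kedro's actual implementation; `NODE_SOURCE` and `split_for_cells` are made-up names:

```python
import ast
import textwrap

# A toy node source file; pandas is only parsed as text, never imported here.
NODE_SOURCE = '''\
import pandas as pd

def preprocess(raw_df):
    cleaned = raw_df.dropna()
    return cleaned
'''

def split_for_cells(source, func_name):
    """Split a node's source into the pieces %load_node puts into cells."""
    tree = ast.parse(source)
    # Collect all top-level import statements, wherever they appear
    imports = [ast.get_source_segment(source, stmt) for stmt in tree.body
               if isinstance(stmt, (ast.Import, ast.ImportFrom))]
    func = next(stmt for stmt in tree.body
                if isinstance(stmt, ast.FunctionDef) and stmt.name == func_name)
    # Everything after the `def` line, dedented so it can run in a cell
    lines = source.splitlines()
    body = textwrap.dedent(
        "\n".join(lines[func.body[0].lineno - 1:func.end_lineno]))
    return imports, body

imports, body = split_for_cells(NODE_SOURCE, "preprocess")
print(imports)
print(body)
```

Note that the extracted body still contains the `return` statement, which is exactly edge case 3 discussed below.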
This brings us to the points of our discussion.
What makes a complete feature?
The focus of this discussion is to iron out the expectations for the first version of this feature, and possible future extensions. There are four primary themes to consider. For each, we must determine whether a particular use case should be addressed now, later, much later, or never. Additionally, if something is to be done later, much later, or never, what will we do in the interim?
Consider: Extending the current implementation
Several edge cases are not handled by the line magic:
1. If a node input is a MemoryDataset
The current implementation assumes all inputs exist in the catalog, and running the cell that loads the node inputs will result in an error if an input is not defined there. We have several options when addressing this:
a. Ignore it - expect users to understand the error. Pro: No technical requirement. Con: Invites friction and confusion.
b. Add a comment to the imports block explaining the above. Pro: No technical requirement. Con: Requires users to edit their project catalog to continue debugging with the feature.
c. Resolve it with the line magic. Pro: Frictionless for users. Con: Technically challenging - requires its own investigation into how this could be implemented.
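Option (b) could look roughly like the following sketch; `load_inputs` and the dict-based catalog are stand-ins for the real `DataCatalog` API, not Kedro code:

```python
# Hypothetical sketch of option (b): instead of failing, emit an explanatory
# note for inputs missing from the catalog (likely intermediate MemoryDatasets).
def load_inputs(catalog, input_names):
    loaded = {}
    for name in input_names:
        if name in catalog:  # stand-in for a real DataCatalog lookup
            loaded[name] = catalog[name]
        else:
            print(f"# '{name}' is not in the catalog (likely a MemoryDataset); "
                  "run the upstream node to produce it, or declare it in the catalog.")
    return loaded

demo_catalog = {"raw_df": [1, 2, 3]}
inputs = load_inputs(demo_catalog, ["raw_df", "intermediate_df"])
print(sorted(inputs))
```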
2. Node references an object defined within the source-file but outside of node scope
Including only imports and the node function means we don't have access to any helper functions defined inside the file but outside of the node function. Any references to these when the node function is loaded will result in an error.
a. Ignore it - expect users to understand the error. Pro: No technical requirement. Con: Invites friction and confusion
b. Add `from <node.source.module> import *`. Pro: Frictionless for users. Con: Technically challenging - requires its own investigation into how this could be implemented.
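As a sketch, option (b) amounts to generating one extra import line from the node's module path. The module path below is a made-up example, not from the source:

```python
# Sketch of option (b): emit a wildcard import of the node's source module so
# helper functions defined alongside the node become available in the notebook.
def wildcard_import_line(module_path):
    return f"from {module_path} import *"

# Hypothetical module path, for illustration only
line = wildcard_import_line("my_project.pipelines.data_processing.nodes")
print(line)
```

The hard part (the "Con" above) is reliably resolving the node function's importable module path at runtime, not generating the line itself.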
3. Return statements make unrunnable cells
When loading a node function, we omit the defining line. Including return statements then results in a syntax error when the cell is run.
a. Ignore it - expect users to understand the error. Pro: No technical requirement. Con: Invites friction and confusion
b. Comment out the return statement. Pro: Simple. Con: Introduces syntax errors if returns are within conditionals, see:

```python
x = yyy
if some_flag:
    # return x
else:
    # return None
```
c. Comment out the return statement and introduce a print statement as a replacement. Pro: Simple to introduce - exists on #3568, eliminates errors. Con: Modifies the node; the user will have to fix this if copying the code back into their project files.
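A naive line-based version of option (c) might look like the sketch below. This is not the actual change in #3568; a real implementation would likely need an AST-based rewrite to handle strings and nested scopes:

```python
import re

# Sketch of option (c): swap `return <expr>` for `print(<expr>)` so the cell
# stays runnable. Because the statement is replaced rather than commented out,
# conditional branches keep a valid body and still parse.
def replace_returns(body):
    return re.sub(r"^(\s*)return\b\s*(.*)$", r"\1print(\2)", body, flags=re.M)

body = "x = compute()\nif some_flag:\n    return x\nelse:\n    return None\n"
print(replace_returns(body))
```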
Consider: Supporting Other Platforms
#3510 only supports Jupyter Lab/Notebook. #3536 introduces support for `kedro ipython` (support across Notebook/Lab and most IDEs and platforms, except Databricks). What should we do about:
- Databricks notebooks
- VS Code
And if we choose not to support any platforms in the current iteration, how do we inform users? (Multiple selections allowed)
- Docs
- Release Notes
- Warn when trying to use the line magic in unsupported platform
Consider: Supporting the node-finding process by making changes to the default node name
As mentioned briefly above, node names are not required when defining a node, and are often omitted by users. However, the current implementation of the line magic searches for the specified node by filtering the pipeline for matching node names, so if the name is not defined, `%load_node` will not be able to find the node.
We could address this by:
- Mentioning this in the `ValueNotFoundError`. Pro: Trivial to add. Makes user aware of the problem. Con: Requires user to modify project files to continue debugging in notebook.
- Searching all `node.func.__name__` if the node name is not found - see #3568. Pro: Solves problem. Easy to implement. Con: Ugly complexity. Behaviour diverges from `kedro run --nodes/--to-nodes/--from-nodes`.
- Modifying the default node name - see #3575. Pro: Solves issue across Kedro, not just for the line magic. Con: Technically breaking. Can we ship these changes together?
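The second option (the #3568 fallback) can be sketched as follows; `DummyNode` and `find_node` are illustrative stand-ins for Kedro's `Node` class and the magic's lookup, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class DummyNode:  # stand-in for kedro.pipeline.node.Node
    name: Optional[str]
    func: Callable

def find_node(nodes, target):
    # Existing behaviour: match on the explicit node name
    by_name = [n for n in nodes if n.name == target]
    if by_name:
        return by_name[0]
    # Fallback: match on the wrapped function's __name__, if unique
    by_func = [n for n in nodes if n.func.__name__ == target]
    if len(by_func) == 1:
        return by_func[0]
    raise ValueError(f"No unique node found for '{target}'")

def preprocess():
    pass

nodes = [DummyNode(name=None, func=preprocess)]
found = find_node(nodes, "preprocess")
print(found.func.__name__)
```

The "ugly complexity" noted above shows up when two unnamed nodes wrap functions with the same `__name__`: the fallback then has no unique answer.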
[Extra] Consider: The future of notebook debugging
There were some suggestions for the future of this feature, for completeness I am listing them here
- 2-way conversion (`%load_node` and `%save_node`?) - do we bring back `kedro jupyter convert`? xref
- A `SequentialDebug` runner - xref
- Populating problematic functions when an error results from a function call - xref
References: main PR: #3510, original discussion: #1832, experiments: #3568, related: #3575, #3536
My 2 cents:
Extending the current implementation
- 2c. Bringing the full contents of the corresponding file, including the function definitions themselves
- 3d. Do not unpack the function body, and retain the function definition (see also point above)
Supporting Other Platforms
Frankly I'd ignore `kedro ipython`, `kedro jupyter` and shift the focus to "make this work where `%load_ext kedro.ipython` works". This doesn't work for pure REPLs, and also it's unclear how other non-Jupyter notebooks would behave (they're all mostly based on IPython, but the frontend code might differ significantly). But the point is that I wouldn't consider `kedro ipython/jupyter` a platform because of https://github.com/kedro-org/kedro/issues/2777
The future of notebook debugging
I would say let's not bring back `kedro jupyter convert`, it didn't work back then and there's no reason it will work now.
Having a `%save_node` is promising, but maybe we should broaden our perspective and have a `%save_code` that is more generic? Note that this is a big, unsolved problem in the Jupyter <-> IDE interaction as a whole.
Extending the current implementation
- Nit on 3c: should it be a `display()` call?
- @astrojuanlu's suggestion 3d also makes sense to me
Supporting Other Platforms
For a lot of these edge cases, as well as for other platforms, I think I'm OK with having some friction (and perhaps marking the feature as experimental for now). I think there's a risk of going too deep in this feature, without understanding how many people actually use it/how they use it.
Tech Design Summary
Now (part of #3510):
- [x] Create parent issue to track "later" actions and to be referenced in universal warning (below): #3580
- [x] Warn users when preparing node inputs that inputs need to be explicitly declared in the catalog
- [x] ~~Handle return statements by commenting out and substituting `print`/`display()`~~ Include function definition and function call with node inputs - see comment below for why
- [x] Add a line when node not found about node name vs node function
- [x] Add a platform-agnostic warning that this is an experimental feature and only supports Jupyter and IPython
Next:
- Make sure platform support is well documented - #2011
- IPython support - #3536
Later
- `from node.source.file import *` (for helper functions)
- Support other platforms
- Resolving memory datasets (can be ignored fully)
- Update default node name - will have to go in next major release, needs to be unique. More discussion required.
Sad I couldn't join the design meeting but this looks great - thanks @noklam for initial innovation which now feels like a tangible reality and also @AhdraMeraliQB for making the decisions made easy to grok!
One salient point I just realized about unpacking the function body and transforming `return` statements into `print` or `display` - we would be at risk of changing the logic of the code.
A silly example:

```python
def foo():
    if True:
        return 1
    subprocess.check_output("rm -rf /")  # but `True is True` always right??
```

so `foo()` will absolutely always return `1`.

However:

```python
if True:
    display(1)
subprocess.check_output("rm -rf /")  # but `True is True` always right??
```

will nuke your filesystem.
Could we reconsider my proposal above
3d. Do not unpack the function body, and retain the function definition (see also point above)
? I think we never got to discuss it during the call (my fault, the conversation drifted exactly at that point and I forgot to bring it up)
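Suggestion 3d can be demonstrated concretely: keeping the definition intact and appending a call preserves the early `return`, unlike unpacking the body. The `foo_output` name is illustrative, not part of any proposed API:

```python
# Sketch of suggestion 3d: retain the function definition and append a call,
# instead of unpacking the body. The early return is preserved, so the
# statement after it can never execute.
NODE_SOURCE = '''\
def foo():
    if True:
        return 1
    raise RuntimeError("never reached")
'''

# The generated cell: the untouched definition plus a call capturing the output
cell = NODE_SOURCE + "\nfoo_output = foo()\n"
namespace = {}
exec(cell, namespace)
print(namespace["foo_output"])  # 1
```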
Any thoughts about https://github.com/kedro-org/kedro/issues/3535#issuecomment-1919690287 @noklam ?
This was addressed already, apologies 👍🏽
Closing this as this is released in 0.19.3. Follow-up improvement discussion will be documented here: https://github.com/kedro-org/kedro/issues/3580