[Parent] - Notebook/IPython Debug line magic `%load_node` feature discussion thread
Description
This ticket is created to track and discuss the development of this feature.
What does this feature do?
In short, it tries to recreate the context of a pipeline error by eliminating manual steps as much as possible to provide a better UX. It loads the data, imports, and function body into notebook cells, allowing users to quickly start debugging, plotting the data, etc.
Tracked Tickets:
- [ ] #3510 is the initial PR for the implementation
- [x] #3585
- [x] #2011
- [x] #3536
- [ ] #3528 - Add e2e test, blocked by an issue where running the notebook programmatically yields different results.
Comments
I have collected some comments from a few users and these are some quotes categorised by themes.
IPython
> But I agree, for the IPython version, since you're likely in your IDE, there's no need to bring in the function body. You can open the right file and dev there.

> Hi, very nice! Is it something that could work in `kedro ipython` directly as well?
Support more platform
- Supported platforms:
  - Mainly Notebook & Lab
  - VS Code Notebook (not yet)
  - Databricks (not tested)
Two-way sync
> Would be handy to debug, but you should do this in the IDE instead of a notebook. If you change the code, you still have to manually sync the changes to VCS.

> Looks awesome @Nok. Would there be a path back from notebook into the project's files (as it was with `kedro jupyter convert`)?
Recursive definition of function body
Also how would this work if your node function imports other functions which may import other functions? You only show where depth is 1?
Edge Case improvement
- Handle imports that are not at the top of the script, i.e. functions defined in the same file as the node function, which live in `locals()`.
- Intermediate MemoryDataset
> As a stretch goal, a runner like `--runner=SequentialDebug` where the last failed node spins up a notebook with the function body and data available to work out why it failed 🚀
This is actually a very cool idea - alternatively, instead of a notebook it could be a (pdb) session
`ipdb`!
Context for Tech Design - 31/1/24
Feature motivation - #1832
To summarise, debugging Kedro projects in notebooks is a painful process. One of the suggestions to make this easier for users was the proposal of a `%load_node` line magic, which would load a node from a user's Kedro project pipeline into notebook cells for easier debugging. The current MVP for this feature lives in #3510.
Steps to test the MVP
- Clone the PR and install your local Kedro version
- Navigate to an existing test project, or to a directory where you wish to create one and run `kedro new`. Make sure to select example code if not using a starter.
- Navigate into your project root
- Run `kedro jupyter lab`
- In the notebook, test the line magic with `%load_node <node-name>`
The current implementation
The line magic currently will load any imports in the source file of the node, the node's inputs (assumed to be declared in the catalog), and the body of the node function.
This serves as an excellent starting point; the implementation successfully resolves:
- Nested functions
- One line functions
- Messy/scattered import statements
However, it does not account/allow for:
- References to objects out of node function scope but within the source file
- MemoryDataset inputs
- `kedro ipython` and other platforms
- Unnamed nodes (a node name is not required when defining a node)
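The loading described above can be sketched with the standard `ast` module. This is an illustrative sketch only, not Kedro's actual implementation; `NODE_SOURCE` and `split_for_cells` are made-up names:

```python
import ast
import textwrap

# A toy node source file; pandas is only parsed as text, never imported here.
NODE_SOURCE = '''\
import pandas as pd

def preprocess(raw_df):
    cleaned = raw_df.dropna()
    return cleaned
'''

def split_for_cells(source, func_name):
    """Split a node's source into the pieces %load_node puts into cells."""
    tree = ast.parse(source)
    # Collect all top-level import statements, wherever they appear
    imports = [ast.get_source_segment(source, stmt) for stmt in tree.body
               if isinstance(stmt, (ast.Import, ast.ImportFrom))]
    func = next(stmt for stmt in tree.body
                if isinstance(stmt, ast.FunctionDef) and stmt.name == func_name)
    # Everything after the `def` line, dedented so it can run in a cell
    lines = source.splitlines()
    body = textwrap.dedent(
        "\n".join(lines[func.body[0].lineno - 1:func.end_lineno]))
    return imports, body

imports, body = split_for_cells(NODE_SOURCE, "preprocess")
print(imports)
print(body)
```

Note that the extracted body still contains the `return` statement, which is exactly edge case 3 discussed below.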
This brings us to the points of our discussion.
What makes a complete feature?
The focus of this discussion is to iron out the expectations for the first version of this feature, and possible future extensions. There are four primary themes to consider. For each, we must determine whether a particular use case should be addressed now, later, much later, or never. Additionally, if something is to be done later, much later, or never, what will we do in the interim?
Consider: Extending the current implementation
Several edge cases are not handled by the line magic:
1. If a node input is a MemoryDataset
The current implementation assumes all inputs exist in the catalog, and running the cell that loads the node inputs will result in an error if an input is not defined there. We have several options when addressing this:
a. Ignore it - expect users to understand the error. Pro: No technical requirement. Con: Invites friction and confusion.
b. Add a comment to the imports block explaining the above. Pro: No technical requirement. Con: Requires users to edit their project catalog to continue debugging with the feature.
c. Resolve it with the line magic. Pro: Frictionless for users. Con: Technically challenging - requires its own investigation into how this could be implemented.
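Option (b) could look roughly like the following sketch; `load_inputs` and the dict-based catalog are stand-ins for the real `DataCatalog` API, not Kedro code:

```python
# Hypothetical sketch of option (b): instead of failing, emit an explanatory
# note for inputs missing from the catalog (likely intermediate MemoryDatasets).
def load_inputs(catalog, input_names):
    loaded = {}
    for name in input_names:
        if name in catalog:  # stand-in for a real DataCatalog lookup
            loaded[name] = catalog[name]
        else:
            print(f"# '{name}' is not in the catalog (likely a MemoryDataset); "
                  "run the upstream node to produce it, or declare it in the catalog.")
    return loaded

demo_catalog = {"raw_df": [1, 2, 3]}
inputs = load_inputs(demo_catalog, ["raw_df", "intermediate_df"])
print(sorted(inputs))
```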
2. Node references an object defined within the source-file but outside of node scope
Including only imports and the node function means we don't have access to any helper functions defined inside the file but outside of the node function. Any references to these when the node function is loaded will result in an error.
a. Ignore it - expect users to understand the error. Pro: No technical requirement. Con: Invites friction and confusion
b. Add `from <node.source.module> import *`. Pro: Frictionless for users. Con: Technically challenging - requires its own investigation into how this could be implemented.
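As a sketch, option (b) amounts to generating one extra import line from the node's module path. The module path below is a made-up example, not from the source:

```python
# Sketch of option (b): emit a wildcard import of the node's source module so
# helper functions defined alongside the node become available in the notebook.
def wildcard_import_line(module_path):
    return f"from {module_path} import *"

# Hypothetical module path, for illustration only
line = wildcard_import_line("my_project.pipelines.data_processing.nodes")
print(line)
```

The hard part (the "Con" above) is reliably resolving the node function's importable module path at runtime, not generating the line itself.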
3. Return statements make unrunnable cells
When loading a node function, we omit the defining line. Including return statements then results in a syntax error when the cell is run.
a. Ignore it - expect users to understand the error. Pro: No technical requirement. Con: Invites friction and confusion
b. Comment out the return statement. Pro: Simple. Con: Introduces syntax errors if returns are within conditionals, see:

```python
x = yyy
if some_flag:
    # return x
else:
    # return None
```
c. Comment out the return statement and introduce a print statement as a replacement. Pro: Simple to introduce - exists on #3568, eliminates errors. Con: Modifies the node; the user will have to fix this if copying the code back into their project files.
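A naive line-based version of option (c) might look like the sketch below. This is not the actual change in #3568; a real implementation would likely need an AST-based rewrite to handle strings and nested scopes:

```python
import re

# Sketch of option (c): swap `return <expr>` for `print(<expr>)` so the cell
# stays runnable. Because the statement is replaced rather than commented out,
# conditional branches keep a valid body and still parse.
def replace_returns(body):
    return re.sub(r"^(\s*)return\b\s*(.*)$", r"\1print(\2)", body, flags=re.M)

body = "x = compute()\nif some_flag:\n    return x\nelse:\n    return None\n"
print(replace_returns(body))
```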
Consider: Supporting Other Platforms
#3510 only supports Jupyter Lab/Notebook. #3536 introduces support for `kedro ipython` (support across Notebook/Lab and most IDEs and platforms, except Databricks). What should we do about:
- Databricks notebooks
- VS Code
And if we choose not to support any platforms in the current iteration, how do we inform users? (Multiple selections allowed)
- Docs
- Release Notes
- Warn when trying to use the line magic in unsupported platform
Consider: Supporting the node-finding process by making changes to the default node name
As mentioned briefly above, node names are not required when defining a node, and are often omitted by users. However, the current implementation of the line magic searches for the specified node by filtering the pipeline for matching node names, so if the name is not defined, `%load_node` will not be able to find the node.
We could address this by:
- Mentioning this in the `ValueNotFoundError`. Pro: Trivial to add. Makes user aware of the problem. Con: Requires user to modify project files to continue debugging in notebook.
- Searching all `node.func.__name__` if the node name is not found - see #3568. Pro: Solves problem. Easy to implement. Con: Ugly complexity. Behaviour diverges from `kedro run --nodes/--to-nodes/--from-nodes`.
- Modifying the default node name - see #3575. Pro: Solves issue across Kedro, not just for the line magic. Con: Technically breaking. Can we ship these changes together?
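The second option (the #3568 fallback) can be sketched as follows; `DummyNode` and `find_node` are illustrative stand-ins for Kedro's `Node` class and the magic's lookup, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class DummyNode:  # stand-in for kedro.pipeline.node.Node
    name: Optional[str]
    func: Callable

def find_node(nodes, target):
    # Existing behaviour: match on the explicit node name
    by_name = [n for n in nodes if n.name == target]
    if by_name:
        return by_name[0]
    # Fallback: match on the wrapped function's __name__, if unique
    by_func = [n for n in nodes if n.func.__name__ == target]
    if len(by_func) == 1:
        return by_func[0]
    raise ValueError(f"No unique node found for '{target}'")

def preprocess():
    pass

nodes = [DummyNode(name=None, func=preprocess)]
found = find_node(nodes, "preprocess")
print(found.func.__name__)
```

The "ugly complexity" noted above shows up when two unnamed nodes wrap functions with the same `__name__`: the fallback then has no unique answer.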
[Extra] Consider: The future of notebook debugging
There were some suggestions for the future of this feature, for completeness I am listing them here
- 2-way conversion (`%load_node` and `%save_node`?) - do we bring back `kedro jupyter convert`? xref
- A `SequentialDebug` runner - xref
- Populating problematic functions when an error results from a function call - xref
References: main PR: #3510, original discussion: #1832, experiments: #3568, related: #3575, #3536
My 2 cents:
Extending the current implementation
- 2c. Bringing the full contents of the corresponding file, including the function definitions themselves
- 3d. Do not unpack the function body, and retain the function definition (see also point above)
Supporting Other Platforms
Frankly I'd ignore `kedro ipython`, `kedro jupyter` and shift the focus to "make this work where `%load_ext kedro.ipython` works". This doesn't work for pure REPLs, and also it's unclear how other non-Jupyter notebooks would behave (they're all mostly based on IPython, but the frontend code might differ significantly). But the point is that I wouldn't consider `kedro ipython/jupyter` a platform because of https://github.com/kedro-org/kedro/issues/2777
The future of notebook debugging
I would say let's not bring back `kedro jupyter convert`, it didn't work back then and there's no reason it will work now.
Having a `%save_node` is promising, but maybe we should broaden our perspective and have a `%save_code` that is more generic? Note that this is a big, unsolved problem in the Jupyter <-> IDE interaction as a whole.
Extending the current implementation
- Nit on 3c: should it be a `display()` call?
- @astrojuanlu's suggestion 3d also makes sense to me
Supporting Other Platforms
For a lot of these edge cases, as well as for other platforms, I think I'm OK with having some friction (and perhaps marking the feature as experimental for now). I think there's a risk of going too deep in this feature, without understanding how many people actually use it/how they use it.
Tech Design Summary
Now (part of #3510):
- [x] Create parent issue to track "later" actions and to be referenced in universal warning (below): #3580
- [x] Warn users when preparing node inputs that inputs need to be explicitly declared in the catalog
- [x] ~~Handle return statements by commenting out and substituting `print`/`display()`~~ Include function definition and function call with node inputs - see comment below for why
- [x] Add a line when node not found about node name vs node function
- [x] Add a platform-agnostic warning that this is an experimental feature and only supports Jupyter and IPython
Next:
- Make sure platform support is well documented - #2011
- IPython support - #3536
Later
- `from node.source.file import *` (for helper functions)
- Support other platforms
- Resolving memory datasets (can be ignored fully)
- Update default node name - will have to go in next major release, needs to be unique. More discussion required.
Sad I couldn't join the design meeting but this looks great - thanks @noklam for initial innovation which now feels like a tangible reality and also @AhdraMeraliQB for making the decisions made easy to grok!
One salient point I just realized about unpacking the function body and transforming `return` statements into `print` or `display` - we would be at risk of changing the logic of the code.
A silly example:

```python
def foo():
    if True:
        return 1
    subprocess.check_output("rm -rf /")  # but `True is True` always right??
```

so `foo()` will absolutely always return `1`.

However:

```python
if True:
    display(1)
subprocess.check_output("rm -rf /")  # but `True is True` always right??
```

will nuke your filesystem.
Could we reconsider my proposal above
3d. Do not unpack the function body, and retain the function definition (see also point above)
? I think we never got to discuss it during the call (my fault, the conversation drifted exactly at that point and I forgot to bring it up)
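Suggestion 3d can be demonstrated concretely: keeping the definition intact and appending a call preserves the early `return`, unlike unpacking the body. The `foo_output` name is illustrative, not part of any proposed API:

```python
# Sketch of suggestion 3d: retain the function definition and append a call,
# instead of unpacking the body. The early return is preserved, so the
# statement after it can never execute.
NODE_SOURCE = '''\
def foo():
    if True:
        return 1
    raise RuntimeError("never reached")
'''

# The generated cell: the untouched definition plus a call capturing the output
cell = NODE_SOURCE + "\nfoo_output = foo()\n"
namespace = {}
exec(cell, namespace)
print(namespace["foo_output"])  # 1
```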
Any thoughts about https://github.com/kedro-org/kedro/issues/3535#issuecomment-1919690287 @noklam ?
This was addressed already, apologies 👍🏽
Closing this as this is released in 0.19.3. Follow-up improvement discussion will be documented here: https://github.com/kedro-org/kedro/issues/3580