[PARENT] %load_node line magic improvements
Description
In #3510 we introduce a new line magic, aimed at improving the process of debugging Kedro projects in notebooks. This feature is experimental, and this issue should be used to add suggestions for extending and improving it. Add a suggestion in the comments or, if it has already been mentioned, bump its priority with a 👍.
(edited by Nok)
- [ ] Add support for other platforms. Notebook, Lab, VS Code and IPython are supported now; Databricks and others have minimal support
- [ ] Add import statement to import * from node source file - allows nodes with helper functions to be runnable in notebooks
- [ ] Handle MemoryDataset better, or guide the user to run the node needed to persist the data
- [ ] #3629
- [ ] Consider before_node_run, after_node_run
Add support for other platforms. Currently only Jupyter Lab/Notebook (#3510) and IPython (#3536) are supported. Consider including:
- Databricks
- VSCode
Add import statement to import * from node source file - allows nodes with helper functions to be runnable in notebooks without having to go back to source files and copy paste the code over
Edited by Nok below:
If we can use inspect.getsourcefile, we can import the module directly with importlib, then use from <module> import * to make sure everything is loaded.
https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly
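A minimal sketch of that idea, using only the standard library. The helper names `load_module_of` and `copy_star_names` and the module name `_load_node_source` are hypothetical, not part of Kedro; the sketch also assumes the node's source file can be executed on its own (a file that relies on relative imports would likely fail when loaded this way).

```python
import importlib.util
import inspect


def load_module_of(func, module_name="_load_node_source"):
    """Import the file that defines `func` as a fresh module object.

    `module_name` is an arbitrary placeholder chosen for this sketch.
    """
    source_file = inspect.getsourcefile(func)
    if source_file is None:
        raise ValueError(f"Cannot locate a source file for {func!r}")
    spec = importlib.util.spec_from_file_location(module_name, source_file)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot build an import spec for {source_file}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module


def copy_star_names(module, namespace):
    """Copy the names that `from <module> import *` would bring in."""
    names = getattr(module, "__all__", None)
    if names is None:
        # Without __all__, a star import skips underscore-prefixed names.
        names = [name for name in vars(module) if not name.startswith("_")]
    for name in names:
        namespace[name] = getattr(module, name)
    return namespace
```

One thing to keep in mind: a star import skips names that start with an underscore unless the module lists them in `__all__`, so private helpers would still need to be copied explicitly (for example by iterating over `vars(module)` without the underscore filter).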
Resolve MemoryDatasets so that users don't have to add them to catalog to access them as node inputs
Cc @DimedS, it works well on my side! This new functionality is fantastic! It would be helpful to document how to use this new command in different environments. Additionally, perhaps we should consider loading related helper functions when loading a specific node. Currently, in spaceflights-pandas, a node is loaded but cannot be run because it raises a NameError: name '_is_true' is not defined error, indicating that the function _is_true was not loaded. https://github.com/kedro-org/kedro/pull/3604#pullrequestreview-1882976670
I found a couple of things to improve while helping a user debug on 0.18.x:
- When running %load_node in IPython, if the code block is long enough, the top of the code block "disappears" unless you hit the arrow. At first I thought the function was broken. This is confusing because the function most likely will not run (assuming there is an error that the user wants to debug), and the user will not see the variable declarations in the terminal, for example:
data_a = catalog.load("xxx")
- It seems broken if the function comes from a "wheel" or is somehow wrapped. I cannot reproduce this yet, but the error is "FileNotFoundError: [Errno 2] No such file or directory: '<boltons.funcutils.FunctionBuilder-115>'"
- Maybe prepare a standalone script so that we can test on 0.18.x
- The generated function call always uses the full function signature, which is a problem if some of the parameters are optional.
For example, given def node_func(a, b, c=None): return ..., it should be valid to have a node node(node_func, inputs=["data_a", "data_b"], ...). Currently the resulting code block is
node_func(a,b,c)
This raises an error because c is not defined. It works as long as c is deleted from the resulting code block.
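One possible way to avoid this, sketched here with a hypothetical helper `build_call_statement` (not the current %load_node implementation): keep only as many parameters as the node supplies inputs, and check that every omitted parameter has a default. It assumes the node inputs are given as a list.

```python
import inspect


def build_call_statement(func, node_inputs):
    """Build the call line using only the parameters the node provides."""
    params = list(inspect.signature(func).parameters.values())
    used = params[: len(node_inputs)]
    omitted = params[len(node_inputs):]

    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    for param in omitted:
        # An omitted parameter is fine only if it is optional.
        if param.default is inspect.Parameter.empty and param.kind not in variadic:
            raise TypeError(f"No input provided for required parameter {param.name!r}")

    return f"{func.__name__}({', '.join(param.name for param in used)})"


def node_func(a, b, c=None):
    return a, b, c


print(build_call_statement(node_func, ["data_a", "data_b"]))  # node_func(a, b)
```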
Better handling of *args and **kwargs: currently %load_node uses simple logic to map a node's inputs to function parameters. The idea is to use inspect.Signature.bind and inspect.Parameter to identify the special arguments (VAR_POSITIONAL).
For example:
def dummy(a,b,c, *args, **kwargs):
...
node(dummy, ["data_1", "data_2", "data_3", "dummy1","dummy2","dummy3"], ...)
should translate to
a = catalog.load("data_1")
b = catalog.load("data_2")
c = catalog.load("data_3")
dummy1 = catalog.load("dummy1")
dummy2 = catalog.load("dummy2")
dummy3 = catalog.load("dummy3")
args = [dummy1, dummy2, dummy3] # Note: the names of the "dummy_x" variables are arbitrary
dummy(a, b, c, *args)
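The translation above could be generated roughly as follows. This is a sketch, not the actual %load_node code: `generate_node_code` is a hypothetical helper, it assumes the node inputs are a list, and it reuses the dataset names as variable names for the *args entries, which only works when those names are valid Python identifiers.

```python
import inspect


def generate_node_code(func, dataset_names):
    """Render catalog.load(...) lines and the final call for a node function."""
    signature = inspect.signature(func)
    # bind() raises TypeError if the inputs do not fit the signature.
    bound = signature.bind(*dataset_names)

    lines = []
    call_args = []
    for param_name, value in bound.arguments.items():
        kind = signature.parameters[param_name].kind
        if kind is inspect.Parameter.VAR_POSITIONAL:
            # `value` is a tuple holding the surplus dataset names.
            for dataset in value:
                lines.append(f'{dataset} = catalog.load("{dataset}")')
            lines.append(f"{param_name} = [{', '.join(value)}]")
            call_args.append(f"*{param_name}")
        else:
            lines.append(f'{param_name} = catalog.load("{value}")')
            call_args.append(param_name)

    lines.append(f"{func.__name__}({', '.join(call_args)})")
    return "\n".join(lines)


def dummy(a, b, c, *args, **kwargs):
    ...


print(generate_node_code(
    dummy, ["data_1", "data_2", "data_3", "dummy1", "dummy2", "dummy3"]
))
```

Because bind() leaves out parameters that received no value, the same approach also drops unused optional parameters from the generated call.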
Consider running the before_node_run and after_node_run hooks. If a user mutates the inputs with hooks, the current logic does not reflect that. For example, some users have a custom ConfigLoader that only instantiates objects in before_node_run, so catalog.load returns only a dict object rather than the instantiated class.