[PARENT] %load_node line magic improvements

Open AhdraMeraliQB opened this issue 1 year ago • 7 comments

Description

In #3510 we introduced a new line magic, %load_node, aimed at improving the process of debugging Kedro projects in notebooks. This feature is experimental; use this issue to suggest ways of extending and improving it. Add a suggestion in the comments or, if it has already been mentioned, bump its priority with a 👍.

(edited by Nok)

AhdraMeraliQB avatar Jan 31 '24 16:01 AhdraMeraliQB

Add support for other platforms. Currently only Jupyter Lab/Notebook (#3510) and IPython (#3536) are supported. Consider including:

  • Databricks
  • VSCode

AhdraMeraliQB avatar Jan 31 '24 16:01 AhdraMeraliQB

Add an import statement that imports * from the node's source file. This lets nodes with helper functions run in notebooks without having to go back to the source files and copy-paste the code over.

Edited by Nok below: if we can use inspect.getsourcefile, we can import the module directly with importlib, then use from <module> import * to make sure everything is loaded. https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly
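
A minimal sketch of that idea, using only the standard library. The helper name load_module_namespace and the demo file nodes.py are hypothetical and not part of Kedro. One caveat worth noting: a plain from <module> import * skips names that start with an underscore unless the module lists them in __all__, so a private helper such as _is_true would still be missing. This sketch copies the module's namespace explicitly instead.

```python
import importlib.util
import inspect
import tempfile
from pathlib import Path


def load_module_namespace(func, namespace):
    """Import the file that defines `func` and copy its module-level names into `namespace`."""
    source_file = inspect.getsourcefile(func)
    # Dynamically generated functions can report a pseudo-filename that is not on disk
    if source_file is None or not Path(source_file).is_file():
        raise FileNotFoundError(f"No source file on disk for {func!r}")

    # Hypothetical module name; any unused name works here
    spec = importlib.util.spec_from_file_location("_load_node_source", source_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # Copy everything except dunder attributes, including underscore-prefixed helpers
    for name, value in vars(module).items():
        if not (name.startswith("__") and name.endswith("__")):
            namespace[name] = value
    return module


# Demo: a throwaway source file with a node function and a private helper
source_path = Path(tempfile.mkdtemp()) / "nodes.py"
source_path.write_text(
    "def _is_true(x):\n"
    "    return x == 't'\n"
    "\n"
    "def preprocess(flags):\n"
    "    return [_is_true(f) for f in flags]\n"
)

# Import it once to obtain a function object, as a pipeline node would hold
demo_spec = importlib.util.spec_from_file_location("demo_nodes", source_path)
demo_nodes = importlib.util.module_from_spec(demo_spec)
demo_spec.loader.exec_module(demo_nodes)

# Stand-in for the notebook's globals()
notebook_namespace = {}
load_module_namespace(demo_nodes.preprocess, notebook_namespace)
```

In a notebook, passing globals() as the namespace would make the helpers available to the pasted function body.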

AhdraMeraliQB avatar Jan 31 '24 16:01 AhdraMeraliQB

Resolve MemoryDatasets so that users don't have to add them to catalog to access them as node inputs

AhdraMeraliQB avatar Jan 31 '24 16:01 AhdraMeraliQB

If we can use inspect.getsourcefile, we can import the module directly with importlib, then use from <module> import * to make sure everything is loaded. https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly Cc @DimedS

It works well on my side! This new functionality is fantastic! It would be helpful to include instructions in the documentation on how to use this new command in different environments. Additionally, perhaps we should consider loading related functions when loading specific nodes. Currently, in spaceflights-pandas, a node is loaded but cannot be run because it raises NameError: name '_is_true' is not defined, indicating that the function _is_true was not loaded. https://github.com/kedro-org/kedro/pull/3604#pullrequestreview-1882976670

noklam avatar Feb 15 '24 16:02 noklam

I found a couple of things to improve while helping a user debug on 0.18.x:

  • When running %load_node in IPython, if the code block is long enough, the top of the code block "disappears" unless you hit the arrow key. At first I thought the function was broken. This is confusing because the function will most likely not run (assuming there is an error that the user wants to debug), and the user will not see the variable declarations in the terminal, e.g. data_a = catalog.load("xxx").
  • It seems to break if the function comes from a "wheel" or is somehow wrapped. I cannot reproduce this yet, but the error is "FileNotFoundError: [Errno 2] No such file or directory: '<boltons.funcutils.FunctionBuilder-115>'".
  • Maybe prepare a standalone script so that we can test on 0.18.x.
  • The generated function call always includes the full function signature, which is problematic if some of the parameters are optional.

For example: def node_func(a, b, c=None): return ...

It should be valid to have a node node(node_func, inputs=["data_a", "data_b"], ...). Currently the resulting code block is

node_func(a,b,c)

This raises an error because c is not defined. It works as long as we delete c from the resulting code block.
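
One possible way to avoid this, sketched under the assumption that the node's inputs are given as a list and map to parameters by position: emit only as many parameter names as there are inputs, so trailing optional parameters keep their defaults. build_call is a hypothetical helper, not the actual %load_node implementation.

```python
import inspect


def build_call(func, inputs):
    """Return a call statement that names only the parameters backed by node inputs."""
    param_names = list(inspect.signature(func).parameters)
    # Inputs map to parameters by position; parameters without an input are left out
    used = param_names[: len(inputs)]
    return f"{func.__name__}({', '.join(used)})"


def node_func(a, b, c=None):
    return a, b, c


call_statement = build_call(node_func, ["data_a", "data_b"])
print(call_statement)  # node_func(a, b)
```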

noklam avatar Feb 15 '24 18:02 noklam

Better handling of *args and **kwargs. Currently %load_node uses simple logic to map a node's inputs to function parameters.

The idea is to use inspect.Signature.bind and inspect.Parameter to identify the special arguments (e.g. VAR_POSITIONAL).

For example:

def dummy(a,b,c, *args, **kwargs):
    ...

node(dummy, ["data_1", "data_2", "data_3", "dummy1","dummy2","dummy3"], ...)

should translate to

a = catalog.load("data_1")
b = catalog.load("data_2")
c = catalog.load("data_3")
dummy1 = catalog.load("dummy1")
dummy2 = catalog.load("dummy2")
dummy3 = catalog.load("dummy3")
args = [dummy1, dummy2, dummy3]  # Note: the names of the "dummy_x" variables are arbitrary
dummy(a, b, c, *args)
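
The translation above could be sketched with inspect.Signature.bind roughly as follows. generate_statements is a hypothetical helper, not the %load_node implementation. It assumes the node inputs are a list, and that dataset names collected by *args are valid Python identifiers (a name such as params:x would need sanitising first).

```python
import inspect


def generate_statements(func, inputs):
    """Build the load statements and the final call for a node with list inputs."""
    signature = inspect.signature(func)
    # bind raises TypeError if the inputs do not fit the signature
    bound = signature.bind(*inputs)

    lines = []
    call_args = []
    for name, value in bound.arguments.items():
        kind = signature.parameters[name].kind
        if kind is inspect.Parameter.VAR_POSITIONAL:
            # value is a tuple of the extra dataset names swallowed by *args
            for dataset in value:
                lines.append(f'{dataset} = catalog.load("{dataset}")')
            lines.append(f"{name} = [{', '.join(value)}]")
            call_args.append(f"*{name}")
        else:
            lines.append(f'{name} = catalog.load("{value}")')
            call_args.append(name)

    lines.append(f"{func.__name__}({', '.join(call_args)})")
    return lines


def dummy(a, b, c, *args, **kwargs):
    ...


statements = generate_statements(
    dummy, ["data_1", "data_2", "data_3", "dummy1", "dummy2", "dummy3"]
)
print("\n".join(statements))
```

Because bind only records the arguments that were actually supplied, the same approach also leaves out optional parameters that have no matching input.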

noklam avatar Feb 19 '24 11:02 noklam

Consider adding the before_node_run and after_node_run hooks. If a user mutates the inputs with hooks, the current logic does not reproduce that. For example, some users have a custom ConfigLoader that only instantiates objects in before_node_run, so catalog.load returns only a dict rather than the instantiated class.
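
For context, Kedro's before_node_run hook may return a dictionary that is used to update the node inputs. Below is a minimal, Kedro-free sketch of merging such overrides into the loaded inputs. collect_inputs, Model and instantiate_model are hypothetical stand-ins, and real hooks receive more arguments (the node, the catalog, and so on) than shown here.

```python
def collect_inputs(loaded_inputs, hooks):
    """Merge dict responses from before_node_run-style hooks over the loaded inputs."""
    overrides = {}
    for hook in hooks:
        # Each hook sees a copy of the original inputs
        response = hook(dict(loaded_inputs))
        if response is None:
            continue
        if not isinstance(response, dict):
            raise TypeError("hook must return None or a dict of input overrides")
        overrides.update(response)
    return {**loaded_inputs, **overrides}


class Model:
    def __init__(self, depth):
        self.depth = depth


def instantiate_model(inputs):
    # Hypothetical hook: turn a config dict into an object before the node runs
    return {"model": Model(**inputs["model"])}


# What catalog.load would return on its own: a plain dict for "model"
loaded = {"model": {"depth": 3}, "data": [1, 2, 3]}
node_inputs = collect_inputs(loaded, [instantiate_model])
```

With something along these lines, the code generated by %load_node could hand the node the instantiated object rather than the raw dict.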

noklam avatar Feb 19 '24 11:02 noklam