
Allow for specifying extra node dependencies

Open lvijnck opened this issue 7 months ago • 2 comments

Description

I've always felt that Kedro lacks a way to specify additional dependencies between nodes that are not dataset-related.

Context

For instance, consider the problem of filling a knowledge graph through Kedro. Obviously, there are two main nodes:

  1. Write nodes
  2. Write edges

However, the edges cannot be written before the nodes have been pushed. There is hence no "dataset" dependency between the two Kedro nodes, but rather an execution dependency.

Possible Implementation

Adding this to Kedro would involve 1) an addition to the node system and 2) an update to the topological execution mechanism. With respect to the nodes, dependencies could be specified as follows:

from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    """Create embeddings pipeline."""
    return pipeline(
        [
            node(
                func=write_nodes,
                inputs=["int.nodes"],
                outputs="prm.nodes",
                name="write_nodes",
            ),
            node(
                func=write_edges,
                inputs=["int.edges"],
                outputs="prm.edges",
                name="write_edges",
                # Proposed argument (not part of Kedro today):
                # run this node only after "write_nodes" has finished.
                dependencies=["write_nodes"],
            ),
        ]
    )
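To make the second part of the proposal concrete, here is a rough sketch of how an execution order could take both dataset dependencies and explicit dependencies into account. It does not use Kedro's internals: the `NODES` dictionary, the `execution_order` helper, and the `dependencies` key are all hypothetical stand-ins, and the ordering relies on Python's standard-library `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified node descriptions (not Kedro's real Node class).
NODES = {
    "write_nodes": {
        "inputs": ["int.nodes"],
        "outputs": ["prm.nodes"],
        "dependencies": [],
    },
    "write_edges": {
        "inputs": ["int.edges"],
        "outputs": ["prm.edges"],
        "dependencies": ["write_nodes"],  # execution-only dependency
    },
}


def execution_order(nodes):
    """Return node names in an order that respects both kinds of dependency."""
    # Map every dataset to the node that produces it.
    producers = {
        output: name for name, spec in nodes.items() for output in spec["outputs"]
    }

    graph = {}
    for name, spec in nodes.items():
        # Predecessors implied by datasets: whoever produces one of our inputs.
        predecessors = {producers[i] for i in spec["inputs"] if i in producers}

        # Predecessors declared explicitly by node name.
        explicit = set(spec["dependencies"])
        unknown = explicit - set(nodes)
        if unknown:
            raise ValueError(f"{name} depends on unknown nodes: {sorted(unknown)}")

        graph[name] = predecessors | explicit

    # TopologicalSorter expects {node: predecessors} and raises CycleError
    # if the combined dependencies contain a cycle.
    return list(TopologicalSorter(graph).static_order())
```

With the two nodes above, `execution_order(NODES)` places `write_nodes` before `write_edges` even though they share no dataset. Treating explicit dependencies as extra edges in the same graph means cycle detection keeps working unchanged.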

Possible Alternatives

The current work-around is to add "artificial" dataset dependencies among the nodes. This has the drawback that the function signatures of those nodes are polluted with arguments they never use.
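For illustration, the work-around might look roughly like the sketch below. The function bodies and the `written` list (a stand-in for the graph database) are made up for this example; only the node and dataset names come from the snippet above.

```python
written = []  # hypothetical stand-in for the knowledge graph being filled


def write_nodes(nodes):
    """Push graph nodes and return them so the result can act as "prm.nodes"."""
    written.extend(("node", n) for n in nodes)
    return nodes


def write_edges(edges, _nodes_written):
    """Push graph edges.

    `_nodes_written` is never used. It exists only so that "prm.nodes" can be
    listed as an input, which forces this node to run after write_nodes.
    """
    written.extend(("edge", e) for e in edges)
    return edges


# In the Kedro pipeline the wiring would then be roughly:
#   node(write_nodes, inputs="int.nodes", outputs="prm.nodes", name="write_nodes")
#   node(write_edges, inputs=["int.edges", "prm.nodes"], outputs="prm.edges",
#        name="write_edges")
```

The extra `_nodes_written` parameter is exactly the signature pollution described above: it carries no information the function needs.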

lvijnck · Jul 04 '24 13:07