Low performance of pipeline sums

Open marrrcin opened this issue 1 year ago • 1 comments

Description

I have a project with a huge number of pipelines generated programmatically (in a loop). Generating those pipelines takes a lot of time, and the time appears to grow quadratically with the number of pipelines (see the chart below).

[Plot: time in seconds against n, the number of pipelines to sum]

The problem has 2 variants:

  1. Large number of small pipelines
  2. Small number of pipelines with large node count (200+).
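
One way repeated addition can produce this kind of growth can be sketched with a toy model. It assumes that every addition builds a new pipeline object whose construction cost is proportional to its total node count. `ToyPipeline` and `work_for_sum` are hypothetical names for illustration only, not Kedro's classes, and the cost model is an assumption rather than a measurement.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline; NOT Kedro's Pipeline class."""

    nodes_processed = 0  # class-wide counter of nodes handled by __init__

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Assumption: construction does work proportional to the node count
        # (for example validation or sorting of all nodes).
        ToyPipeline.nodes_processed += len(self.nodes)

    def __add__(self, other):
        # Each addition builds a brand-new object from all nodes so far.
        return ToyPipeline(self.nodes + other.nodes)

    def __radd__(self, other):
        # Lets the built-in sum() start from its default value 0.
        if other == 0:
            return self
        return NotImplemented


def work_for_sum(n_pipelines, nodes_per_pipeline):
    """Count nodes processed by constructors while summing the pipelines."""
    parts = [ToyPipeline(range(nodes_per_pipeline)) for _ in range(n_pipelines)]
    ToyPipeline.nodes_processed = 0  # ignore the cost of building the parts
    sum(parts)
    return ToyPipeline.nodes_processed
```

In this model, summing 10 single-node pipelines processes 2 + 3 + ... + 10 = 54 nodes, and summing 20 processes 209, so doubling the number of pipelines roughly quadruples the work.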

Context

While Kedro encourages keeping nodes small and pipelines modular, extensive use of both approaches leads to slow project startup times.

The most severe impact of this issue is in mono-repo setups, where multiple teams work in the same project but on separate pipelines - in such setups the number of pipelines grows quickly as the development proceeds.

Steps to Reproduce

  1. Create a project from spaceflights starter.
  2. Change data_processing pipeline to:
def create_pipeline(**kwargs) -> Pipeline:
    data_engineering_pipeline = pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ] + [node(
            func=lambda x: print("YOLO", x),
            inputs="parameters",
            outputs=f"yolo_{i}",
            name=f"yolo_{i}"
        ) for i in range(200)]
    )

    # Poor man's performance test
    import time
    pipelines = []
    MAX = 60
    for i in range(MAX + 1):
        pipelines.append(
            pipeline(
                data_engineering_pipeline,
                inputs={"companies": "companies",
                        "shuttles": "shuttles",
                        "reviews": "reviews"},
                namespace=f"namespace_{i}",
            )
        )
    data = []
    for n in range(1, MAX, 10):
        start = time.monotonic()
        _ = sum(pipelines[:n])
        end = time.monotonic()
        print(f"Sum of {n} pipelines took: {end - start:0.3f}s")
        data.append((n, end - start))


    # uncomment to output chart / data
    # import pandas as pd
    # df = pd.DataFrame(data, columns=["n", "time"])
    # df.plot.scatter(x="n", y="time").get_figure().savefig("plot.png")
    return sum(pipelines)
  3. Run kedro registry list

Expected Result

Pipelines are listed quickly.

Actual Result

The pipelines are listed after a few minutes (depending on the number of pipelines/nodes), with the time increasing quadratically (see the chart above).
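
To check the "quadratic" reading of the timings numerically rather than by eye, a log-log least-squares fit over the collected `(n, time)` pairs gives an estimated growth exponent. This is a minimal sketch; `estimate_growth_exponent` is a hypothetical helper, and it assumes the data has the same shape as the `data` list built in the reproduction code.

```python
import math


def estimate_growth_exponent(data):
    """Return the least-squares slope of log(time) against log(n).

    A slope near 1 suggests linear growth, near 2 quadratic growth.
    Points with non-positive n or time are skipped because log() is
    undefined for them (very fast runs can report a time of 0.0).
    """
    points = [(math.log(n), math.log(t)) for n, t in data if n > 0 and t > 0]
    if len(points) < 2:
        raise ValueError("need at least two points with positive n and time")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("need at least two distinct values of n")
    return sxy / sxx
```

With only a handful of measurements the estimate is noisy, so it is best treated as a rough indicator and not as proof of the complexity class.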

Possible causes

The main problem is that, internally, summing pipelines goes through __add__ and then __init__ in the Pipeline class. The slowness of the operations inside __add__ itself is partially addressed by #3146, but the problem with __init__ remains. The calls to _topologically_sorted in the constructor may be the root cause, but confirming that would require more detailed profiling.
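
The more detailed profiling could start from the standard library's `cProfile`. The sketch below uses a hypothetical helper, `profile_call`, which is not part of Kedro: it runs any callable under the profiler and returns a text report sorted by cumulative time.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns a tuple (result, report), where report is a text table of
    the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

In the reproduction code, something like `_, report = profile_call(sum, pipelines)` followed by `print(report)` should show whether the time is dominated by the constructor and the functions it calls.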

Your Environment

  • Kedro version used: 0.18.13
  • Python version used: 3.10.13
  • Operating system and version: macOS 13.0.1

marrrcin, Oct 12 '23 11:10

Was this fully addressed by #3730?

astrojuanlu, May 13 '24 08:05

I hope so 🤞🏻

marrrcin, May 13 '24 11:05