kedro
Low performance of pipeline sums
Description
I have a project in which a huge number of pipelines are generated programmatically (in a loop). Generating those pipelines takes a lot of time, and the time appears to grow quadratically with the number of pipelines (see the chart below).
[Chart: time (in seconds) plotted against n (the number of pipelines to sum)]
The problem has 2 variants:
- Large number of small pipelines
- Small number of pipelines with large node count (200+).
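Both variants can be illustrated with a toy cost model. This is only a sketch: ToyPipeline is a hypothetical stand-in, not Kedro's Pipeline class, and it assumes that every addition builds a new object whose construction touches each accumulated node once. Under that assumption, the cost of summing grows with both the number of pipelines and the number of nodes per pipeline.

```python
from functools import reduce


class ToyPipeline:
    """Hypothetical stand-in for a pipeline; NOT Kedro's Pipeline class."""

    work = 0  # class-level counter of simulated per-node construction steps

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Assumption: construction touches every node once (e.g. validation).
        ToyPipeline.work += len(self.nodes)

    def __add__(self, other):
        # Each addition builds a brand-new object from all accumulated nodes.
        return ToyPipeline(self.nodes + other.nodes)


def make_parts(count, nodes_each):
    """Create `count` toy pipelines with `nodes_each` nodes each."""
    return [ToyPipeline(range(nodes_each)) for _ in range(count)]


def cost_of_pairwise_sum(parts):
    """Simulated work when the parts are added one by one."""
    ToyPipeline.work = 0
    reduce(lambda a, b: a + b, parts)
    return ToyPipeline.work


def cost_of_single_build(parts):
    """Simulated work when all nodes are combined in one construction."""
    ToyPipeline.work = 0
    ToyPipeline(node for part in parts for node in part.nodes)
    return ToyPipeline.work


for count in (10, 20, 40):
    parts = make_parts(count, nodes_each=200)
    print(count, cost_of_pairwise_sum(parts), cost_of_single_build(parts))
```

In this model the pairwise sum costs roughly nodes_each * count^2 / 2 steps, while a single construction costs nodes_each * count. Kedro's Pipeline constructor is documented as accepting pipelines in its list of nodes, so building one pipeline from the whole list may be worth measuring, although whether that avoids the slowdown in a real project is not verified here.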
Context
Kedro encourages keeping nodes small and pipelines modular, but extensive use of both approaches leads to slow project startup times.
The impact is most severe in mono-repo setups, where multiple teams work in the same project but on separate pipelines; in such setups the number of pipelines grows quickly as development proceeds.
Steps to Reproduce
- Create a project from the spaceflights starter.
- Change the data_processing pipeline to:
def create_pipeline(**kwargs) -> Pipeline:
    data_engineering_pipeline = pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ] + [node(
            func=lambda x: print("YOLO", x),
            inputs="parameters",
            outputs=f"yolo_{i}",
            name=f"yolo_{i}"
        ) for i in range(200)]
    )
    # Poor man's performance test
    import time
    pipelines = []
    MAX = 60
    for i in range(MAX + 1):
        pipelines.append(
            pipeline(
                data_engineering_pipeline,
                inputs={"companies": "companies",
                        "shuttles": "shuttles",
                        "reviews": "reviews"},
                namespace=f"namespace_{i}",
            )
        )
    data = []
    for n in range(1, MAX, 10):
        start = time.monotonic()
        _ = sum(pipelines[:n])
        end = time.monotonic()
        print(f"Sum of {n} pipelines took: {end - start:0.3f}s")
        data.append((n, end - start))
    # uncomment to output chart / data
    # import pandas as pd
    # df = pd.DataFrame(data, columns=["n", "time"])
    # df.plot.scatter(x="n", y="time").get_figure().savefig("plot.png")
    return sum(pipelines)
- Run kedro registry list
Expected Result
Pipelines are listed quickly.
Actual Result
The pipelines are listed after a few minutes (depending on the number of pipelines/nodes), with the time increasing quadratically (see the chart above).
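One way to check the "quadratic" claim against the (n, time) pairs collected by the reproduction code is to estimate the exponent from a log-log fit. The sketch below is a minimal illustration using only the standard library; loglog_slope is a hypothetical helper, and the data is synthetic placeholder data generated from t = 0.002 * n^2, not measurements from this issue.

```python
import math


def loglog_slope(points):
    """Least-squares slope of log(time) against log(n).

    A slope near 2 suggests quadratic growth, a slope near 1 linear growth.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic placeholder data (NOT measured): an exactly quadratic curve
# sampled at the same n values the reproduction loop uses.
synthetic = [(n, 0.002 * n ** 2) for n in range(1, 60, 10)]
slope = loglog_slope(synthetic)
print(f"estimated exponent: {slope:.2f}")
```

With real timings the estimate will be noisier, because fixed overhead dominates at small n, so it is best read as a rough indicator rather than a proof of the growth rate.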
Possible causes
The main problem is that, internally, summing the pipelines calls __add__ and then __init__ in the Pipeline class. The slowness of the operations inside __add__ itself is partially addressed by #3146, but the problem with __init__ remains; the calls to _topologically_sorted in the constructor may be the root cause. Confirming this would require more detailed profiling.
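The profiling mentioned above could be done with the standard library's cProfile module. The following is a minimal sketch: profile_call is a hypothetical helper, and stand_in_workload is a placeholder so that the example runs without a Kedro project. In the reproduction, one would pass sum and the pipelines list instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile.

    Returns (result, report), where report is a text table of the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Hypothetical stand-in workload. In the reproduction this would be
# something like: profile_call(sum, pipelines)
def stand_in_workload(n):
    return sum(i * i for i in range(n))


result, report = profile_call(stand_in_workload, 10_000)
print(report)
```

If the constructor is indeed the bottleneck, entries such as __init__ and _topologically_sorted would be expected near the top of the cumulative-time column; that expectation is a hypothesis to test, not a confirmed result.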
Your Environment
- Kedro version used: 0.18.13
- Python version used: 3.10.13
- Operating system and version: macOS 13.0.1
Was this fully addressed by #3730?
I hope so 🤞🏻