
Enable grouping of pipelines via Sub-pipelines

Open ArzelaAscoIi opened this issue 9 months ago • 2 comments

Is your feature request related to a problem? Please describe.

Problem statement: Pipelines can grow large, and you want to group parts of them (e.g. sub-tasks like retrieval). When creating a pipeline, you want to reduce the mental load by solving a smaller challenge first (e.g. retrieval) and then moving on to the next part. Additionally, such a part might stay the same across multiple pipelines, so you want to reuse it.

Motivation for "sub-pipelines": when we shared Haystack with customers, offering only two concepts, pipelines and components, resonated very well with them.

Describe the solution you'd like

Why do we limit ourselves to connecting only components? Instead, we could allow pipelines to be passed to the add_component call (renamed to just "add"). This would enable grouping and reusing pipeline templates within projects, and it would mean fewer concepts to learn that do not generally apply to how graph execution engines work.

Describe alternatives you've considered

SuperComponents, which in turn contain pipelines.

Additional context

Pseudo code might look like this:

from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.core.pipeline import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Random pipeline handling any sort of retrieval
retriever_pipeline = Pipeline()
retriever_pipeline.add("retriever", InMemoryBM25Retriever(document_store=InMemoryDocumentStore()))


# Rag pipeline that uses a generic retriever pipeline
rag_pipeline = Pipeline()
rag_pipeline.add("retriever_pipeline", retriever_pipeline)  # add_component => add
rag_pipeline.add("llm", OpenAIChatGenerator())
rag_pipeline.add("prompt_builder", ChatPromptBuilder(template=...))
rag_pipeline.connect(
    "retriever_pipeline.documents", "prompt_builder.documents"
)  # connect a sub-pipeline's output to a component's input
rag_pipeline.connect("prompt_builder", "llm")

Serialized version might look like this:

# output after calling rag_pipeline.dumps()
components:
  llm:
    type: haystack.components.generators.chat.OpenAIChatGenerator
    init_parameters:
      model: gpt-4o-mini-2024-07-18
  prompt_builder:
    type: haystack.components.builders.ChatPromptBuilder
    init_parameters:
      template: ...

pipelines:
  - name: retriever_pipeline
    components:
      - retriever:
          type: haystack.components.retrievers.in_memory.InMemoryBM25Retriever
          init_parameters:
            document_store:
              type: haystack.document_stores.in_memory.InMemoryDocumentStore
              ....

connections:
  - sender: pipelines.retriever_pipeline.documents # Optional pipelines prefix if name ambiguous
    receiver: components.prompt_builder.documents # Optional components prefix if name ambiguous
  - sender: prompt_builder.prompt
    receiver: llm.messages

max_loops_allowed: 100
metadata: {}

ArzelaAscoIi • Mar 21 '25 15:03

Thanks for sharing this idea! Instead of renaming add_component to add, we could also introduce two new methods, add_pipeline and add. The latter would call either add_component or add_pipeline, depending on the type of the argument it receives.
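
A minimal sketch of that dispatch, just to make the suggestion concrete. `ToyPipeline` is a hypothetical stand-in written for this example and is not Haystack's `Pipeline`; `add_pipeline` and `add` are the proposed methods and do not exist in Haystack today. The shared-namespace check is an assumption, not something decided in this thread.

```python
from typing import Any, Dict


class ToyPipeline:
    """Hypothetical stand-in for a pipeline; NOT haystack's Pipeline class."""

    def __init__(self) -> None:
        self.components: Dict[str, Any] = {}
        self.pipelines: Dict[str, "ToyPipeline"] = {}

    def _check_name(self, name: str) -> None:
        # Assumption: components and sub-pipelines share one namespace.
        if name in self.components or name in self.pipelines:
            raise ValueError(f"Name '{name}' is already in use")

    def add_component(self, name: str, instance: Any) -> None:
        self._check_name(name)
        self.components[name] = instance

    def add_pipeline(self, name: str, pipeline: "ToyPipeline") -> None:
        # Proposed method: register a nested pipeline.
        self._check_name(name)
        if pipeline is self:
            raise ValueError("A pipeline cannot contain itself")
        self.pipelines[name] = pipeline

    def add(self, name: str, item: Any) -> None:
        # Proposed convenience method: dispatch on the type of `item`.
        if isinstance(item, ToyPipeline):
            self.add_pipeline(name, item)
        else:
            self.add_component(name, item)


retriever_pipeline = ToyPipeline()
retriever_pipeline.add("retriever", object())  # any non-pipeline object is a component

rag_pipeline = ToyPipeline()
rag_pipeline.add("retriever_pipeline", retriever_pipeline)
rag_pipeline.add("llm", object())
```

With this shape, existing add_component calls keep working, and add is only a thin convenience layer on top of the two explicit methods.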

julian-risch • Mar 24 '25 08:03

Here's some additional thinking about how existing SuperComponents could be transformed into sub-pipelines:

Current state:

SuperComponent's responsibilities are:

  • wrapping and running a pipeline within a pipeline
  • mapping inputs and outputs of the wrapped pipeline
  • generating a pipeline based on (more abstract) init parameters (factory-like)
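
To illustrate the second responsibility, here is a rough sketch of what mapping inputs and outputs around a wrapped pipeline could look like. It does not use Haystack at all: `run_with_mappings`, `fake_retrieval_pipeline`, and the shapes of `input_mapping` and `output_mapping` are assumptions chosen for this example.

```python
from typing import Any, Callable, Dict, List

# Inner data has the shape {"component_name": {"socket_name": value}}.
InnerData = Dict[str, Dict[str, Any]]


def run_with_mappings(
    run_inner: Callable[[InnerData], InnerData],
    inputs: Dict[str, Any],
    input_mapping: Dict[str, List[str]],
    output_mapping: Dict[str, str],
) -> Dict[str, Any]:
    """Translate flat outer inputs to inner sockets, run, and translate back."""
    inner_inputs: InnerData = {}
    for outer_name, value in inputs.items():
        # One outer input may fan out to several "component.socket" targets.
        for target in input_mapping[outer_name]:
            component, socket = target.split(".", 1)
            inner_inputs.setdefault(component, {})[socket] = value

    inner_outputs = run_inner(inner_inputs)

    outer_outputs: Dict[str, Any] = {}
    for source, outer_name in output_mapping.items():
        component, socket = source.split(".", 1)
        if socket in inner_outputs.get(component, {}):
            outer_outputs[outer_name] = inner_outputs[component][socket]
    return outer_outputs


def fake_retrieval_pipeline(data: InnerData) -> InnerData:
    # Hypothetical wrapped pipeline with a single "retriever" component.
    query = data["retriever"]["query"]
    return {"retriever": {"documents": [f"doc about {query}"]}}


result = run_with_mappings(
    fake_retrieval_pipeline,
    inputs={"query": "haystack"},
    input_mapping={"query": ["retriever.query"]},
    output_mapping={"retriever.documents": "documents"},
)
```

The caller only sees the flat names `query` and `documents`; the names of the inner components stay an implementation detail of the wrapped pipeline.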

To flesh out the idea above:

  • move the first 2 responsibilities to Pipeline or an extended version of Pipeline (which could inherit from Pipeline or not):
    • allow calling sub-pipelines within pipelines explicitly, without a wrapping component
    • add input and output mappings that pipeline.run handles automatically (this makes sense anyway, since we already have this additional mapping code implemented in our query-api and indexing consumer)
  • convert all implemented SuperComponents into pipeline factories
    • these factories will return pipeline objects
    • the returned pipeline objects contain metadata about their origin, e.g. factory parameters (so they can be easily recreated using the factory)
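
The factory part could be sketched roughly as follows. Everything here is hypothetical and independent of Haystack: `SketchPipeline`, the `pipeline_factory` decorator, the `FACTORIES` registry, the `"origin"` metadata key, and the `bm25_retrieval` factory are names invented for this example.

```python
from dataclasses import dataclass, field
from functools import wraps
from typing import Any, Callable, Dict


@dataclass
class SketchPipeline:
    """Hypothetical stand-in for a pipeline object; NOT haystack's Pipeline."""

    components: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)


# Registry of known factories, keyed by function name (an assumption).
FACTORIES: Dict[str, Callable[..., SketchPipeline]] = {}


def pipeline_factory(func: Callable[..., SketchPipeline]) -> Callable[..., SketchPipeline]:
    """Register a factory and stamp its name and parameters on each result."""

    @wraps(func)
    def wrapper(**params: Any) -> SketchPipeline:
        pipeline = func(**params)
        pipeline.metadata["origin"] = {"factory": func.__name__, "params": dict(params)}
        return pipeline

    FACTORIES[func.__name__] = wrapper
    return wrapper


@pipeline_factory
def bm25_retrieval(top_k: int = 10) -> SketchPipeline:
    # A real factory would build actual components from these parameters.
    pipeline = SketchPipeline()
    pipeline.components["retriever"] = {"kind": "bm25", "top_k": top_k}
    return pipeline


def recreate(pipeline: SketchPipeline) -> SketchPipeline:
    """Rebuild a pipeline from the origin metadata it carries."""
    origin = pipeline.metadata["origin"]
    return FACTORIES[origin["factory"]](**origin["params"])


original = bm25_retrieval(top_k=3)
rebuilt = recreate(original)
```

Because the origin metadata travels with the pipeline object, a serialized pipeline would only need to store the factory name and its parameters to be rebuilt later.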

Main benefits:

  • we get rid of yet another abstraction, "SuperComponent", that needs to be understood and handled separately
  • reusable pipeline patterns could be easily converted into sub-pipelines and reused anywhere (without going through another concept, i.e. implementing a SuperComponent or factory)
  • building additional features on top of Haystack (e.g. fitting API and UI concepts like visualizing sub-pipelines) would most likely be faster, as there are fewer moving parts to deal with

tstadel • May 22 '25 12:05