Add something like a `DocumentLengthRouter` which routes documents based on whether they are empty or not

Open sjrl opened this issue 6 months ago • 0 comments

Often when working with PDF files we run into complex scenarios with regard to non-text content. For example:

- The whole PDF could be a scan, so it has no text content to extract with our normal PDF converters
- Or only some pages in the PDF could be scans
- Certain pages in the PDF could be images, so they contain no textual content

In these scenarios it would be ideal if we could route the Documents produced from these PDFs (such as right after our PyPDFConverter or our DocumentSplitter) into a router component that determines whether each Document is empty or not. If it's empty, we might like to route it to either an OCRConverter or an ImageEmbedder so we can still utilize the information from this PDF.

Currently we handle this in platform by using two OutputAdapters that filter documents based on content length. The YAML config of these components is shown below.

    ContentFilter:
      type: haystack.components.converters.output_adapter.OutputAdapter
      init_parameters:
        output_type: List[haystack.Document]
        unsafe: true
        template: |
          {%- set non_empty_docs = [] -%}
          {%- for doc in documents -%}
            {%- set _ = non_empty_docs.append(doc) if doc.content|length > 1 else None -%}
          {%- endfor -%}
          {{ non_empty_docs }}

    NoContentFilter:
      type: haystack.components.converters.output_adapter.OutputAdapter
      init_parameters:
        output_type: List[haystack.Document]
        unsafe: true
        template: |
          {%- set empty_docs = [] -%}
          {%- for doc in documents -%}
            {%- set _ = None if doc.content|length > 1 else empty_docs.append(doc) -%}
          {%- endfor -%}
          {{ empty_docs }}

Instead of requiring complex OutputAdapters to accomplish this (or a similarly complex ConditionalRouter), I think we should make a Router called something like DocumentLengthRouter (I'm open to different names) that takes in documents and routes them to two output edges: non_empty_docs and empty_docs. Users can then decide what to do with the empty docs (e.g. OCR conversion, image embedding, etc.).
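
A rough sketch of the routing logic such a component could implement, in plain Python so it runs without Haystack installed. Everything here is an assumption for illustration: the `Document` dataclass is a stand-in for `haystack.Document`, the `threshold` parameter is hypothetical (its default of 1 mirrors the `length > 1` check in the templates above), and treating `None` content as empty is an added choice that the templates do not make. A real implementation would presumably be registered as a Haystack component rather than written as a bare class.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Document:
    """Stand-in for haystack.Document with only the field used here."""

    content: Optional[str] = None


class DocumentLengthRouter:
    """Hypothetical router: splits documents into non-empty and empty groups."""

    def __init__(self, threshold: int = 1) -> None:
        # Content longer than `threshold` characters counts as non-empty.
        # The default of 1 mirrors `doc.content|length > 1` in the templates.
        self.threshold = threshold

    def run(self, documents: List[Document]) -> Dict[str, List[Document]]:
        non_empty_docs: List[Document] = []
        empty_docs: List[Document] = []
        for doc in documents:
            # Assumption: missing content (None) is treated as length 0.
            length = len(doc.content) if doc.content is not None else 0
            if length > self.threshold:
                non_empty_docs.append(doc)
            else:
                empty_docs.append(doc)
        # One pass over the input produces both output edges.
        return {"non_empty_docs": non_empty_docs, "empty_docs": empty_docs}


router = DocumentLengthRouter()
result = router.run(
    [Document("Extracted page text"), Document(""), Document(None), Document("\f")]
)
```

Compared to the two OutputAdapters, this walks the document list once and keeps the emptiness rule in a single place. Note that with the default threshold, a document whose content is a single character (such as a lone form feed) lands in `empty_docs`, which matches the behavior of the templates.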

This would most likely be done when preprocessing and indexing files, so it is relevant to https://github.com/deepset-ai/haystack/issues/9321

Jun 13 '25 09:06 sjrl