Chunker not returning chunks on PDFs with only header sections.

Open FAbrahamDev opened this issue 3 months ago • 6 comments

Bug

When a PDF is so simple that it seemingly contains only two headers, the hybrid chunker does not return any chunks, even though the export_to_markdown function returns a result.

Steps to reproduce

https://colab.research.google.com/drive/15yDYj501f-W3ro55Nv1UfP_R-P5TrIxU?usp=sharing

Docling version

Version: 2.52.0

Python version

3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]

The PDF in question, also included in the Colab: broken_pdf.pdf

FAbrahamDev • Sep 12 '25 12:09

Hi @FAbrahamDev! I'm Dosu and I’m helping the docling team.

The HybridChunker is designed to split documents based on structural anchors like headers and paragraphs. For very simple PDFs that contain only headers and no body text, it may not return any chunks, because its logic requires enough content or anchor elements to form meaningful chunks; this is intentional, to avoid trivial or empty outputs in most cases. The export_to_markdown function, on the other hand, exports any available content, so it can produce output even for minimal documents like yours. If you need to chunk header-only documents, you may need to implement a custom chunker or post-processing step to treat headers as valid chunks, since there are currently no configuration options in HybridChunker to relax these requirements for minimal documents. More details on this behavior can be found here.

dosubot[bot] • Sep 12 '25 12:09

@dosu Good bot. But mate, to paraphrase: "not returning any chunks to avoid empty outputs" seems contradictory. I don't think that behavior is intentional. More likely, either the classification of both lines in the document as headers is faulty, or the hybrid chunker should have a fallback that, when nothing is returned, places everything in one chunk or splits it by max token length. Or something.

FAbrahamDev • Sep 12 '25 12:09

You're right—the HybridChunker not returning any chunks for header-only PDFs isn't an intentional feature, but a limitation of the current logic. The chunker relies on finding non-header content (like paragraphs or lists) to yield chunks, and headers themselves are only tracked for context, not chunked as content. So, if your document is just headers, you get no chunks—there's no fallback to chunk everything or split by max token length in this case. This isn't due to faulty header classification; it's just that the chunkers deterministically treat TitleItem and SectionHeaderItem as headers and skip them for chunking purposes (source).

If you want to ensure at least one chunk for minimal/header-only docs, you'll need a custom chunker or a post-processing step. For example, after chunking, if you get zero chunks, you could create a fallback chunk with all the document's text or split it by token length. This approach isn't built-in yet, but it's straightforward to add as a wrapper around the chunker. If you want a code snippet for that, let me know!

dosubot[bot] • Sep 12 '25 12:09

@dosu Sure, provide a fallback wrapper that returns the whole document's content in one chunk, with the same object signature the hybrid chunker would produce, so that any downstream task expecting a docling chunk can work with the fallback. I'm talking about bounding boxes, a working contextualize function, page numbers per chunk, etc.

So I'm not sure a simple fallback is that straightforward tbh. But let's see.

FAbrahamDev • Sep 12 '25 12:09

Here's a Python wrapper that ensures you always get at least one chunk, with the same structure as HybridChunker output—including doc_items, bounding boxes, page numbers, and compatibility with contextualize and downstream Docling utilities:

from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.hierarchical_chunker import DocChunk, DocMeta
from docling_core.types.doc.document import DoclingDocument

def chunk_with_fallback(doc: DoclingDocument, chunker: HybridChunker, **kwargs):
    chunks = list(chunker.chunk(doc, **kwargs))
    if chunks:
        return chunks

    # Fallback: aggregate all DocItem objects (including headers)
    doc_items = []
    headings = []
    for item, _ in doc.iterate_items(with_groups=True):
        # Collect all DocItem instances (TitleItem, SectionHeaderItem, etc.)
        if hasattr(item, "label"):
            doc_items.append(item)
            # Optionally collect headings for meta
            if getattr(item, "label", None) in ("title", "section_header"):
                headings.append(getattr(item, "text", ""))

    # Use export_to_markdown for text, or join item.text fields
    text = doc.export_to_markdown()

    # Use document origin if available
    origin = getattr(doc, "origin", None)

    # Build DocMeta and DocChunk
    meta = DocMeta(doc_items=doc_items, headings=headings or None, origin=origin)
    fallback_chunk = DocChunk(text=text, meta=meta)
    return [fallback_chunk]

This fallback chunk will include all document items (including headers), so bounding boxes and page numbers are preserved via the prov field on each DocItem. The chunk object matches the HybridChunker signature and works with contextualize and downstream tasks that expect Docling chunks (reference, reference).

If you want to further split the fallback chunk by token length, you can adapt the logic from HybridChunker's _split_using_plain_text method. Let me know if you want an example for that!
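A minimal sketch of such a token-length split, under loud assumptions: it does not reproduce HybridChunker's internal method, `split_by_max_tokens` is a hypothetical helper name, and whitespace-separated words stand in for real tokenizer tokens. With a real tokenizer, the pieces would need to be decoded back to text rather than joined with spaces.

```python
from typing import Callable, List


def split_by_max_tokens(
    text: str,
    max_tokens: int,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Split text into pieces of at most max_tokens tokens each.

    Assumption: tokens are whitespace-separated words, so pieces can be
    rebuilt by joining with a single space.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    tokens = tokenize(text)
    # Walk the token list in windows of max_tokens and rebuild each piece.
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

Each returned piece could then be wrapped in its own chunk object; an empty input yields an empty list, so the caller still has to decide what to do with documents that contain no text at all.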

dosubot[bot] • Sep 12 '25 12:09

So the cool fallback from the AI doesn't work:

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
/tmp/ipython-input-668371206.py in <cell line: 0>()
     30     return [fallback_chunk]
     31 
---> 32 chunk_with_fallback(
     33     doc=doc,
     34     chunker=chunker

1 frames
/usr/local/lib/python3.12/dist-packages/pydantic/main.py in __init__(self, **data)
    251         # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    252         __tracebackhide__ = True
--> 253         validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
    254         if self is not validated_self:
    255             warnings.warn(

ValidationError: 1 validation error for DocMeta
doc_items.0
  Input should be a valid dictionary or instance of DocItem [type=model_type, input_value=GroupItem(self_ref='#/bod...ECIFIED: 'unspecified'>), input_type=GroupItem]
    For further information visit https://errors.pydantic.dev/2.11/v/model_type

If any human maintainer comes along and would be interested in fixing this issue, maybe that helps?

FAbrahamDev • Sep 12 '25 14:09
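The traceback points at a likely cause: the wrapper iterates with with_groups=True and keeps anything that has a label attribute, so a GroupItem ends up in doc_items, which DocMeta only accepts as DocItem instances. A minimal sketch of the filtering step that would avoid this, assuming an isinstance check is acceptable; `collect_doc_items` and the two stand-in classes are hypothetical and exist only so the snippet runs without docling installed.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Tuple, Type


def collect_doc_items(pairs: Iterable[Tuple[Any, int]], item_type: Type) -> List[Any]:
    """Keep only nodes that are instances of item_type, dropping group nodes.

    `pairs` is expected to yield (item, level) tuples, the shape the
    wrapper above unpacks from doc.iterate_items(...).
    """
    return [item for item, _level in pairs if isinstance(item, item_type)]


# Stand-ins for illustration only; these are NOT the docling classes.
@dataclass
class FakeDocItem:
    text: str


@dataclass
class FakeGroupItem:
    name: str


nodes = [
    (FakeGroupItem("body"), 0),
    (FakeDocItem("Header 1"), 1),
    (FakeDocItem("Header 2"), 1),
]
kept = collect_doc_items(nodes, FakeDocItem)
```

In the wrapper itself, the assumed equivalent would be to pass the document's iterate_items(...) result together with the real DocItem class (importable, as far as I know, from docling_core.types.doc.document) and hand the filtered list to DocMeta. Whether contextualize and the other downstream utilities then behave as expected on a header-only chunk is untested here.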