Chunker not returning chunks on PDFs with only header sections.
Bug
When a PDF is so simple that it seemingly contains only two headers, the hybrid chunker does not return any chunks, even though the export_to_markdown function returns a result.
Steps to reproduce
https://colab.research.google.com/drive/15yDYj501f-W3ro55Nv1UfP_R-P5TrIxU?usp=sharing
Docling version
Version: 2.52.0
Python version
3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]
The PDF in question (also included in the Colab): broken_pdf.pdf
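For completeness, a minimal repro sketch of what the notebook does (broken_pdf.pdf is the attached file):

from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker

doc = DocumentConverter().convert("broken_pdf.pdf").document

chunks = list(HybridChunker().chunk(dl_doc=doc))
print(len(chunks))               # prints 0 -- no chunks are returned
print(doc.export_to_markdown())  # yet the markdown export is non-empty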
Hi @FAbrahamDev! I'm Dosu and I'm helping the docling team.
The HybridChunker is designed to split documents based on structural anchors like headers and paragraphs. For very simple PDFs that contain only headers and no body text, it may not return any chunks because its logic requires enough content or anchor elements to form meaningful chunks; this is intentional to avoid trivial or empty outputs in most cases. The export_to_markdown function, on the other hand, exports any available content, so it can produce output even for minimal documents like yours. If you need to chunk header-only documents, you may need to implement a custom chunker or a post-processing step that treats headers as valid chunks, since there are currently no configuration options in HybridChunker to relax these requirements for minimal documents. More details on this behavior can be found here.
To reply, just mention @dosu.
@dosu Good bot. But mate, I'll paraphrase: "Not returning any chunks to avoid empty outputs" seems contradictory. I don't think that behavior is intentional. More likely, either the classification of both lines in the document as headers is faulty, or there should be a fallback in the hybrid chunker so that when nothing is returned, everything is placed in one chunk / split by max token length. Or something.
You're right—the HybridChunker not returning any chunks for header-only PDFs isn't an intentional feature, but a limitation of the current logic. The chunker relies on finding non-header content (like paragraphs or lists) to yield chunks, and headers themselves are only tracked for context, not chunked as content. So, if your document is just headers, you get no chunks—there's no fallback to chunk everything or split by max token length in this case. This isn't due to faulty header classification; it's just that the chunkers deterministically treat TitleItem and SectionHeaderItem as headers and skip them for chunking purposes (source).
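If you want to verify the classification yourself, here is a quick diagnostic sketch (assuming doc is the converted DoclingDocument from the notebook):

for item, _level in doc.iterate_items():
    # For a header-only PDF you would expect only TitleItem /
    # SectionHeaderItem entries here, matching what the chunker skips.
    print(type(item).__name__, getattr(item, "label", None), getattr(item, "text", ""))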
If you want to ensure at least one chunk for minimal/header-only docs, you'll need a custom chunker or a post-processing step. For example, after chunking, if you get zero chunks, you could create a fallback chunk with all the document's text or split it by token length. This approach isn't built-in yet, but it's straightforward to add as a wrapper around the chunker. If you want a code snippet for that, let me know!
To reply, just mention @dosu.
@dosu Sure, provide a fallback wrapper that returns the whole document's content in one chunk, where the chunk has the same object signature as the hybrid chunker would produce, so any downstream task that expects a Docling chunk can work with that fallback solution. I'm talking about bounding boxes, the contextualize function working, page numbers provided per chunk, etc.
So I'm not sure a simple fallback is that straightforward tbh. But let's see.
Here's a Python wrapper that ensures you always get at least one chunk, with the same structure as HybridChunker output—including doc_items, bounding boxes, page numbers, and compatibility with contextualize and downstream Docling utilities:
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.hierarchical_chunker import DocChunk, DocMeta
from docling_core.types.doc.document import DoclingDocument

def chunk_with_fallback(doc: DoclingDocument, chunker: HybridChunker, **kwargs):
    chunks = list(chunker.chunk(doc, **kwargs))
    if chunks:
        return chunks
    # Fallback: aggregate all DocItem objects (including headers)
    doc_items = []
    headings = []
    for item, _ in doc.iterate_items(with_groups=True):
        # Collect all DocItem instances (TitleItem, SectionHeaderItem, etc.)
        if hasattr(item, "label"):
            doc_items.append(item)
        # Optionally collect headings for meta
        if getattr(item, "label", None) in ("title", "section_header"):
            headings.append(getattr(item, "text", ""))
    # Use export_to_markdown for text, or join item.text fields
    text = doc.export_to_markdown()
    # Use document origin if available
    origin = getattr(doc, "origin", None)
    # Build DocMeta and DocChunk
    meta = DocMeta(doc_items=doc_items, headings=headings or None, origin=origin)
    fallback_chunk = DocChunk(text=text, meta=meta)
    return [fallback_chunk]
This fallback chunk will include all document items (including headers), so bounding boxes and page numbers are preserved via the prov field on each DocItem. The chunk object matches the HybridChunker signature and works with contextualize and downstream tasks that expect Docling chunks (reference, reference).
If you want to further split the fallback chunk by token length, you can adapt the logic from HybridChunker's _split_using_plain_text method. Let me know if you want an example for that!
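For example, downstream access would look like this (a sketch, assuming the fallback returned a valid chunk; chunker and doc as above):

chunk = chunk_with_fallback(doc=doc, chunker=chunker)[0]
print(chunker.contextualize(chunk=chunk))  # heading-aware text for embedding
for item in chunk.meta.doc_items:
    for prov in item.prov:                 # ProvenanceItem per DocItem
        print(prov.page_no, prov.bbox)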
To reply, just mention @dosu.
So the cool fallback from the AI doesn't work:
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
/tmp/ipython-input-668371206.py in <cell line: 0>()
     30     return [fallback_chunk]
     31
---> 32 chunk_with_fallback(
     33     doc=doc,
     34     chunker=chunker

/usr/local/lib/python3.12/dist-packages/pydantic/main.py in __init__(self, **data)
    251     # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    252     __tracebackhide__ = True
--> 253     validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
    254     if self is not validated_self:
    255         warnings.warn(

ValidationError: 1 validation error for DocMeta
doc_items.0
  Input should be a valid dictionary or instance of DocItem [type=model_type, input_value=GroupItem(self_ref='#/bod...ECIFIED: 'unspecified'>), input_type=GroupItem]
    For further information visit https://errors.pydantic.dev/2.11/v/model_type
If any human maintainer comes along and is interested in fixing this issue, maybe that helps?
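For what it's worth, the ValidationError points at the cause: iterate_items(with_groups=True) also yields GroupItem objects, and DocMeta.doc_items only accepts DocItem instances. Below is a corrected sketch that filters on isinstance; I've only reasoned it through against this failure mode, so treat it as a starting point rather than a tested fix.

from docling_core.transforms.chunker.hierarchical_chunker import DocChunk, DocMeta
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc.document import DocItem, DoclingDocument

def chunk_with_fallback(doc: DoclingDocument, chunker: HybridChunker, **kwargs):
    chunks = list(chunker.chunk(dl_doc=doc, **kwargs))
    if chunks:
        return chunks
    # Keep only real DocItem instances; GroupItem (and other NodeItems)
    # is exactly what DocMeta rejected in the traceback above.
    doc_items = [
        item for item, _ in doc.iterate_items(with_groups=True)
        if isinstance(item, DocItem)
    ]
    headings = [
        getattr(item, "text", "")
        for item in doc_items
        if getattr(item, "label", None) in ("title", "section_header")
    ]
    meta = DocMeta(
        doc_items=doc_items,
        headings=headings or None,
        origin=getattr(doc, "origin", None),
    )
    return [DocChunk(text=doc.export_to_markdown(), meta=meta)]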