
Chunking Hacker News discussion page: HybridChunker hangs

Open miohtama opened this issue 2 months ago • 8 comments

Question

Thank you for a great library.

I am using Docling 2.57.0.

I am attempting to import some web page data using Docling (through the Haiku RAG library). I have encountered a web page that is 1) not very complicated (3.5 MB) and 2) causes the Docling chunker to hang.

The page causes some pathological behaviour, and HybridChunker hangs (never returns). I am running on a powerful MacBook M3 laptop. I assume this is because the Hacker News page in question uses the legacy HTML <table> element extensively for its page layout.

This particular page is this Hacker News discussion page.

Below is Python code to reproduce the issue.

My goal is not to have the software hang under any circumstances.

My questions are:

  • Is this a bug or a feature?
  • If it is a feature, how can I improve Docling's robustness so that I have some sort of fallback for the chunker/converter that will not hang when we detect a web page that is dangerous to feed into it?

hacker-news-discussion.html

Code to reproduce:

"""A discussion heavy Hacker News web page hangs/overloads Docling."""
from pathlib import Path

import tiktoken
from docling.chunking import HybridChunker  # type: ignore
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
from docling_core.types import DoclingDocument

chunk_size = 256
source_path = Path.cwd() / "deps" / "haiku.rag" / "examples" / "samples" / "hacker-news-discussion.html"

print(f"File size is {source_path.stat().st_size / 1024:.2f} KB")

tokenizer = OpenAITokenizer(
    tokenizer=tiktoken.encoding_for_model("gpt-4o"),
    max_tokens=chunk_size,  # cap each chunk at chunk_size tokens
)
chunker = HybridChunker(tokenizer=tokenizer)

converter = DocumentConverter()
conversion_result = converter.convert(source_path)
docling_document: DoclingDocument = conversion_result.document

print(f"Docling document has {len(docling_document.texts)} texts, {len(docling_document.tables)} tables, {len(docling_document.pictures)} pictures.")

# Too much for the chunker to handle
print("Starting chunking...")
chunks = list(chunker.chunk(docling_document))

print(f"Generated {len(chunks)} chunks")

Example run:

2025-10-22 19:06:20,267 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2025-10-22 19:06:21,771 - INFO - Going to convert document batch..., docs in batch: 1, doc_batch_concurrency: 1, doc_batch_size: 1
2025-10-22 19:06:21,771 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-10-22 19:06:21,778 - INFO - Loading plugin 'docling_defaults'
2025-10-22 19:06:21,778 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-22 19:06:21,778 - INFO - Processing document hacker-news-discussion.html
2025-10-22 19:06:42,491 - INFO - Finished converting document hacker-news-discussion.html in 22.23 sec.
Docling document has 22395 texts, 1904 tables, 7622 pictures.
Starting chunking...

The original haiku.rag issue is here: https://github.com/ggozad/haiku.rag/issues/112

CC @ggozad


miohtama avatar Oct 22 '25 17:10 miohtama

Hi @miohtama! I'm Dosu and I'm helping the docling team.

This is a known limitation, not an intended feature. HybridChunker can hang or fail on complex HTML documents with many tables, especially when table parsing or chunking logic encounters pathological structures or excessive nesting. There are no built-in timeouts, chunk/table limits, or fallback mechanisms to guarantee chunking completes for such cases—this is a recognized gap in robustness (source, source, source).

Recent work has improved table handling, but chunking large or complex HTML remains challenging (source). Users have requested features like bypassing semantic chunking for tables and adding fallback wrappers to ensure at least one chunk is returned, but these are not yet built-in (source, source).

To improve robustness, you can wrap the chunker with a fallback: after chunking, if no chunks are returned (or if chunking hangs), return a chunk with the whole document's content or split by token length. You can also preprocess HTML to simplify tables or use custom serializers (source). If you want a Python example for a fallback wrapper, let me know!
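
For illustration, here is a minimal sketch of such a fallback wrapper. It is not a built-in Docling API: CHUNK_TIMEOUT_S, naive_token_chunks, and chunk_with_fallback are illustrative names, the sketch assumes the DoclingDocument pickles cleanly across the process boundary, and on macOS (spawn start method) the calling script must be guarded with if __name__ == "__main__":

"""Sketch: run HybridChunker in a child process with a hard timeout, and
fall back to naive fixed-size token chunks if it hangs, crashes, or
returns nothing."""
import multiprocessing as mp
from queue import Empty

import tiktoken
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
from docling_core.types import DoclingDocument

CHUNK_TIMEOUT_S = 60  # illustrative budget; tune for your workload


def _chunk_worker(doc, max_tokens, out):
    """Runs in a child process so a hang can be terminated from outside."""
    tokenizer = OpenAITokenizer(
        tokenizer=tiktoken.encoding_for_model("gpt-4o"),
        max_tokens=max_tokens,
    )
    chunker = HybridChunker(tokenizer=tokenizer)
    # Send plain strings back to avoid pickling chunk objects.
    out.put([chunk.text for chunk in chunker.chunk(doc)])


def naive_token_chunks(doc: DoclingDocument, max_tokens: int) -> list[str]:
    """Dumb but reliable fallback: fixed token windows over the markdown export."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    tokens = enc.encode(doc.export_to_markdown())
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def chunk_with_fallback(doc: DoclingDocument, max_tokens: int = 256) -> list[str]:
    out = mp.Queue()
    proc = mp.Process(target=_chunk_worker, args=(doc, max_tokens, out))
    proc.start()
    try:
        chunks = out.get(timeout=CHUNK_TIMEOUT_S)  # wait for the worker's result
    except Empty:
        chunks = []  # timed out: treat the worker as hung
    finally:
        if proc.is_alive():
            proc.terminate()  # hard-kill the hung worker
        proc.join()
    return chunks or naive_token_chunks(doc, max_tokens)

A process boundary (rather than a thread) is what makes the timeout enforceable: a hung Python thread cannot be killed, but a child process can be terminated.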

To reply, just mention @dosu.



dosubot[bot] avatar Oct 22 '25 17:10 dosubot[bot]

Just for context, I am using Docling 2.57.0. Might this have been caused by the recent updates here?

https://github.com/docling-project/docling/issues/2510#issuecomment-3433396913

miohtama avatar Oct 23 '25 07:10 miohtama

Recent updates in Docling 2.57.0 improved support for rich and nested tables, but did not fully resolve hangs or robustness issues with HybridChunker on large, table-heavy HTML pages. The hanging behavior is a longstanding limitation, not something newly introduced in 2.57.0. There are still no built-in timeouts, chunk/table limits, or fallback mechanisms to guarantee chunking completes for pathological documents—this is a recognized gap in robustness. Workarounds include preprocessing HTML to simplify tables or wrapping the chunker with a fallback that returns a basic chunk if chunking fails or hangs. For more details, see the discussion and recent changes here and here.
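
As an illustration of the HTML-preprocessing route, here is a minimal sketch that unwraps Hacker News-style layout tables before conversion. It assumes BeautifulSoup is installed and reuses source_path from the reproduction script above; flatten_layout_tables is an illustrative helper, not a Docling API, and it deliberately destroys table structure, so it only makes sense when the tables are layout scaffolding rather than data:

"""Sketch: strip layout tables from the HTML before handing it to Docling."""
from io import BytesIO

from bs4 import BeautifulSoup
from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter


def flatten_layout_tables(html: str) -> str:
    """Unwrap layout tables, keeping their text as plain block elements."""
    soup = BeautifulSoup(html, "html.parser")
    # Turn each cell into a <div> so its content survives as a block element.
    for cell in soup.find_all(["td", "th"]):
        cell.name = "div"
    # Then strip the table scaffolding itself, keeping the (now-div) children.
    for tag in soup.find_all(["table", "thead", "tbody", "tr"]):
        tag.unwrap()
    return str(soup)


cleaned = flatten_layout_tables(source_path.read_text())
stream = DocumentStream(name="cleaned.html", stream=BytesIO(cleaned.encode("utf-8")))
conversion_result = DocumentConverter().convert(stream)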

To reply, just mention @dosu.



dosubot[bot] avatar Oct 23 '25 07:10 dosubot[bot]

@miohtama thanks for reporting this issue. We are currently checking the performance of the HTML table parsing, which was updated in recent releases and could be the root cause of your issues with the HybridChunker (#2509).

ceberam avatar Oct 23 '25 08:10 ceberam

@ceberam Awesome. Let me know if I can help in any way.

miohtama avatar Oct 24 '25 10:10 miohtama

Thanks @miohtama, we are currently fixing the root cause in the HTML processing (basically the table parsing) and will provide an update in the next few days.

ceberam avatar Oct 24 '25 10:10 ceberam

@miohtama We just released a new version of docling, v2.61.1, that improves the parsing performance of HTML files. I went through the steps you described and came across the same issue; it is quite clear that the root cause is the HybridChunker's difficulty dealing with many large tables. I think it would be good to have that sort of fallback for the chunker, especially on certain types of node items, as pointed out in #1831. We will need to tackle this and you are more than welcome to contribute if you like.

ceberam avatar Nov 06 '25 09:11 ceberam

Good morning! Looks much better.

For this particular page, the process completes. However, it still takes half an hour, so it is probably not a general solution for all problematic web pages.

I feel that, as a workaround, I could simply skip pages based on the table element count: e.g. if there are > 150 tables, don't try to index the page until we have a fallback available.
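
A quick sketch of that guard (is_safe_to_index and the 150 threshold are just my illustration):

def is_safe_to_index(html: str, max_tables: int = 150) -> bool:
    # Cheap pre-flight check: count opening <table> tags before converting.
    # A substring count is crude, but fast enough for a guard.
    return html.lower().count("<table") <= max_tables


html = source_path.read_text()
if is_safe_to_index(html):
    conversion_result = converter.convert(source_path)
else:
    print(f"Skipping {source_path.name}: too many tables for the chunker")

For reference, the run with the new version: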

File size is 3115.24 KB
2025-11-30 14:05:03,948 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2025-11-30 14:05:06,700 - INFO - Going to convert document batch...
2025-11-30 14:05:06,702 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-11-30 14:05:06,707 - INFO - Loading plugin 'docling_defaults'
2025-11-30 14:05:06,708 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-11-30 14:05:06,708 - INFO - Processing document hacker-news-discussion.html
2025-11-30 14:05:28,041 - INFO - Finished converting document hacker-news-discussion.html in 24.09 sec.
Docling document has 22420 texts, 1904 tables, 7628 pictures.
Starting chunking...
Done in 1510.48 seconds

miohtama avatar Dec 01 '25 09:12 miohtama