Sentence-level bounding boxes are too large - include entire paragraphs instead of chunk text
Context: Using HybridChunker with a small chunk size (approximately one sentence per chunk) to extract sentence-level text chunks from PDFs.
Problem: The bounding boxes assigned to these small chunks cover entire paragraphs instead of just the chunk text. This occurs even when the chunk contains only a single sentence.
Question: Is it possible to get bounding boxes that exactly match the chunk text, rather than the paragraph? More generally, is it possible to get bounding boxes smaller than paragraph-level (e.g., sentence-level or word-level)?
Hi @sebihoefle! I'm Dosu and I’m helping the docling team.
HybridChunker assigns bounding boxes based on document nodes like paragraphs, so even with sentence-level chunking, the bounding boxes often cover entire paragraphs. This happens because the bounding box for a chunk is merged from the bounding boxes of all text cells in the group, which usually means the whole paragraph is included, not just the sentence chunked out of it (source).
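To illustrate why merging yields a paragraph-sized box, here is a minimal sketch of a bounding-box union. The `Box` type and `merge_boxes` helper are hypothetical names used for illustration only and are not docling's own classes; the coordinates assume a top-left origin.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Box:
    """Axis-aligned box with a top-left origin (hypothetical, not docling's type)."""
    left: float
    top: float
    right: float
    bottom: float


def merge_boxes(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that encloses every input box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one box")
    return Box(
        left=min(b.left for b in boxes),
        top=min(b.top for b in boxes),
        right=max(b.right for b in boxes),
        bottom=max(b.bottom for b in boxes),
    )


# Three text lines that make up one paragraph (made-up coordinates).
paragraph_lines = [
    Box(50, 100, 500, 112),
    Box(50, 114, 500, 126),
    Box(50, 128, 320, 140),
]

# The union spans the whole paragraph, regardless of which sentence a chunk holds.
paragraph_box = merge_boxes(paragraph_lines)
```

Once the boxes are merged like this, the result carries no information about where an individual sentence sits inside it.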
To get bounding boxes that match sentence-level chunks, you need to preprocess your document: split paragraphs into sentences using a tool like spaCy or NLTK, then replace each paragraph node with sentence-level nodes before chunking. This way, each chunk corresponds to a sentence, and its bounding box will be more precise (source).
For word-level bounding boxes (native PDF, not OCR), set pipeline_options.generate_parsed_pages = True and access result.pages[page_num].parsed_page, which exposes detailed bounding box data at the word level (source). This isn't available by default in the standard PDF-to-JSON output, so you'll need custom scripting to extract and serialize these word-level bounding boxes (source).
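As a rough sketch of the custom scripting mentioned above: once word-level boxes have been extracted, a sentence box can be approximated by locating the sentence's words in the word sequence and taking the union of their boxes. The `(text, (left, top, right, bottom))` tuple layout and the `sentence_box` helper are assumptions for illustration, not the structure docling returns, and the matching assumes the sentence's whitespace-separated tokens appear verbatim among the words.

```python
from typing import List, Optional, Tuple

# Hypothetical word record: (text, (left, top, right, bottom)), top-left origin.
BoxTuple = Tuple[float, float, float, float]
Word = Tuple[str, BoxTuple]


def sentence_box(words: List[Word], sentence: str) -> Optional[BoxTuple]:
    """Union the boxes of the first run of words that spells out `sentence`."""
    tokens = sentence.split()
    n = len(tokens)
    if n == 0:
        return None
    texts = [text for text, _ in words]
    # Slide a window over the words and look for an exact token match.
    for start in range(len(words) - n + 1):
        if texts[start:start + n] == tokens:
            boxes = [box for _, box in words[start:start + n]]
            return (
                min(b[0] for b in boxes),
                min(b[1] for b in boxes),
                max(b[2] for b in boxes),
                max(b[3] for b in boxes),
            )
    return None  # sentence not found among the words


# One paragraph line containing two sentences (made-up coordinates).
words = [
    ("Cats", (50, 100, 80, 112)),
    ("sleep.", (84, 100, 125, 112)),
    ("Dogs", (130, 100, 162, 112)),
    ("bark", (166, 100, 195, 112)),
    ("loudly.", (199, 100, 245, 112)),
]

second_sentence_box = sentence_box(words, "Dogs bark loudly.")
```

Note that a sentence wrapping across several lines would get a single union box covering the full width of those lines; returning one box per line would be tighter.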
@sebihoefle The bounding boxes that docling infers for text elements on a page are paragraph-scoped. If a chunk is created from a subset of a paragraph (e.g. a single sentence), it cannot be determined afterwards exactly where in the full paragraph box that sentence is located, so the returned box is that of the full paragraph. This is by design.