docling
Overlapping layout clusters
Bug
- I see cases where clusters overlap, causing cells to be duplicated.
- I see cases where cells are assigned to the wrong cluster.
Steps to reproduce
Run the minimal.py example and examine the resulting structure.
Docling version
Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0
Python version
Python 3.11.4
@InbarShapira can you please provide some detail on which overlapping clusters and duplicated cells you are seeing with the example? That would help. You can also use the docling CLI with debug visualizations enabled, such as:
docling --debug-visualize-cells --debug-visualize-layout your_file.pdf
I've run into a similar situation. Unfortunately, I can't share my documents, but I can maybe offer some diagnostic information:
- I have scanned PDFs, and have the `force_full_page_ocr` flag on.
- There are images of tables taken from other documents in the scanned PDF
- The images of tables are inverted (white text on black background)
- The original document is not inverted (black text on white background)
- The layout model classifies such regions as both a picture and a table, resulting in overlapped clusters.
For others who need a stopgap solution:
- I used a custom PDF pipeline and added a step in the `build_pipe` right before the `PageAssemble` model to remove overlapping clusters.
- Overlaps are detected using `box_iou` scores with a threshold of `0.95`. The high threshold filters out situations where a small picture is entirely contained in a table, so those are not treated as overlaps.
- When 2 clusters overlap, I remove the cluster with the lower confidence score.
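For anyone who wants to try the same idea, here is a minimal sketch of the filtering step described above. It is not docling code: the `Cluster` dataclass, the plain-Python `box_iou` helper, and `remove_overlapping_clusters` are hypothetical stand-ins for illustration, and wiring the step into a custom pipeline is left out. It also assumes an overlap means an IoU strictly above the threshold, and it uses a greedy pass from most to least confident cluster, which may differ slightly from a strictly pairwise comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical stand-in for a layout cluster (not docling's own class)."""
    label: str
    confidence: float
    bbox: tuple[float, float, float, float]  # (left, top, right, bottom)


def box_iou(a, b) -> float:
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not intersect
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def remove_overlapping_clusters(clusters, iou_threshold=0.95):
    """Drop the lower-confidence cluster of every near-identical pair.

    Clusters are visited from most to least confident; a cluster is kept
    only if its IoU with every already-kept cluster is at or below the
    threshold. The result is ordered by descending confidence.
    """
    kept = []
    for cluster in sorted(clusters, key=lambda c: c.confidence, reverse=True):
        if all(box_iou(cluster.bbox, k.bbox) <= iou_threshold for k in kept):
            kept.append(cluster)
    return kept


# A table and a picture covering almost the same region, plus a small
# picture fully inside the table (low IoU, so it is left alone).
table = Cluster("table", 0.9, (0, 0, 100, 100))
picture = Cluster("picture", 0.6, (0, 0, 100, 99))
small_picture = Cluster("picture", 0.5, (10, 10, 20, 20))

filtered = remove_overlapping_clusters([picture, small_picture, table])
```

With these sample boxes, the table and the large picture have an IoU of 0.99, so only the more confident one survives, while the small contained picture has an IoU of 0.01 with the table and is kept.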