
Overlapping layout clusters

Open InbarShapira opened this issue 11 months ago • 2 comments

Bug

  1. I see cases where clusters overlap, causing cells to be duplicated.
  2. I see cases where cells are assigned to the wrong cluster.

Steps to reproduce

Run the minimal.py example and examine the resulting structure.

Docling version

Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

Python 3.11.4

InbarShapira • Jan 14 '25 14:01

@InbarShapira can you please provide some detail on which overlapping clusters and duplicated cells you are seeing with the example? That would help. You can also use the docling CLI with debug visualizations enabled, such as:

docling --debug-visualize-cells --debug-visualize-layout your_file.pdf

cau-git • Jan 14 '25 14:01

I've run into a similar situation. Unfortunately, I can't share my documents, but I can maybe offer some diagnostic information:

  1. I have scanned PDFs and have the force_full_page_ocr flag on.
  2. The scanned PDFs contain images of tables taken from other documents.
  3. The images of tables are inverted (white text on a black background).
  4. The rest of the document is not inverted (black text on a white background).
  5. The layout model classifies such regions as both a picture and a table, resulting in overlapping clusters.

For others who need a stopgap solution:

  1. I used a custom PDF pipeline and added a step in the build_pipe, right before the PageAssemble model, that removes overlapping clusters.
  2. Overlaps are detected using box_iou scores with a threshold of 0.95. The high threshold avoids flagging cases where a small picture is entirely contained in a table, since such pairs have a low IoU.
  3. When two clusters overlap, I remove the one with the lower confidence score.
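
For anyone who wants to try the same idea, the filtering logic described above can be sketched roughly as follows. This is a minimal, library-independent illustration, not the actual pipeline code: the `Cluster` dataclass, the `iou` helper, and `drop_overlapping` are hypothetical names, the boxes are assumed to be (left, top, right, bottom) tuples, and treating an IoU of exactly 0.95 as an overlap is an assumption. Wiring it into a custom docling pipeline is left out.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # Hypothetical stand-in for a layout cluster; not docling's own class.
    label: str
    confidence: float
    bbox: tuple[float, float, float, float]  # (left, top, right, bottom)


def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_overlapping(clusters, threshold=0.95):
    """Keep only the higher-confidence cluster of each near-duplicate pair.

    Assumption: two clusters count as overlapping when IoU >= threshold.
    """
    kept = []
    # Visit clusters from most to least confident, so a cluster is only
    # ever compared against candidates that would win the tie-break.
    for cluster in sorted(clusters, key=lambda c: c.confidence, reverse=True):
        if all(iou(cluster.bbox, k.bbox) < threshold for k in kept):
            kept.append(cluster)
    return kept


clusters = [
    Cluster("table", 0.91, (10, 10, 200, 100)),
    Cluster("picture", 0.62, (10, 10, 200, 101)),  # near-duplicate of the table
    Cluster("picture", 0.80, (20, 20, 60, 50)),    # small picture inside the table
]
kept = drop_overlapping(clusters)
print([(c.label, c.confidence) for c in kept])
```

In this made-up example the near-duplicate picture is dropped in favour of the higher-confidence table, while the small picture nested inside the table survives because its IoU with the table is far below the threshold.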

clovisNyu • Feb 05 '25 03:02