Issue with Extracting Tables with Merged Rows

Open MahmoudAtef999 opened this issue 1 year ago • 9 comments

Hello,

I’m encountering an issue when extracting tables containing merged rows. Specifically, when a cell spans multiple rows, the expected behavior is to assign it a row_span value greater than 1. However, in many cases, the extraction process fails to identify the correct row_span value, often assigning a lower value than the actual span. This results in blank cells appearing in the subsequent rows rather than merging as intended.
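To make the symptom concrete, here is a minimal sketch in plain Python that does not use Docling at all. The `Cell` type and `to_grid` helper are hypothetical names introduced only for illustration. The sketch expands a list of cells into a grid and shows how an under-detected row_span leaves blank cells in the rows below.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Hypothetical cell record, used only for this illustration."""
    text: str
    row: int           # index of the first row the cell occupies
    col: int           # column index
    row_span: int = 1  # number of rows the cell covers


def to_grid(cells, num_rows, num_cols):
    """Expand cells into a num_rows x num_cols grid, repeating spanned text."""
    grid = [["" for _ in range(num_cols)] for _ in range(num_rows)]
    for cell in cells:
        last_row = min(cell.row + cell.row_span, num_rows)
        for r in range(cell.row, last_row):
            grid[r][cell.col] = cell.text
    return grid


# A 3x2 table in which "Group A" should cover all three rows.
expected = [Cell("Group A", 0, 0, row_span=3),
            Cell("x", 0, 1), Cell("y", 1, 1), Cell("z", 2, 1)]
# The same table with the span under-detected as 1.
detected = [Cell("Group A", 0, 0, row_span=1),
            Cell("x", 0, 1), Cell("y", 1, 1), Cell("z", 2, 1)]

expected_grid = to_grid(expected, 3, 2)
detected_grid = to_grid(detected, 3, 2)

for row in detected_grid:
    print(row)
```

With the correct span, every row carries "Group A" in the first column; with the under-detected span, only the first row does and the rest are empty strings.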

To address this, I tested with both do_cell_matching=True and do_cell_matching=False, and with both the DoclingParseDocumentBackend and the DoclingParseV2DocumentBackend. Unfortunately, none of these combinations yielded the correct row_span values or resolved the merging issue.
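For reference, a setup like the one described could be sketched roughly as follows. This is an assumption-laden sketch, not the exact code used in this report: `build_converter` is a hypothetical helper name, and the import paths reflect the Docling 2.x releases discussed in this thread, so they may differ in other versions.

```python
def build_converter(do_cell_matching: bool = True):
    """Build a PDF converter with table structure enabled (hypothetical helper)."""
    # Imported lazily so this module still loads where docling is not installed.
    # Module paths are assumed from the Docling 2.x releases; verify for your version.
    from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions(do_table_structure=True)
    # Toggle whether predicted table cells are matched back to PDF text cells.
    pipeline_options.table_structure_options.do_cell_matching = do_cell_matching

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=DoclingParseV2DocumentBackend,
            )
        }
    )


# Example usage (requires docling and a local PDF):
# result = build_converter(do_cell_matching=False).convert("sample.pdf")
# doc_dict = result.document.export_to_dict()
```

Swapping the `backend` argument for another backend class would cover the other configuration mentioned above.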

Attached are the following files for reference:

  • Sample PDF document with merged rows
  • Extracted output demonstrating the issue
  • Expected output showing the correct row_span values and row merges that Docling was unable to achieve

Attachments: sample.pdf, extraction_output.csv, expected_output.csv

Thank you very much for your efforts on this project.

MahmoudAtef999 avatar Nov 02 '24 18:11 MahmoudAtef999

me too

DucHungGithub avatar Nov 07 '24 08:11 DucHungGithub

@MahmoudAtef999 thanks, I can reproduce this issue and will investigate further. The expectation should be that row spans are detected correctly here.

On a side note, the source of truth is the representation in DoclingDocument (or JSON), which you can obtain with the export_to_dict() method.
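A rough way to check the detected spans in that dictionary is sketched below. The key names (`tables`, `data`, `table_cells`, `row_span`, `text`) are assumptions about the exported layout and should be verified against your own output; `find_row_spans` is a hypothetical helper, and the sample dictionary is hand-written rather than real Docling output.

```python
def find_row_spans(doc_dict):
    """Return (table_index, cell_text, row_span) for every cell spanning > 1 row.

    Assumes each table stores its cells under table["data"]["table_cells"]
    and that each cell carries a "row_span" integer.
    """
    spans = []
    for t_idx, table in enumerate(doc_dict.get("tables", [])):
        for cell in table.get("data", {}).get("table_cells", []):
            if cell.get("row_span", 1) > 1:
                spans.append((t_idx, cell.get("text", ""), cell["row_span"]))
    return spans


# Hand-written stand-in for the result of export_to_dict().
sample = {
    "tables": [
        {
            "data": {
                "table_cells": [
                    {"text": "Group A", "row_span": 3, "col_span": 1},
                    {"text": "x", "row_span": 1, "col_span": 1},
                ]
            }
        }
    ]
}

spans = find_row_spans(sample)
print(spans)
```

If a table that visibly contains merged rows yields no entries here, the spans were lost at the model stage and not during the CSV or HTML export.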

cau-git avatar Nov 11 '24 10:11 cau-git

@cau-git Thanks for your response. I've used DoclingDocument to extract the tables and converted them to both CSV and HTML formats. I also tried converting the entire file to JSON. However, the issue persists in both cases. I would appreciate any further guidance or steps I may have missed in troubleshooting.

MahmoudAtef4499 avatar Nov 13 '24 09:11 MahmoudAtef4499

Hello @cau-git, is there any update?

MahmoudAtef999 avatar Dec 27 '24 16:12 MahmoudAtef999

@MahmoudAtef999 We are in the process of re-training the table model, and your sample will act as a test case. There will be a future release that improves on the accuracy of row-spans.

I will close this issue for now until we have further news or the issue re-appears after the new model is out.

cau-git avatar Jan 30 '25 14:01 cau-git

@MahmoudAtef999 Please try Docling v2.26.0. We updated the table model with new weights, which should address your issue.

maxmnemonic avatar Mar 17 '25 09:03 maxmnemonic

Hello @maxmnemonic,

The issue has been resolved for some tables; however, a large percentage of files still show the same problem.

I have attached a sample file that still exhibits this issue.

I am using:

  • Python 3.12
  • docling 2.28.0
  • DoclingParseV4DocumentBackend

Thank you very much for your efforts and support.

Attachments: result.csv, sample.pdf

MahmoudAtef4499 avatar Mar 20 '25 11:03 MahmoudAtef4499

Thanks for the examples, @MahmoudAtef4499!

maxmnemonic avatar Mar 21 '25 10:03 maxmnemonic

@maxmnemonic Is there any update?

MahmoudAtef4499 avatar May 29 '25 09:05 MahmoudAtef4499