docling
Issue with Extracting Tables with Merged Rows
Hello,
I’m encountering an issue when extracting tables containing merged rows. Specifically, when a cell spans multiple rows, the expected behavior is to assign it a row_span value greater than 1. However, in many cases, the extraction process fails to identify the correct row_span value, often assigning a lower value than the actual span. This results in blank cells appearing in the subsequent rows rather than merging as intended.
To address this, I tested with both do_cell_matching=True and do_cell_matching=False settings, and tried using both the DoclingParseDocumentBackend and DoclingParseV2DocumentBackend options. Unfortunately, neither approach yielded the correct row_span values or resolved the merging issue.
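For anyone trying to reproduce this, the combinations described above can be set up roughly as follows. This is a sketch, not the exact script used in the report: the import paths and option names are the ones I recall from docling 2.x and may differ in other versions, and `build_converter` is a hypothetical helper name.

```python
def build_converter(do_cell_matching: bool, use_v2_backend: bool = True):
    """Build a PDF converter with the table options discussed in this issue.

    Imports are done inside the function so the sketch can be loaded even
    when docling is not installed. Module paths are assumed from docling 2.x.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    if use_v2_backend:
        from docling.backend.docling_parse_v2_backend import (
            DoclingParseV2DocumentBackend as Backend,
        )
    else:
        from docling.backend.docling_parse_backend import (
            DoclingParseDocumentBackend as Backend,
        )

    # Enable table structure recognition and toggle cell matching.
    pipeline_options = PdfPipelineOptions(do_table_structure=True)
    pipeline_options.table_structure_options.do_cell_matching = do_cell_matching

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=Backend,
            )
        }
    )
```

Usage would look like `build_converter(do_cell_matching=False).convert("sample.pdf")`, repeated for each of the four setting/backend combinations.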
Attached are the following files for reference:
- Sample PDF document with merged rows
- Extracted output demonstrating the issue
- Expected output showing the correct row_span values and row merges that Docling was unable to achieve
Attachments sample.pdf
Thank you very much for your efforts on this project.
me too
@MahmoudAtef999 thanks, I can reproduce this issue and will investigate further. The expectation should be that row spans are detected correctly here.
On a side note, the source of truth is the representation in DoclingDocument (or its JSON form), which you can obtain with the export_to_dict() method.
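To check the detected spans directly in that representation rather than in CSV or HTML exports, something like the sketch below may help. It assumes the dictionary has a top-level `tables` list whose entries hold cells under `data.table_cells`, each with `row_span`, `start_row_offset_idx`, `start_col_offset_idx`, and `text` fields, which is how I understand the docling-core schema. `spanning_cells` is a hypothetical helper, and the `example` dictionary is a hand-written stand-in, not real Docling output.

```python
from typing import Any


def spanning_cells(doc_dict: dict[str, Any]) -> list[tuple[int, int, int, int, str]]:
    """List (table_index, start_row, start_col, row_span, text) for every
    cell whose row_span is greater than 1."""
    found = []
    for t_idx, table in enumerate(doc_dict.get("tables", [])):
        for cell in table.get("data", {}).get("table_cells", []):
            span = cell.get("row_span", 1)
            if span > 1:
                found.append(
                    (
                        t_idx,
                        cell.get("start_row_offset_idx"),
                        cell.get("start_col_offset_idx"),
                        span,
                        cell.get("text", ""),
                    )
                )
    return found


# Hand-written stand-in for the output of export_to_dict() (assumed shape).
example = {
    "tables": [
        {
            "data": {
                "table_cells": [
                    {"text": "Group A", "row_span": 2, "col_span": 1,
                     "start_row_offset_idx": 0, "start_col_offset_idx": 0},
                    {"text": "x", "row_span": 1, "col_span": 1,
                     "start_row_offset_idx": 0, "start_col_offset_idx": 1},
                    {"text": "y", "row_span": 1, "col_span": 1,
                     "start_row_offset_idx": 1, "start_col_offset_idx": 1},
                ]
            }
        }
    ]
}

merged = spanning_cells(example)
```

With a real document, `spanning_cells(result.document.export_to_dict())` would show which merged cells the model actually detected, so they can be compared against the expected spans in the PDF.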
@cau-git Thanks for your response. I've used DoclingDocument to extract the tables and converted them to both CSV and HTML formats. I also tried converting the entire file to JSON. However, the issue persists in both cases. I would appreciate any further guidance or steps I may have missed in troubleshooting.
Hello @cau-git, is there any update?
@MahmoudAtef999 We are in the process of re-training the table model, and your sample will act as a test case. There will be a future release that improves on the accuracy of row-spans.
I will close this issue for now until we have further news or the issue re-appears after the new model is out.
@MahmoudAtef999 Please try with Docling v2.26.0. We updated the table model with new weights, which should address your issue.
Hello @maxmnemonic ,
The issue has been resolved for some tables; however, a large percentage of files are still encountering the same problem.
I have attached a sample file that continues to face this issue.
I am using:
- Python 3.12
- docling 2.28.0
- DoclingParseV4DocumentBackend
Thank you very much for your efforts and support.
Thanks for the examples @MahmoudAtef4499 !
@maxmnemonic Is there any update?