feat: DOCXToDocument: add table extraction
Why:
Enhances functionality for converting DOCX documents by improving the extraction of document elements, including tables, while maintaining page breaks. This addresses limitations in accurately capturing the structured content of DOCX files for further processing.
- fixes https://github.com/deepset-ai/haystack/issues/8416
What:
- Introduced _extract_elements, which consolidates the extraction of paragraphs and tables from a DOCX file.
- Refactored existing methods to support the new extraction logic, allowing for better handling of page breaks and table markdown representation.
- Updated test cases to validate the correct functionality of document conversion involving tables and ensure meta information is retained accurately.
- Existing unit tests were left unmodified to confirm that previous behavior is unchanged.
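To illustrate the idea of a single pass that yields paragraphs and tables in document order, here is a minimal sketch. It is not the code from this PR: it reads the raw WordprocessingML with the standard library only, and the names iter_body_elements and _text_of are hypothetical helpers made up for this example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text_of(element):
    """Concatenate the text of every w:t run below `element` (hypothetical helper)."""
    return "".join(t.text or "" for t in element.iter(f"{W}t"))


def iter_body_elements(document_xml):
    """Yield ("paragraph", text) or ("table", rows) in document order.

    Illustrative only: walks the direct children of w:body so that text
    before and after a table keeps its original position.
    """
    body = ET.fromstring(document_xml).find(f"{W}body")
    for child in body:
        if child.tag == f"{W}p":
            yield "paragraph", _text_of(child)
        elif child.tag == f"{W}tbl":
            rows = [
                [_text_of(cell) for cell in row.findall(f"{W}tc")]
                for row in child.findall(f"{W}tr")
            ]
            yield "table", rows
        # Other children (for example section properties) are skipped.
```

Walking the body's direct children, rather than collecting all paragraphs and then all tables, is what keeps the surrounding text in the right place relative to each table.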
How can it be used:
- The new implementation provides a way to extract both text and tables from DOCX documents efficiently:
docx_converter.run(sources=paths)
- The extracted content presents both paragraphs and tables formatted in markdown, preserving the original flow and structure of the document:
| This | Is | Just a |
| ---- | ------ | ------ |
| 2020 | Random | Table |
- The Markdown table format was chosen because it is generally considered well suited to representing tables for LLMs (open to other options)
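As a rough sketch of how extracted rows could be turned into the Markdown shown above: rows_to_markdown is a hypothetical helper written for this example, not a function from the PR, and it assumes the first row is the header. The separator row uses a fixed `---` per column, so dash widths differ from the sample output.

```python
def rows_to_markdown(rows):
    """Render a list of rows (lists of cell strings) as a Markdown table.

    Assumption for this sketch: the first row is the header. Pipes inside
    cells are escaped so they do not split a cell in two.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        cells = [str(cell).replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))  # pad short rows to full width
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), separator, *map(fmt, body)])


print(rows_to_markdown([["This", "Is", "Just a"], ["2020", "Random", "Table"]]))
```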
How did you test it:
- Conducted unit tests to verify the core functionality of the DOCX-to-document conversion mechanism. This included:
- Validating document content extraction with tables.
- Checking that all necessary metadata attributes are preserved.
- Additional tests ensure that extracted content maintains the original order, especially around tables, confirming that text before and after remains intact.
Notes for the reviewer:
- Focus on the modifications in the extraction logic
- Check the updated test cases involving mixed content (text and tables) within DOCX files.
- Review the markdown conversion accuracy, especially for tables
Pull Request Test Coverage Report for Build 11518638027
Warning: This coverage report may be inaccurate.
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
- For more information on this, see Tracking coverage changes with pull request builds.
- To avoid this issue with future PRs, see these Recommended CI Configurations.
- For a quick fix, rebase this PR at GitHub. Your next report should be accurate.
Details
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 1 unchanged line in 1 file lost coverage.
- Overall coverage increased (+0.1%) to 90.59%
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| components/routers/file_type_router.py | 1 | 98.36% |
| Total: | 1 | |
| Totals | |
|---|---|
| Change from base Build 11463116725: | 0.1% |
| Covered Lines: | 7615 |
| Relevant Lines: | 8406 |
💛 - Coveralls
Perhaps not 100% there yet but let's start iterating @sjrl and @medsriha
@medsriha any updates on this? Have you tried it out?
Not yet :-( a bit busy with other stuff. Likely to start working on this early next week.
Ok, thanks a lot @medsriha - let's hear from @sjrl - I read somewhere md table format is a preferred format for table input so I didn't bother with csv, wdyt?
@vblagoje I think we should make it configurable so let the user choose between md and csv. We have found that LLMs can work well with both with maybe a bit more consistency on csv since there are many different md format versions and not all md versions appear to work well.
Ok, deal @sjrl I'll add option to create table as csv, add unit tests and ping you for the final review 🙏
@sjrl @medsriha this one should be ready to go now with both csv and markdown table output support configurable via init parameter. LMK your thoughts.
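For readers following the thread, a minimal sketch of what a format switch set at init time might look like. TableFormat and TableRenderer are hypothetical names for illustration, not the identifiers used in this PR, and the default below is chosen arbitrarily.

```python
import csv
import io
from enum import Enum


class TableFormat(Enum):  # hypothetical name, for illustration only
    MARKDOWN = "markdown"
    CSV = "csv"


class TableRenderer:  # hypothetical stand-in for the converter
    def __init__(self, table_format=TableFormat.MARKDOWN):  # default is arbitrary here
        # Accept either the enum member or its string value.
        self.table_format = TableFormat(table_format)

    def render(self, rows):
        """Render rows (lists of cell strings) in the configured format."""
        if not rows:
            return ""
        if self.table_format is TableFormat.CSV:
            buffer = io.StringIO()
            csv.writer(buffer, lineterminator="\n").writerows(rows)
            return buffer.getvalue()
        header, *body = rows
        lines = [header, ["---"] * len(header), *body]
        return "\n".join("| " + " | ".join(line) + " |" for line in lines)


rows = [["This", "Is", "Just a"], ["2020", "Random", "Table"]]
print(TableRenderer("csv").render(rows))
print(TableRenderer("markdown").render(rows))
```

Delegating CSV output to the standard csv module means quoting of commas and embedded quotes is handled by the library rather than by hand.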
@sjrl @shadeMe I rolled back to previous commit and then added the last two commits.
@vblagoje Please don't force-push once reviews have been published - it breaks the reviewer's ability to diff between commits since their last review.