
feat: DOCXToDocument: add table extraction

Open vblagoje opened this issue 1 year ago • 4 comments

Why:

Enhances functionality for converting DOCX documents by improving the extraction of document elements, including tables, while maintaining page breaks. This addresses limitations in accurately capturing the structured content of DOCX files for further processing.

  • fixes https://github.com/deepset-ai/haystack/issues/8416

What:

  • Introduced _extract_elements which consolidates the extraction of paragraphs and tables from a DOCX file.
  • Refactored existing methods to support the new extraction logic, allowing for better handling of page breaks and table markdown representation.
  • Updated test cases to validate the correct functionality of document conversion involving tables and ensure meta information is retained accurately.
  • Existing unit tests are left unmodified to confirm that previous behavior is unchanged.
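
The idea of extracting paragraphs and tables in a single pass can be illustrated with the standard library alone. This is not the PR's `_extract_elements`. It is a minimal sketch that assumes the contents of `word/document.xml` are already available as a string, and the helper names `iter_body_elements` and `_text` are hypothetical. Walking only the direct children of the body keeps paragraphs and tables in their original document order:

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a DOCX archive.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _text(node):
    """Concatenate the text of all w:t runs below a node."""
    return "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))


def iter_body_elements(document_xml):
    """Yield ("paragraph", str) or ("table", rows) in document order (hypothetical helper)."""
    body = ET.fromstring(document_xml).find(f"{{{W}}}body")
    for child in body:  # direct children only, so nested cell paragraphs are not repeated
        if child.tag == f"{{{W}}}p":
            yield "paragraph", _text(child)
        elif child.tag == f"{{{W}}}tbl":
            rows = [
                [_text(cell) for cell in row.findall(f"{{{W}}}tc")]
                for row in child.findall(f"{{{W}}}tr")
            ]
            yield "table", rows


# Tiny hand-written sample: a paragraph, a one-row table, another paragraph.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:t>Before</w:t></w:r></w:p>"
    "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
    "<w:p><w:r><w:t>After</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

elements = list(iter_body_elements(SAMPLE))
```

Here `elements` contains the "Before" paragraph, then the table rows, then the "After" paragraph, which is the ordering property the updated tests are meant to check. Page-break handling is deliberately left out of this sketch.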

How can it be used:

  • The new implementation provides a way to extract both text and tables from DOCX documents efficiently:
docx_converter.run(sources=paths)
  • The extracted content presents both paragraphs and tables formatted in markdown, preserving the original flow and structure of the document:
| This | Is     | Just a |
| ---- | ------ | ------ |
| 2020 | Random | Table  |
  • Markdown table format was chosen because it is reported to work well as LLM table input (open to other options).
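
For illustration, the padded table shown above can be produced from a list of rows with a few lines of Python. This is a sketch under assumptions, not the converter's actual code: the function name `rows_to_markdown` is hypothetical, the first row is treated as the header, and pipe characters inside cells are not escaped.

```python
def rows_to_markdown(rows):
    """Render a list of rows (first row is the header) as a padded Markdown table."""
    if not rows:
        return ""
    n_cols = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [list(row) + [""] * (n_cols - len(row)) for row in rows]
    # Column width is the longest cell, with a minimum of 3 for the "---" separator.
    widths = [max(3, *(len(row[i]) for row in padded)) for i in range(n_cols)]

    def fmt(cells):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"

    header, *body = padded
    separator = "| " + " | ".join("-" * w for w in widths) + " |"
    return "\n".join([fmt(header), separator, *map(fmt, body)])


table = rows_to_markdown([["This", "Is", "Just a"], ["2020", "Random", "Table"]])
print(table)
```

Padding the columns is purely cosmetic. An unpadded table is equally valid Markdown and uses fewer tokens, so which variant to emit is a design choice.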

How did you test it:

  • Conducted unit tests to verify the core functionality of the DOCX-to-document conversion mechanism. This included:
    • Validating document content extraction with tables.
    • Checking that all necessary metadata attributes are preserved.
  • Additional tests ensure that extracted content maintains the original order, especially around tables, confirming that text before and after remains intact.

Notes for the reviewer:

  • Focus on the modifications in the extraction logic
  • Check the updated test cases involving mixed content (text and tables) within DOCX files.
  • Review the markdown conversion accuracy, especially for tables

vblagoje avatar Oct 15 '24 12:10 vblagoje

Pull Request Test Coverage Report for Build 11518638027

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 90.59%

| Files with Coverage Reduction          | New Missed Lines | %      |
| -------------------------------------- | ---------------- | ------ |
| components/routers/file_type_router.py | 1                | 98.36% |

Totals

  • Change from base Build 11463116725: 0.1%
  • Covered Lines: 7615
  • Relevant Lines: 8406

💛 - Coveralls

coveralls avatar Oct 15 '24 13:10 coveralls

Perhaps not 100% there yet but let's start iterating @sjrl and @medsriha

vblagoje avatar Oct 15 '24 15:10 vblagoje

@medsriha any updates on this? Have you tried it out?

vblagoje avatar Oct 17 '24 12:10 vblagoje

> @medsriha any updates on this? Have you tried it out?

Not yet :-( a bit busy with other stuff. Likely to start working on this early next week.

medsriha avatar Oct 17 '24 13:10 medsriha

Ok, thanks a lot @medsriha - let's hear from @sjrl - I read somewhere md table format is a preferred format for table input so I didn't bother with csv, wdyt?

vblagoje avatar Oct 21 '24 07:10 vblagoje

> Ok, thanks a lot @medsriha - let's hear from @sjrl - I read somewhere md table format is a preferred format for table input so I didn't bother with csv, wdyt?

@vblagoje I think we should make it configurable and let the user choose between md and csv. We have found that LLMs can work well with both, with maybe a bit more consistency on csv, since there are many different md table variants and not all of them appear to work well.

sjrl avatar Oct 21 '24 07:10 sjrl

Ok, deal @sjrl, I'll add an option to output tables as csv, add unit tests, and ping you for the final review 🙏

vblagoje avatar Oct 21 '24 08:10 vblagoje

@sjrl @medsriha this one should be ready to go now with both csv and markdown table output support configurable via init parameter. LMK your thoughts.

vblagoje avatar Oct 21 '24 09:10 vblagoje
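
As a rough illustration of the configurable output discussed in the comments above, a single switch can select between CSV and Markdown serialization of the same rows. The thread does not name the init parameter, so `table_format`, its string values, and the function `render_table` are all hypothetical here. The sketch uses only the standard library's `csv` module:

```python
import csv
import io


def render_table(rows, table_format="markdown"):
    """Serialize table rows as "markdown" or "csv" (hypothetical option values)."""
    if table_format == "csv":
        buffer = io.StringIO()
        # lineterminator="\n" avoids the csv module's default "\r\n" line endings.
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        return buffer.getvalue()
    if table_format == "markdown":
        header, *body = rows
        lines = ["| " + " | ".join(header) + " |"]
        lines.append("| " + " | ".join("---" for _ in header) + " |")
        lines.extend("| " + " | ".join(row) + " |" for row in body)
        return "\n".join(lines)
    raise ValueError(f"Unsupported table format: {table_format!r}")


rows = [["This", "Is", "Just a"], ["2020", "Random", "Table"]]
as_csv = render_table(rows, "csv")
as_markdown = render_table(rows, "markdown")
```

Delegating to `csv.writer` means cells containing commas or quotes are quoted correctly, which is one reason CSV output can be more uniform than hand-built Markdown.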

@sjrl @shadeMe I rolled back to previous commit and then added the last two commits.

vblagoje avatar Oct 24 '24 07:10 vblagoje

@vblagoje Please don't force-push once reviews have been published - it breaks the reviewer's ability to diff between commits since their last review.

shadeMe avatar Oct 24 '24 09:10 shadeMe