haystack feat: DOCXToDocument: add table extraction

Why:

Enhances functionality for converting DOCX documents by improving the extraction of document elements, including tables, while maintaining page breaks. This addresses limitations in accurately capturing the structured content of DOCX files for further processing.

fixes https://github.com/deepset-ai/haystack/issues/8416

What:

Introduced _extract_elements which consolidates the extraction of paragraphs and tables from a DOCX file.
Refactored existing methods to support the new extraction logic, allowing for better handling of page breaks and table markdown representation.
Updated test cases to validate the correct functionality of document conversion involving tables and ensure meta information is retained accurately.
Existing unit tests were left unmodified to confirm that previous behavior is unchanged.
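To illustrate the idea of consolidating paragraphs and tables in document order, here is a minimal sketch. It is not the code in this PR: the `(kind, content)` element model and the `join_elements` helper are hypothetical, and using a form feed (`\f`) as the page-break marker is an assumption.

```python
from typing import List, Tuple

# Hypothetical element model (not the PR's data structure):
# ("paragraph", text), ("table", rendered_table) or ("page_break", "").
Element = Tuple[str, str]


def join_elements(elements: List[Element]) -> str:
    """Concatenate elements in document order, keeping page breaks."""
    parts: List[str] = []
    for kind, content in elements:
        if kind == "page_break":
            # Assumption: a form feed marks a page boundary.
            parts.append("\f")
        elif kind == "table":
            # Blank lines keep the table separate from surrounding paragraphs.
            parts.append("\n" + content + "\n")
        elif kind == "paragraph":
            parts.append(content)
        else:
            raise ValueError(f"unknown element kind: {kind!r}")
    return "\n".join(parts)


if __name__ == "__main__":
    doc = [
        ("paragraph", "Text before the table."),
        ("table", "| a   |\n| --- |\n| 1   |"),
        ("paragraph", "Text after the table."),
        ("page_break", ""),
        ("paragraph", "Second page."),
    ]
    print(join_elements(doc))
```

Walking a single ordered list, rather than collecting paragraphs and tables separately, is what keeps the text before and after a table in its original position.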

How can it be used:

The new implementation provides a way to extract both text and tables from DOCX documents efficiently:

docx_converter.run(sources=paths)

The extracted content presents both paragraphs and tables formatted in markdown, preserving the original flow and structure of the document:

| This | Is     | Just a |
| ---- | ------ | ------ |
| 2020 | Random | Table  |

Markdown was chosen as the table format because it is reportedly well suited to representing tables for LLMs (open to other options).
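For reference, a table like the one above could be produced from rows of cell text roughly as follows. This is a sketch only, not the converter's implementation: `table_to_markdown` is a hypothetical helper, and treating the first row as the header is an assumption.

```python
from typing import List


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows of cell text as a pipe table; the first row is the header."""
    if not rows:
        return ""
    # Pad short rows so every row has the same number of columns.
    n_cols = max(len(row) for row in rows)
    padded = [list(row) + [""] * (n_cols - len(row)) for row in rows]
    # Escape pipes and flatten newlines so a cell cannot break the layout.
    cleaned = [
        [cell.replace("|", "\\|").replace("\n", " ").strip() for cell in row]
        for row in padded
    ]
    # Column width is the widest cell, with a minimum of 3 for the "---" rule.
    widths = [max(3, *(len(row[i]) for row in cleaned)) for i in range(n_cols)]

    def fmt(cells: List[str]) -> str:
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"

    header, *body = cleaned
    separator = fmt(["-" * w for w in widths])
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])


if __name__ == "__main__":
    print(table_to_markdown([["This", "Is", "Just a"], ["2020", "Random", "Table"]]))
```

Escaping `|` inside cells matters because an unescaped pipe would be read as a column boundary and shift every cell after it.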

How did you test it:

Conducted unit tests to verify the core functionality of the DOCX-to-document conversion mechanism. This included:
- Validating document content extraction with tables.
- Checking that all necessary metadata attributes are preserved.
Additional tests ensure that extracted content maintains the original order, especially around tables, confirming that text before and after remains intact.

Notes for the reviewer:

Focus on the modifications in the extraction logic
Check the updated test cases involving mixed content (text and tables) within DOCX files.
Review the markdown conversion accuracy, especially for tables

Oct 15 '24 12:10 vblagoje

Pull Request Test Coverage Report for Build 11518638027

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage increased (+0.1%) to 90.59%

| Files with Coverage Reduction          | New Missed Lines | %      |
| -------------------------------------- | ---------------- | ------ |
| components/routers/file_type_router.py | 1                | 98.36% |
| Total:                                 | 1                |        |

Totals
Change from base Build 11463116725:	0.1%
Covered Lines:	7615
Relevant Lines:	8406

💛 - Coveralls

Oct 15 '24 13:10 coveralls

Perhaps not 100% there yet but let's start iterating @sjrl and @medsriha

Oct 15 '24 15:10 vblagoje

@medsriha any updates on this? Have you tried it out?

Oct 17 '24 12:10 vblagoje

> @medsriha any updates on this? Have you tried it out?

Not yet :-( a bit busy with other stuff. Likely to start working on this early next week.

Oct 17 '24 13:10 medsriha

Ok, thanks a lot @medsriha - let's hear from @sjrl - I read somewhere md table format is a preferred format for table input so I didn't bother with csv, wdyt?

Oct 21 '24 07:10 vblagoje

> Ok, thanks a lot @medsriha - let's hear from @sjrl - I read somewhere md table format is a preferred format for table input so I didn't bother with csv, wdyt?

@vblagoje I think we should make it configurable and let the user choose between md and csv. We have found that LLMs can work well with both, with maybe a bit more consistency on csv, since there are many different md format versions and not all of them appear to work well.

Oct 21 '24 07:10 sjrl

Ok, deal @sjrl, I'll add an option to output tables as csv, add unit tests, and ping you for the final review 🙏

Oct 21 '24 08:10 vblagoje

@sjrl @medsriha this one should be ready to go now with both csv and markdown table output support configurable via init parameter. LMK your thoughts.

Oct 21 '24 09:10 vblagoje

@sjrl @shadeMe I rolled back to the previous commit and then added the last two commits.

Oct 24 '24 07:10 vblagoje

@vblagoje Please don't force-push once reviews have been published - it breaks the reviewer's ability to diff between commits since their last review.

Oct 24 '24 09:10 shadeMe