Perform context-aware text splitting

Open zh25714 opened this issue 1 year ago • 4 comments

Is your feature request related to a problem? Please describe.

  1. I want to do retrieval-based QA on a docx document that contains both text and tables. How should I process the document, split it into chunks, and index them? Should text and tables be separated and processed independently?
  2. The document has section titles, such as SectionA. When splitting, I would like to keep the text of SectionA in the same chunk as far as possible, so that the retrieved passages stay semantically complete. Is this necessary?

Describe the solution you'd like

Maybe the context-aware text splitting in langchain would meet my needs, but I can't seem to find an equivalent feature in Haystack. https://python.langchain.com/docs/use_cases/question_answering/how_to/document-context-aware-QA

Describe alternatives you've considered

Here is what I found in related discussions in the project: https://github.com/deepset-ai/haystack/discussions/1596 https://github.com/deepset-ai/haystack/discussions/4467

zh25714 · Jul 28 '23 13:07

Hello zh25714, I looked into this issue a bit and found this repo: https://github.com/allenai/mmda

It works... sometimes. It is still at a very early stage, and there are too many failures that prevent a document from being processed. Look at this example: https://github.com/allenai/mmda/tree/main/examples/vila_for_scidoc_parsing

It uses https://github.com/allenai/vila/tree/main

Here's a random result I just got from the demo: https://github.com/allenai/vila/tree/main/examples/end2end-sci-pdf-parsing [image]

PAHXO · Sep 02 '23 16:09

Hey, while this feature would be very useful, it is also very hard to tackle in general, since you would need a proper document layout understanding engine for all file types.

The langchain example you linked also just relies on MarkdownHeaderTextSplitter, which works because Markdown includes the text layout by design. I think this would be a good place to start.
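
For illustration, header-based Markdown splitting can be sketched with the standard library alone. This is neither langchain's MarkdownHeaderTextSplitter nor an existing Haystack component; the function name `split_markdown_by_headers` and the chunk layout (`headers`, `content`) are assumptions made for this example, and lines starting with `#` inside fenced code blocks are not handled specially.

```python
import re
from typing import Dict, List

# ATX-style Markdown headers: one to six '#' characters, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_headers(text: str) -> List[Dict]:
    """Split Markdown into chunks, each tagged with its chain of parent headers.

    Hypothetical helper for illustration only.
    """
    chunks: List[Dict] = []
    path = []  # stack of (level, title) for the headers currently in scope
    body: List[str] = []

    def flush() -> None:
        # Emit the text collected under the current header path, if any.
        content = "\n".join(body).strip()
        if content:
            chunks.append({"headers": [title for _, title in path], "content": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes every header at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps the titles of its enclosing sections, so the text under SectionA stays together and the section title can be stored as metadata when indexing.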

Do you want to rephrase the issue to include a markdown-based context-aware splitter? Would you like to work on such a splitter yourself?

Timoeller · Sep 29 '23 10:09

In fact, I only need to do similar parsing on a Word document that has a consistent structure, similar to Markdown. You are right that understanding the layout of all document types would require an intelligent engine, which is difficult.
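
For a Word document with a consistent structure, one possible approach is to group paragraphs under the most recent heading. The sketch below assumes the paragraphs have already been extracted as `(style_name, text)` pairs and that heading styles are named `Heading 1`, `Heading 2`, and so on, as in Word's default English templates. The function name `group_by_heading` is hypothetical, and tables are not covered.

```python
from typing import Dict, Iterable, List, Tuple


def group_by_heading(paragraphs: Iterable[Tuple[str, str]]) -> List[Dict]:
    """Group (style_name, text) pairs into one section per heading.

    Hypothetical helper. With python-docx installed, the pairs could be built as
        [(p.style.name, p.text) for p in docx.Document(path).paragraphs]
    Tables are not part of `.paragraphs` and would need separate handling.
    """
    sections: List[Dict] = []
    title = None
    lines: List[str] = []

    def flush() -> None:
        # Store the section collected so far, skipping a leading empty one.
        if title is not None or lines:
            sections.append({"title": title, "content": "\n".join(lines)})

    for style_name, text in paragraphs:
        if style_name.startswith("Heading"):  # assumption: default style names
            flush()
            title, lines = text, []
        elif text.strip():
            lines.append(text)
    flush()
    return sections
```

Text that appears before the first heading ends up in a section whose title is `None`.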

zh25714 · Oct 22 '23 14:10

Hey @zh25714,

I don't know if this is still relevant for you, but there's a solution that might fit your needs.

https://unstructured-io.github.io/unstructured/core/chunking.html#id1

This approach allows splitting HTML documents by headings. It is not context-aware, but on most websites and documents you can assume that the context is separated by headings, so you get contextual chunks.
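
To make the idea concrete, here is a minimal standard-library sketch of splitting HTML at heading tags. It is not the unstructured library's implementation or API; the class `HeadingSplitter` and the function `split_html_by_headings` are hypothetical names, and the text of `script` and `style` elements is not filtered out.

```python
from html.parser import HTMLParser
from typing import Dict, List

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingSplitter(HTMLParser):
    """Collects the text that follows each heading into one chunk (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[Dict] = []
        self._in_heading = False
        self._title_parts: List[str] = []
        self._body_parts: List[str] = []
        self._current_title = None

    def flush(self) -> None:
        # Normalize whitespace and emit the chunk collected so far, if non-empty.
        content = " ".join(" ".join(self._body_parts).split())
        if content:
            self.chunks.append({"title": self._current_title, "content": content})
        self._body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.flush()  # a new heading closes the previous chunk
            self._in_heading = True
            self._title_parts = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            self._current_title = " ".join("".join(self._title_parts).split())

    def handle_data(self, data):
        target = self._title_parts if self._in_heading else self._body_parts
        target.append(data)


def split_html_by_headings(html: str) -> List[Dict]:
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()  # process any buffered text
    parser.flush()  # emit the final chunk
    return parser.chunks
```

Every chunk carries the title of the heading it appeared under, which can be indexed as metadata alongside the chunk text.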

Cheers

Julian-AT · Apr 12 '24 07:04