haystack
Perform context-aware text splitting
Is your feature request related to a problem? Please describe.
- I want to do retrieval-based QA on a docx document that contains both text and tables. How should I process the document, split it into chunks, and index them? Should text and tables be separated and processed independently?
- The document has section titles, such as SectionA. When splitting, I would like to keep the text under SectionA in the same chunk as far as possible, so that the semantics stay complete during retrieval. Is this necessary?
Describe the solution you'd like
Maybe LangChain's context-aware text splitting would meet my needs? I can't seem to find that feature in Haystack. https://python.langchain.com/docs/use_cases/question_answering/how_to/document-context-aware-QA
Describe alternatives you've considered
Here is what I found in related discussions in this project: https://github.com/deepset-ai/haystack/discussions/1596 https://github.com/deepset-ai/haystack/discussions/4467
Hello zh25714, I looked into this issue a bit and found this repo: https://github.com/allenai/mmda
It works... sometimes. It is still at a very early stage, and there are far too many failures that prevent a document from being processed. Look at this example: https://github.com/allenai/mmda/tree/main/examples/vila_for_scidoc_parsing
It uses https://github.com/allenai/vila/tree/main
Here's a random result I just got from the demo: https://github.com/allenai/vila/tree/main/examples/end2end-sci-pdf-parsing
Hey, while the feature is very useful, it is also very hard to tackle for the general case, since you would need a proper document layout understanding engine for all file types.
The LangChain example you linked also only works through MarkdownHeaderTextSplitter, where the text layout is part of the format by design. I think a Markdown-based splitter would be a good first step.
Do you want to rephrase the issue to include a markdown-based context-aware splitter? Would you like to work on such a splitter yourself?
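To make the Markdown-based idea concrete, here is a minimal sketch in plain Python, not Haystack's or LangChain's implementation. The names `Chunk` and `split_markdown_by_headers` are hypothetical, and the sketch assumes ATX-style headers (`#` to `######`) and does not special-case `#` lines inside fenced code blocks.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# ATX-style Markdown header: 1-6 '#' characters, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    """Hypothetical container: chunk text plus the headers that enclose it."""
    headers: Dict[int, str]  # header level -> header title
    text: str


def split_markdown_by_headers(markdown: str) -> List[Chunk]:
    """Split Markdown so that each chunk holds the text under one header."""
    chunks: List[Chunk] = []
    current_headers: Dict[int, str] = {}
    buffer: List[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk, skipping empty sections.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk(headers=dict(current_headers), text=text))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes every header at the same or a deeper level.
            for stale in [k for k in current_headers if k >= level]:
                del current_headers[stale]
            current_headers[level] = match.group(2)
        else:
            buffer.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# SectionA\nintro\n## Sub\ndetail\n# SectionB\nother"
    for chunk in split_markdown_by_headers(doc):
        print(chunk.headers, repr(chunk.text))
```

Keeping the enclosing headers as metadata on each chunk means a retriever can still show or filter by the section a passage came from, even when a long section has to be split further by length.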
In fact, I only need this kind of parsing for Word documents, which have a consistent structure, similar to Markdown. You are right that understanding the layout of all document types may require an intelligent engine, which is troublesome.
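For a Word document with a consistent heading structure, a rough sketch using only the Python standard library could look like the following. It is not a Haystack feature. The function name `split_docx_by_headings` is hypothetical, and it assumes headings use paragraph styles whose IDs start with "Heading" (for example `Heading1`), which is not guaranteed for localized or custom templates. Tables are collected separately from text, per section.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict, List

# WordprocessingML namespace used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _paragraph_text(paragraph) -> str:
    """Concatenate all text runs of a paragraph element."""
    return "".join(t.text or "" for t in paragraph.iter(W + "t"))


def _is_heading(paragraph) -> bool:
    """Assumption: heading paragraphs have a style ID starting with 'Heading'."""
    style = paragraph.find(f"{W}pPr/{W}pStyle")
    if style is None:
        return False
    return style.get(W + "val", "").lower().startswith("heading")


def split_docx_by_headings(path_or_file) -> List[Dict]:
    """Group body paragraphs and tables under the heading that precedes them."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")

    sections: List[Dict] = []
    current = {"title": None, "text": [], "tables": []}
    for child in body:
        if child.tag == W + "p":
            text = _paragraph_text(child).strip()
            if text and _is_heading(child):
                # Close the previous section before starting a new one.
                if current["title"] or current["text"] or current["tables"]:
                    sections.append(current)
                current = {"title": text, "text": [], "tables": []}
            elif text:
                current["text"].append(text)
        elif child.tag == W + "tbl":
            # Keep each table as a list of rows, each row a list of cell strings.
            rows = [
                ["".join(_paragraph_text(p) for p in cell.iter(W + "p"))
                 for cell in row.findall(W + "tc")]
                for row in child.findall(W + "tr")
            ]
            current["tables"].append(rows)
    if current["title"] or current["text"] or current["tables"]:
        sections.append(current)
    return sections
```

Each returned section can then be indexed as one document, with its tables either serialized into the text or indexed as separate documents that carry the same section title as metadata.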
Hey @zh25714,
I don't know if this is still relevant for you, but there's a solution that might fit your needs.
https://unstructured-io.github.io/unstructured/core/chunking.html#id1
This approach allows splitting HTML documents by headings. It is not context-aware, but on most websites and documents you can assume that context is separated by headings, so you end up with contextual chunks.
Cheers