feat: Enhance DocumentSplitter to support semantic document splitting
**Is your feature request related to a problem? Please describe.**
Currently, the DocumentSplitter in Haystack is relatively basic, and semantic splitting has recently gained a lot of popularity.
For example, see Partitioning and Chunking in Unstructured.
Another example is the https://github.com/segment-any-text/wtpsplit package, which shows great results for sentence splitting across many languages. It could greatly improve the current sentence splitting in the DocumentSplitter, which simply splits on the period character.
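For concreteness, here's a minimal sketch of what model-based sentence splitting with wtpsplit could look like, following the `WtP` interface from the package's README (`wtp-bert-mini` is one of its published checkpoints; the example text is made up):

```python
# Sketch: sentence splitting with wtpsplit instead of naive period splits.
# Assumes `pip install wtpsplit`; "wtp-bert-mini" is one of the published checkpoints.
from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")

# A model predicts the boundaries, so abbreviations and missing terminal
# punctuation are handled better than splitting on "." would.
text = "Dr. Smith went to Washington. He arrived at 5 p.m. and left early"
sentences = wtp.split(text)
print(sentences)
```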
**Describe the solution you'd like**
It would be great to enhance Haystack's splitting/chunking strategies with these newer methods, which have been shown to boost the quality of RAG applications.
**Additional context**
I think doing some research to find popular libraries (e.g., via the Haystack Discord) would also be a good way to decide where to start.
@sjrl this task landed in my sprint. How about we implement https://x.com/JinaAI_/status/1826649439324254291? I've seen a lot of buzz on X about it, and it seems relatively straightforward to implement. LMK
Hey @vblagoje, that approach certainly looks interesting. There doesn't seem to be a standard implementation for it yet, so I wonder whether something like that deserves to be its own separate splitter component.
Also, FYI, I migrated the sentence splitting from v1 to v2 using the NLTK package in this custom component here. As part of this ticket, it could be good to bring that feature into Haystack to improve our sentence splitting capabilities.
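For reference, the tokenizer call the component builds on is roughly this (a sketch of the underlying NLTK API, not the component's actual code):

```python
# Punkt-based sentence tokenization via NLTK; the example text is made up.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the Punkt sentence models

text = "Haystack is an LLM framework. It ships a DocumentSplitter. This splits docs."
sentences = sent_tokenize(text, language="english")
print(sentences)
```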
OK, so what you're suggesting, @sjrl, is to implement a DeepsetDocumentSplitter for this ticket and then the enhanced semantic paragraph splitting via JinaAI in another issue?
I think the DeepsetDocumentSplitter should take priority, but it's up to you whether the enhanced semantic splitting should be done in a different issue / at a different time.
Yes, I agree: DeepsetDocumentSplitter first. The other can follow next sprint or so.
I'll leave this open because we haven't actually done semantic splitting yet; we completed https://github.com/deepset-ai/haystack/pull/8350 instead.
cc @julian-risch
Both LangChain and LlamaIndex have a semantic splitter based on embeddings (https://blog.lancedb.com/chunking-techniques-with-langchain-and-llamaindex/). From the code (https://github.com/langchain-ai/langchain-experimental/blob/main/libs/experimental/langchain_experimental/text_splitter.py), I see they split the text into sentences, embed each sentence, and compute the cosine distance between each sentence and the next; if the distance is low, the two sentences go into the same chunk.
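A minimal sketch of that idea, assuming sentence-transformers for the embeddings (the model name and the 0.3 threshold are illustrative choices, not values from the LangChain/LlamaIndex code):

```python
# Embedding-based chunking as described above: embed each sentence,
# compute the cosine distance between adjacent sentences, and start a
# new chunk whenever the distance jumps above a threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.3) -> list[list[str]]:
    if not sentences:
        return []
    # Unit-normalized embeddings make cosine similarity a plain dot product.
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks = [[sentences[0]]]
    for prev, curr, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        distance = 1.0 - float(np.dot(prev, curr))
        if distance > threshold:
            chunks.append([sentence])    # semantic break: start a new chunk
        else:
            chunks[-1].append(sentence)  # similar enough: extend current chunk
    return chunks
```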
What's the current status of this issue?
- It started with a proposal from @sjrl to enhance Haystack with more robust chunking methods (some of these methods are linked at the beginning of the issue)
- It then diverged into adding a component based on the NLTK sentence tokenization algorithm - already done here
- I then reviewed and added a few more chunking methods, and there's now this PR for the RecursiveChunking (a sketch of the recursive idea follows below)
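To illustrate the recursive idea (a simplified sketch of the general technique, not the PR's implementation; the separators and `max_length` are illustrative):

```python
# Recursive chunking: split on the coarsest separator first, and only
# recurse with finer separators on pieces that are still too long.
# Simplified: a real implementation would also merge small pieces back
# up to max_length and handle separator-less overlong text.
def recursive_chunks(text: str, max_length: int = 200,
                     separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    if len(text) <= max_length or not separators:
        return [text]
    head, *rest = separators
    chunks: list[str] = []
    for piece in text.split(head):
        if len(piece) <= max_length:
            chunks.append(piece)
        else:
            # Piece is still too long: retry with the next, finer separator.
            chunks.extend(recursive_chunks(piece, max_length, tuple(rest)))
    return chunks
```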
This issue seems to be more of a reference to different chunking methods than a specific request. I think it would be best to keep track of these methods on a Notion page and close this issue.
What do you think @sjrl @julian-risch ?
@davidsbatista yes closing this sounds good to me!