feat: Enhance DocumentSplitter to support semantic document splitting
**Is your feature request related to a problem? Please describe.**
Currently, the DocumentSplitter in Haystack is relatively basic, and semantic splitting has recently gained a lot of popularity.
For example, see Partitioning and Chunking in Unstructured.
Another example is the https://github.com/segment-any-text/wtpsplit package, which shows great results for sentence splitting across many languages. It could greatly improve the current sentence splitting in the DocumentSplitter, which simply splits on the period character.
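For concreteness, here's a minimal sketch of what model-based sentence splitting with wtpsplit could look like, following the `WtP` interface from the package's README (`wtp-bert-mini` is one of its published checkpoints; the example text is made up):

```python
# Sketch: sentence splitting with wtpsplit instead of naive period splits.
# Assumes `pip install wtpsplit`; "wtp-bert-mini" is one of the published checkpoints.
from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")

# A model predicts the boundaries, so abbreviations and missing terminal
# punctuation are handled better than splitting on "." would.
text = "Dr. Smith went to Washington. He arrived at 5 p.m. and left early"
sentences = wtp.split(text)
print(sentences)
```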
**Describe the solution you'd like**
It would be great to enhance Haystack's splitting/chunking strategies with these newer methods, which have been shown to boost the quality of RAG applications.
**Additional context**
I think doing some research to find popular libraries (e.g., via the Haystack Discord) would also be a good way to decide where to start.
@sjrl this task landed in my sprint. How about we implement https://x.com/JinaAI_/status/1826649439324254291? I've seen a lot of buzz on X about it, and it seems relatively straightforward to implement. LMK
Hey @vblagoje, that approach certainly looks interesting. There doesn't seem to be a standard implementation for it yet, so I wonder whether something like that deserves to be its own separate splitter component.
Also, FYI, I migrated the sentence splitting from v1 to v2 using the NLTK package in this custom component here. As part of this ticket, it could be good to bring that feature into Haystack to improve our sentence splitting capabilities.
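For reference, the tokenizer call the component builds on is roughly this (a sketch of the underlying NLTK API, not the component's actual code):

```python
# Punkt-based sentence tokenization via NLTK; the example text is made up.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the Punkt sentence models

text = "Haystack is an LLM framework. It ships a DocumentSplitter. This splits docs."
sentences = sent_tokenize(text, language="english")
print(sentences)
```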
OK, so what you're suggesting, @sjrl, is to implement a DeepsetDocumentSplitter for this ticket and then the enhanced semantic paragraph splitting via JinaAI in another issue?
I think the DeepsetDocumentSplitter should take priority, but it's up to you whether the enhanced semantic splitting should be done in a different issue / at a different time.
Yes, I agree: DeepsetDocumentSplitter first. The other can follow next sprint or so.
I'll leave this open because we haven't actually done semantic splitting yet; we completed https://github.com/deepset-ai/haystack/pull/8350 instead.
cc @julian-risch
Both LangChain and LlamaIndex have a semantic splitter based on embeddings (https://blog.lancedb.com/chunking-techniques-with-langchain-and-llamaindex/). From the code (https://github.com/langchain-ai/langchain-experimental/blob/main/libs/experimental/langchain_experimental/text_splitter.py), I see they split the text into sentences, embed each sentence, and compute the cosine distance between each sentence and the next; if the distance is low, the two sentences go into the same chunk.
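A minimal sketch of that idea, assuming sentence-transformers for the embeddings (the model name and the 0.3 threshold are illustrative choices, not values from the LangChain/LlamaIndex code):

```python
# Embedding-based chunking as described above: embed each sentence,
# compute the cosine distance between adjacent sentences, and start a
# new chunk whenever the distance jumps above a threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.3) -> list[list[str]]:
    if not sentences:
        return []
    # Unit-normalized embeddings make cosine similarity a plain dot product.
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks = [[sentences[0]]]
    for prev, curr, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        distance = 1.0 - float(np.dot(prev, curr))
        if distance > threshold:
            chunks.append([sentence])    # semantic break: start a new chunk
        else:
            chunks[-1].append(sentence)  # similar enough: extend current chunk
    return chunks
```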
What's the current status of this issue?
- It started with a proposal from @sjrl to enhance Haystack with more robust chunking methods (some of these methods are linked at the beginning of the issue)
- It then diverged into adding a component based on the NLTK sentence tokenization algorithm - already done here
- I then reviewed and added a few more chunking methods, and there's now this PR for the RecursiveChunking (a sketch of the recursive idea follows below)
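To illustrate the recursive idea (a simplified sketch of the general technique, not the PR's implementation; the separators and `max_length` are illustrative):

```python
# Recursive chunking: split on the coarsest separator first, and only
# recurse with finer separators on pieces that are still too long.
# Simplified: a real implementation would also merge small pieces back
# up to max_length and handle separator-less overlong text.
def recursive_chunks(text: str, max_length: int = 200,
                     separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    if len(text) <= max_length or not separators:
        return [text]
    head, *rest = separators
    chunks: list[str] = []
    for piece in text.split(head):
        if len(piece) <= max_length:
            chunks.append(piece)
        else:
            # Piece is still too long: retry with the next, finer separator.
            chunks.extend(recursive_chunks(piece, max_length, tuple(rest)))
    return chunks
```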
This issue seems to be more of a reference to different chunking methods than a specific request. I think it would be best to keep track of these methods on a Notion page and close this issue.
What do you think @sjrl @julian-risch ?
@davidsbatista yes closing this sounds good to me!