Support late chunking
Feature request
It would be nice if the library supported late chunking, optionally activated by a parameter passed with the request.
This feature is available, for example, in the Jina AI API, via the parameter "late_chunking": True.
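For reference, a request to the Jina AI API with late chunking enabled might look like the following (endpoint and parameter as documented by Jina; the model name is just an example):

```python
import requests

# With late_chunking enabled, the API concatenates the input chunks,
# embeds them in one pass, and returns one embedding per input item.
response = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": "Bearer <JINA_API_KEY>"},
    json={
        "model": "jina-embeddings-v3",
        "input": ["chunk 1 of the document", "chunk 2 of the document"],
        "late_chunking": True,
    },
)
embeddings = [item["embedding"] for item in response.json()["data"]]
```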
Motivation
Late chunking allows efficient use of long-context embedding models: the full document (or a large section of it) is embedded in a single pass, and chunk embeddings are then pooled from the contextualized token embeddings, so each chunk preserves the context of the surrounding text.
Your contribution
I don't have a good understanding of the whole codebase yet, but I could work on a PR once we agree on a suitable approach.
I initially thought of passing a late_chunking parameter with the embedding request, but since no such parameter exists in the OpenAI specification, I would instead add an engine arg to enable the feature (see the sketch below). The behaviour of the embedding model would then be fixed (late chunking always applied), which is most likely the preferred behaviour anyway. The worst case scenario would be deploying a model twice, once with and once without late chunking.
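As a rough illustration of the engine-arg approach (all names below are hypothetical, invented for this sketch, and do not exist in the library today):

```python
from dataclasses import dataclass

# Hypothetical sketch only: "late_chunking" is a proposed engine arg,
# not an existing one. Fixing it at deployment time means every request
# served by this engine applies late chunking.
@dataclass
class EngineArgs:
    model_name_or_path: str
    engine: str = "torch"
    late_chunking: bool = False  # proposed flag

args = EngineArgs(
    model_name_or_path="jinaai/jina-embeddings-v3",
    late_chunking=True,
)
```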
I also assume that late chunking would be implemented for text embeddings only. Which engines to support is a separate question: I can do the implementation for Torch (SentenceTransformers), along the lines of the sketch below, but I'm not sure about Optimum (ONNX).
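As a rough sketch of what the Torch path could look like (the model name is only an example, and deriving chunk spans from a text splitter as well as batching are omitted), late chunking comes down to embedding the whole document once and mean-pooling the contextualized token embeddings per chunk:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example long-context model; any model exposing token-level hidden
# states would work the same way.
model_name = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

def late_chunk(document: str, spans: list[tuple[int, int]]) -> torch.Tensor:
    """Embed the whole document once, then mean-pool token embeddings
    per chunk span, so each chunk keeps document-level context."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_embs = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    # Mean-pool the contextualized token embeddings over each span.
    chunks = [token_embs[start:end].mean(dim=0) for start, end in spans]
    return torch.nn.functional.normalize(torch.stack(chunks), dim=-1)

# Usage: two chunks covering token ranges [1, 50) and [50, 120).
# vectors = late_chunk(long_text, [(1, 50), (50, 120)])
```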