
Add split_length by token in preprocessor

yudataguy opened this issue 2 years ago · 2 comments

Is your feature request related to a problem? Please describe. LLMs like ChatGPT measure input in tokens, not words. To make the most of embeddings and context windows, it would be best to support splitting by tokens.

Describe the solution you'd like Add another split option, token, and measure chunk size in tokens.

Other packages like langchain and llama-index already have this feature. A rough sketch of the idea is shown below.
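
For illustration, here is a minimal sketch of what token-based splitting could look like, using the tiktoken tokenizer. The function name `split_by_tokens` and its parameters are hypothetical names chosen for this example; they are not Haystack's actual API.

```python
# Illustrative sketch only; not Haystack's actual preprocessor API.
# Assumes the `tiktoken` package is installed. `split_by_tokens`,
# `split_length`, and `split_overlap` are hypothetical names.
import tiktoken


def split_by_tokens(text: str, split_length: int = 256, split_overlap: int = 0,
                    model: str = "gpt-3.5-turbo") -> list[str]:
    """Split `text` into chunks of at most `split_length` tokens each."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)

    chunks = []
    step = max(split_length - split_overlap, 1)  # stride between chunk starts
    for start in range(0, len(tokens), step):
        window = tokens[start:start + split_length]
        chunks.append(enc.decode(window))
        if start + split_length >= len(tokens):
            break
    return chunks


# Example: 256-token chunks with a 32-token overlap.
# chunks = split_by_tokens(long_text, split_length=256, split_overlap=32)
```

Measuring chunks in tokens rather than words keeps each chunk within a model's context or embedding limit regardless of how the tokenizer breaks up the text.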

yudataguy avatar May 22 '23 18:05 yudataguy

Hi @yudataguy, that's an interesting idea for an additional feature! Our core engineering team won't be able to work on it in the next two sprints, but maybe you or somebody else from the community would like to work on it and open a PR? That would be more than welcome! 🙂 Here are our contributor guidelines.

julian-risch avatar May 24 '23 14:05 julian-risch

Hi @julian-risch. I've created a PR for this. It would be great if someone could take a look!

benheckmann avatar Jul 05 '23 08:07 benheckmann

Done in #5276.

anakin87 avatar Feb 16 '24 16:02 anakin87