
Add split_length by token in preprocessor

yudataguy opened this issue 2 years ago · 2 comments

Is your feature request related to a problem? Please describe. LLMs like ChatGPT measure input in tokens, not words. To make the most of embeddings and context windows, it would be best to support splitting by tokens.

Describe the solution you'd like Add another split option, token, and measure chunk size in tokens.

Other packages like langchain and llama-index already have this feature. A rough sketch of the idea is shown below.
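
For illustration, here is a minimal sketch of what token-based splitting could look like, using the tiktoken tokenizer. The function name `split_by_tokens` and its parameters are hypothetical names chosen for this example; they are not Haystack's actual API.

```python
# Illustrative sketch only; not Haystack's actual preprocessor API.
# Assumes the `tiktoken` package is installed. `split_by_tokens`,
# `split_length`, and `split_overlap` are hypothetical names.
import tiktoken


def split_by_tokens(text: str, split_length: int = 256, split_overlap: int = 0,
                    model: str = "gpt-3.5-turbo") -> list[str]:
    """Split `text` into chunks of at most `split_length` tokens each."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)

    chunks = []
    step = max(split_length - split_overlap, 1)  # stride between chunk starts
    for start in range(0, len(tokens), step):
        window = tokens[start:start + split_length]
        chunks.append(enc.decode(window))
        if start + split_length >= len(tokens):
            break
    return chunks


# Example: 256-token chunks with a 32-token overlap.
# chunks = split_by_tokens(long_text, split_length=256, split_overlap=32)
```

Measuring chunks in tokens rather than words keeps each chunk within a model's context or embedding limit regardless of how the tokenizer breaks up the text.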

yudataguy avatar May 22 '23 18:05 yudataguy

Hi @yudataguy, that's an interesting idea for an additional feature! Our core engineering team won't be able to work on it in the next two sprints, but maybe you or somebody else from the community would like to work on it and open a PR? That would be more than welcome! 🙂 Here are our contributor guidelines.

julian-risch avatar May 24 '23 14:05 julian-risch

Hi @julian-risch. I've created a PR for this. It would be great if someone could take a look!

benheckmann avatar Jul 05 '23 08:07 benheckmann

Done in #5276.

anakin87 avatar Feb 16 '24 16:02 anakin87