Add split_length by token in preprocessor
Is your feature request related to a problem? Please describe.
With LLMs like ChatGPT, the unit of measurement is the token, not the word. To make the most of embeddings etc., it would be best to have a split-by-token feature.
Describe the solution you'd like
Add another split choice, `token`, so that chunk length is measured in tokens.
Other packages like LangChain and LlamaIndex already have this feature.
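For illustration, token-based splitting could look roughly like the sketch below. This is a minimal example assuming the `tiktoken` package; `split_by_token` is a hypothetical helper for this issue, not Haystack's actual PreProcessor API.

```python
# Hypothetical sketch of token-based splitting (assumes tiktoken is installed;
# this is not Haystack's actual API).
import tiktoken


def split_by_token(text: str, split_length: int, model: str = "gpt-3.5-turbo") -> list[str]:
    """Split text into chunks of at most split_length tokens."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    # Slice the token sequence into fixed-size windows, then decode each
    # window back to text so chunk boundaries fall on token boundaries.
    return [
        enc.decode(tokens[i : i + split_length])
        for i in range(0, len(tokens), split_length)
    ]
```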
Hi @yudataguy, that's an interesting idea for an additional feature! Our core engineering team won't be able to work on it in the next two sprints, but maybe you or somebody else from the community would like to work on it and open a PR? That would be more than welcome! 🙂 Here are our contributor guidelines.
Hi @julian-risch. I created a PR for this. It would be great if someone could take a look!
Done in #5276.