langchain
What is the best size for a chunk?
Issue you'd like to raise.
I used RecursiveCharacterTextSplitter.from_tiktoken_encoder to split a document. If I set chunk_size to 2000, OpenAI cannot answer my question from the documents, but if I set chunk_size to 500, it works very well. As a rule of thumb, what is the best size for a chunk?
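For reference, a minimal sketch of the setup being described; `document.txt` and the `chunk_overlap` value are assumed placeholders, not part of the original report:

```python
# Compare chunk sizes with LangChain's RecursiveCharacterTextSplitter,
# using tiktoken so chunk_size is measured in OpenAI tokens.
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("document.txt") as f:  # placeholder file name
    text = f.read()

for chunk_size in (2000, 500):
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size,
        chunk_overlap=50,  # some overlap usually helps preserve context
    )
    chunks = splitter.split_text(text)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks")
    print(chunks[0][:200])  # eyeball the first chunk to sanity-check the splits
```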
Suggestion:
No response
There is no straightforward answer; it comes down to trial and error. A few points to look at when deciding (see the sketches after this list):

- Check the splits after loading. Use vector_store.similarity_search_with_score() and make sure the retrieved chunks make sense for your question.
- Experiment with k and score_threshold.
- Try a different splitter and make sure your splits are correct.
- If none of that works, your data is likely being parsed poorly; do some post-processing on it: clean it up, add some manual text-splitting characters, etc.
- Try compressor techniques to extract the crux of the data if required.
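A self-contained sketch of the first two points; `splits`, the sample question, FAISS, and the 0.5 threshold are illustrative assumptions, not prescriptions:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

splits = ["chunk one ...", "chunk two ..."]  # stand-in for your own split documents
vector_store = FAISS.from_texts(splits, OpenAIEmbeddings())

# Inspect the retrieved splits and their scores for a sample question;
# if the top results look unrelated, the splitting itself is the problem.
for doc, score in vector_store.similarity_search_with_score("your question", k=4):
    print(f"score={score:.3f}  {doc.page_content[:100]}")

# Experiment with k and score_threshold on the retriever.
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 4, "score_threshold": 0.5},
)
docs = retriever.get_relevant_documents("your question")
```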
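For the last point, LangChain's ContextualCompressionRetriever is one concrete compressor technique: it runs each retrieved chunk through an LLM extractor that keeps only the passages relevant to the query. A minimal sketch, reusing the vector_store from above:

```python
from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Wrap the base retriever so each retrieved chunk is trimmed down to
# the parts the LLM judges relevant to the question.
compressor = LLMChainExtractor.from_llm(ChatOpenAI(temperature=0))
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
)
compressed_docs = compression_retriever.get_relevant_documents("your question")
```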
Hi, @MIMI180306! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you were asking for guidance on determining the optimal chunk size when using RecursiveCharacterTextSplitter.from_tiktoken_encoder. You found that a chunk size of 500 worked well, but you were looking for a general rule of thumb. SDcodehub suggested inspecting the splits, experimenting with k and score_threshold, trying a different splitter, and considering post-processing or data cleaning if the current approach doesn't work. They also mentioned using compressor techniques if needed.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!