langchain
What is the best size for a chunk?
Issue you'd like to raise.
I used RecursiveCharacterTextSplitter.from_tiktoken_encoder to split a document. If I set chunk_size to 2000, OpenAI cannot answer my question from the documents, but if I set chunk_size to 500, it works very well. As a rule of thumb, what is the best size for a chunk?
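For reference, a minimal sketch of the setup being described; `document.txt` and the `chunk_overlap` value are assumed placeholders, not part of the original report:

```python
# Compare chunk sizes with LangChain's RecursiveCharacterTextSplitter,
# using tiktoken so chunk_size is measured in OpenAI tokens.
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("document.txt") as f:  # placeholder file name
    text = f.read()

for chunk_size in (2000, 500):
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size,
        chunk_overlap=50,  # some overlap usually helps preserve context
    )
    chunks = splitter.split_text(text)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks")
    print(chunks[0][:200])  # eyeball the first chunk to sanity-check the splits
```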
Suggestion:
No response
There is no straightforward answer; it comes down to trial and error. A few points to look at when deciding (see the sketches after this list):

- Check the splits after loading. Use vector_store.similarity_search_with_score() and make sure the retrieved chunks make sense for your question.
- Experiment with k and score_threshold.
- Try a different splitter and make sure your splits are correct.
- If none of that works, your data is likely being parsed poorly; do some post-processing on it: clean it up, add some manual text-splitting characters, etc.
- Try compressor techniques to extract the crux of the data if required.
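A self-contained sketch of the first two points; `splits`, the sample question, FAISS, and the 0.5 threshold are illustrative assumptions, not prescriptions:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

splits = ["chunk one ...", "chunk two ..."]  # stand-in for your own split documents
vector_store = FAISS.from_texts(splits, OpenAIEmbeddings())

# Inspect the retrieved splits and their scores for a sample question;
# if the top results look unrelated, the splitting itself is the problem.
for doc, score in vector_store.similarity_search_with_score("your question", k=4):
    print(f"score={score:.3f}  {doc.page_content[:100]}")

# Experiment with k and score_threshold on the retriever.
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 4, "score_threshold": 0.5},
)
docs = retriever.get_relevant_documents("your question")
```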
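For the last point, LangChain's ContextualCompressionRetriever is one concrete compressor technique: it runs each retrieved chunk through an LLM extractor that keeps only the passages relevant to the query. A minimal sketch, reusing the vector_store from above:

```python
from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Wrap the base retriever so each retrieved chunk is trimmed down to
# the parts the LLM judges relevant to the question.
compressor = LLMChainExtractor.from_llm(ChatOpenAI(temperature=0))
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
)
compressed_docs = compression_retriever.get_relevant_documents("your question")
```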
Hi, @MIMI180306! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you were asking for guidance on determining the optimal chunk size when using RecursiveCharacterTextSplitter.from_tiktoken_encoder. You found that a chunk size of 500 worked well, but you were looking for a general rule of thumb. SDcodehub suggested inspecting the splits, experimenting with k and score_threshold, trying a different splitter, and considering post-processing or data cleaning if the current approach doesn't work. They also mentioned using compressor techniques if needed.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!