
What is the best size for a chunk?

Open MIMI180306 opened this issue 2 years ago • 1 comments

Issue you'd like to raise.

I used RecursiveCharacterTextSplitter.from_tiktoken_encoder to split a document. If I set chunk_size to 2000, OpenAI cannot answer my question from the documents; if I set chunk_size to 500, it works very well. As a rule of thumb, what is the best size for a chunk?
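
To make the effect of chunk_size concrete, here is a minimal sketch that splits a document into fixed-size token windows. It is not the LangChain implementation: `split_by_tokens` is a hypothetical helper, and whitespace-separated words stand in for tiktoken tokens (an assumption made so the example runs without extra dependencies).

```python
def split_by_tokens(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size whitespace tokens.

    Whitespace tokens are a stand-in for tiktoken tokens (assumption).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# A synthetic 4000-token document, split at the two sizes from the question.
document = " ".join(f"word{i}" for i in range(4000))
for size in (2000, 500):
    chunks = split_by_tokens(document, chunk_size=size)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```

With a larger chunk_size each retrieved chunk carries more unrelated text alongside the relevant passage, which is one plausible reason the 2000-token setting answered worse here.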

Suggestion:

No response

MIMI180306 avatar Jul 07 '23 05:07 MIMI180306

There is no straightforward answer; it comes down to trial and error. Here are a few points to look at when deciding:

  1. Check the splits after loading with vector_store.similarity_search_with_score() and make sure the returned chunks make sense for your question.
  2. Experiment with k and score_threshold.
  3. Try a different splitter, and make sure your splits are correct.
  4. If none of that works, your data probably has sparse answers; do some post-processing on the data, clean it up, add some manual text-splitting characters, etc.
  5. Try the compressor techniques to get the crux of the data if required.
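
Points 1 and 2 can be sketched as follows. This is a toy stand-in, not the vector store API: `cosine_similarity` and `retrieve` are hypothetical helpers, bag-of-words cosine similarity replaces real embeddings, and higher scores are assumed to be better (in an actual vector store, whether higher or lower is better depends on the store, so check before choosing a threshold).

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity; a toy replacement for embeddings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=4, score_threshold=0.0):
    """Return up to k (chunk, score) pairs scoring at least score_threshold."""
    scored = [(chunk, cosine_similarity(query, chunk)) for chunk in chunks]
    scored = [(chunk, score) for chunk, score in scored if score >= score_threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return scored[:k]


# Made-up example chunks, only for illustration.
chunks = [
    "refunds are issued within 14 days of purchase",
    "shipping takes 3 to 5 business days",
    "our office is closed on public holidays",
]

# Print each retrieved chunk with its score and check it fits the question.
for chunk, score in retrieve("how many days for refunds", chunks, k=2, score_threshold=0.1):
    print(f"{score:.2f}  {chunk}")
```

If the top-scoring chunks do not contain the answer, adjust the chunk size or splitter first; if they do but weaker chunks crowd them out, tighten k or score_threshold.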

SDcodehub avatar Jul 07 '23 09:07 SDcodehub

Hi, @MIMI180306! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were asking for guidance on determining the optimal chunk size when using LangChain's RecursiveCharacterTextSplitter with OpenAI. You found that a chunk size of 500 worked well, but you were looking for a general rule of thumb. SDcodehub suggested inspecting the splits, trying a different splitter, experimenting with k and score_threshold, and considering post-processing or data cleaning if the current approach doesn't work. They also mentioned using compressor techniques if needed.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot] avatar Oct 06 '23 16:10 dosubot[bot]