Provide new chunking strategies in localdocs

Open manyoso opened this issue 1 year ago • 4 comments

Currently we do very simple character/word-based chunking. We should enhance our chunking strategies, possibly to include:

  • Recursive Character Chunking
  • Token-Based Chunking
  • Document-Specific Chunking (HTML, MD, Python, CPP, etc.)
  • Semantic Chunking
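
To make the first option concrete, here is a minimal sketch of recursive character chunking. It is not the LocalDocs implementation; the function name `recursive_chunk`, the `max_chars` parameter, and the default separator order are all assumptions chosen for illustration. The idea is to split on the coarsest separator first (paragraph breaks), pack adjacent pieces into chunks up to the size limit, and only fall back to finer separators (lines, then words, then a hard cut) for pieces that are still too long.

```python
from typing import List, Sequence


def recursive_chunk(
    text: str,
    max_chars: int = 200,
    separators: Sequence[str] = ("\n\n", "\n", " "),  # assumed order: coarse to fine
) -> List[str]:
    """Split text into chunks of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left to try: fall back to a hard fixed-width cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            # The piece still fits, so keep packing it into the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # The piece alone is too long: recurse with the finer separators.
            chunks.extend(recursive_chunk(part, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "aaa bbb ccc\n\nddd eee"
    for chunk in recursive_chunk(sample, max_chars=7):
        print(repr(chunk))
```

A token-based variant could follow the same structure but measure length with a tokenizer instead of `len`, so that chunk sizes line up with the embedding model's limits rather than with character counts.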

Here is some possible literature:

  • https://research.trychroma.com/evaluating-chunking
  • https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag
  • https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d

manyoso, Jul 10 '24 14:07