
Implement chunking

Open shinjanc opened this issue 1 year ago • 6 comments

Question

I want to ingest 150-200 files of 15-20 pages each, query them, and have answers generated from multiple files. At present, answers cite only 2 sources. Is chunking the way out? If so, how do I implement it? Sample code would be appreciated.

shinjanc avatar Dec 17 '24 08:12 shinjanc

What do you want to change about chunking? Right now, chunking is by sentence, so each document should generate N chunks that can be retrieved independently. If not enough chunks are being retrieved, you should increase `similarity_top_k` in ChatService.

jaluma avatar Jan 07 '25 08:01 jaluma

I have already updated all parameters so that privateGPT quotes at least 10 sources. My current setup:

- Chunking: sentence chunking
- LLM: Llama 3.1 8B
- Embedding: nomic-embed-text
- Vector storage: Qdrant
- Context window: 32000

I am still getting only 4 sources, and I want to maximise the number of sources. How do I achieve this? I have tried multiple things and asked in the Discord community, but no one has been able to help.

shinjanc avatar Jan 07 '25 16:01 shinjanc


The LLM and context window won't change anything in the search part. Can you comment out the similarity value to check? You can change it in `settings.yaml`. If that doesn't work, you should switch your embedding model to one with richer representational capacity.
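For illustration, the relevant `settings.yaml` section looks roughly like this (key names assumed from recent privateGPT versions; verify against your own `settings.yaml`):

```yaml
rag:
  # Number of chunks retrieved per query; raise this to surface more sources.
  similarity_top_k: 10
  # Minimum similarity score a chunk must reach to be returned.
  # Comment it out (or lower it) to disable the cutoff while testing.
  # similarity_value: 0.45
```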

jaluma avatar Jan 08 '25 09:01 jaluma

The similarity value is already disabled.

shinjanc avatar Jan 08 '25 09:01 shinjanc

This could be a possible approach:

1. Fine-tune chunking: You are currently using sentence-level chunking, which may not produce enough distinct chunks for retrieval. Switch to paragraph-based chunking, or a hybrid approach where sentences are grouped into small paragraphs (e.g. 3-5 sentences per chunk).

2. Adjust similarity retrieval: Increase `similarity_top_k` in your ChatService settings so more chunks are retrieved. For example, if it is currently set to 10, try 15 or 20.

3. Embed smaller, more granular chunks: Shorter chunks can yield more precise embeddings, improving the diversity of retrieved sources. Avoid chunks that are too small, though, as they may lose context.

4. Upgrade the embedding model: If the embedding model (nomic-embed-text) is not capturing sufficient semantic relationships, switch to one with richer representations, such as OpenAI's text-embedding-ada-002 or Cohere embeddings.

5. Combine embeddings across documents: Add logic to ensure embeddings from all documents are queried together, augmenting the query pipeline to pull from multiple documents explicitly.
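Point 1 above (grouping sentences into small paragraphs with a little overlap) can be sketched in plain Python. This is a hedged illustration, not privateGPT's actual implementation; the function name, the naive regex sentence splitter, and the parameters are all assumptions for the example:

```python
import re

def sentence_group_chunks(text, sentences_per_chunk=4, overlap=1):
    """Group sentences into paragraph-sized chunks with a small overlap,
    so retrieval can surface more distinct but still coherent passages."""
    # Naive split on ., ! or ? followed by whitespace. A real pipeline
    # would use a proper sentence tokenizer instead of this regex.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Step forward by (chunk size - overlap) so consecutive chunks share
    # `overlap` sentences of context.
    step = max(1, sentences_per_chunk - overlap)
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break
    return chunks
```

In an actual privateGPT deployment the equivalent knob would be the node parser / chunking configuration it inherits from llama-index, but the grouping logic is the same idea.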

vinodsuresh95 avatar Jan 14 '25 07:01 vinodsuresh95