No relevant chunks when attempting to chat with uploaded PDFs

Open mike-mathieu opened this issue 6 months ago • 0 comments

Description

I am unable to retrieve many (any?) relevant chunks when attempting to chat with uploaded PDFs.

I have imported about 20 PDFs, each roughly 20 pages of text. The importing goes well and I can see the text has been extracted properly. However, when I prompt the chat it retrieves chunks from a few different documents, but the context pulled from those documents is not very relevant to the prompt. Note that I believe these are fairly basic prompts that I would expect it to handle easily (but maybe I am naive). Is there anything that should be tweaked in the codebase, especially with regard to the Embedder or Retriever?

If this is simply a matter of trying different settings/configs, please let me know, as I am new to RAG 🙏, but it feels like it should work at least a little more reliably than this for some basic prompts. Thank you in advance.

Things I have tried:

  • changing chunk size to 512 and overlap to 100
  • changing chunk size to 250 and overlap to 50
  • hardcoding the OllamaGenerator context_window size in the repo from 10000 -> 100000

Is this a bug or a feature?

  • [ ] Bug
  • [ ] Feature

Steps to Reproduce

Basic setup using pip install or repo clone.

.env ->
  OLLAMA_URL=http://localhost:11434
  OLLAMA_MODEL=llama3.1:latest
  OLLAMA_EMBED_MODEL=mxbai-embed-large:latest

(^ Note that these all appear correctly imported in the OVERVIEW)

Import ~20 PDFs that contain ~20 pages of text each.

Ask a question in chat about the documents.

Additional context

Screenshot of chat example: (Note there are probably a dozen references throughout the PDFs to "310 Second Street")

[Screenshot of chat example, 2024-08-14 11:25 PM]
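Since the phrase "310 Second Street" appears verbatim in the source PDFs, one quick diagnostic (my own suggestion, assuming you can dump the chunk texts after import) is a literal substring scan over the chunks. If the phrase is present in the chunks but the retriever still misses them, the problem is in embedding/ranking rather than in PDF extraction or chunking:

```python
def find_literal(chunks, phrase):
    """Return indices of chunks containing the phrase (case-insensitive)."""
    phrase = phrase.lower()
    return [i for i, chunk in enumerate(chunks) if phrase in chunk.lower()]

# Toy chunk texts standing in for the real imported chunks.
chunks = [
    "The property at 310 Second Street was appraised in 2023.",
    "Zoning rules for commercial lots are described in section 4.",
]
hits = find_literal(chunks, "310 Second Street")
```

Exact-phrase queries like an address are a known weak spot for purely dense retrieval; if the scan confirms the chunks exist, a keyword or hybrid search mode (if available in your setup) may rank them better.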

mike-mathieu · Aug 15 '24 03:08