Verba
No relevant chunks when attempting to chat with uploaded PDFs
Description
I am unable to retrieve many (any?) relevant chunks when chatting with uploaded PDFs.
I have imported about 20 PDFs, each roughly 20 pages of text. The import goes well and I can see the text has been extracted properly. However, when I prompt the chat, it retrieves chunks from a few different documents, but the context pulled from those documents is not very relevant to the prompt. These are fairly basic prompts that I would expect it to handle easily (but maybe I am naive). Is there anything that should be tweaked in the codebase, especially with regard to the Embedder or Retriever?
If this is simply a matter of trying different settings/configs, please let me know, as I am new to RAG 🙏, but it feels like it should work at least a little more reliably than this for basic prompts. Thank you in advance.
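One way to check whether the embedder itself is the weak link (independent of Verba's retriever) is to embed a query and a known-relevant passage directly and compare their cosine similarity. A rough sketch using Ollama's `/api/embeddings` endpoint; the helper names, the sample texts, and the exact response shape are my assumptions, not Verba code:

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434"        # matches the .env below
EMBED_MODEL = "mxbai-embed-large:latest"     # matches OLLAMA_EMBED_MODEL


def embed(text: str) -> list[float]:
    """Request an embedding from Ollama's /api/embeddings endpoint."""
    payload = json.dumps({"model": EMBED_MODEL, "prompt": text}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm


if __name__ == "__main__":
    try:
        # Hypothetical query/passage pair; substitute text from your PDFs.
        q = embed("What is the address of the property?")
        chunk = embed("The subject property is located at 310 Second Street.")
        print(f"query vs. relevant chunk: {cosine(q, chunk):.3f}")
    except OSError:
        print("Ollama is not reachable at", OLLAMA_URL)
```

If a query and an obviously relevant chunk score barely higher than unrelated text, the problem is upstream of the retriever settings.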
Things I have tried:
- changing chunk size to 512 and overlap to 100
- changing chunk size to 250 and overlap to 50
- hardcoding the OllamaGenerator context_window size in the repo from 10000 -> 100000
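For reference, sliding-window chunking with overlap behaves roughly like the sketch below. This is a generic character-level approximation I wrote to illustrate the settings above, not Verba's actual chunker:

```python
def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split `text` into windows of `size` units, where consecutive
    chunks share `overlap` units (so each window advances size - overlap)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With size 250 and overlap 50, each window advances 200 units, so a short fact like an address should land intact in at least one chunk; smaller chunks mainly change how much surrounding context each retrieved hit carries.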
Is this a bug or a feature?
- [ ] Bug
- [ ] Feature
Steps to Reproduce
Basic setup using pip install or repo clone.
.env contents:
- OLLAMA_URL=http://localhost:11434
- OLLAMA_MODEL=llama3.1:latest
- OLLAMA_EMBED_MODEL=mxbai-embed-large:latest
(Note that these all appear correctly imported in the OVERVIEW)
Import ~20 PDFs that contain ~20 pages of text each.
Ask a question in chat about the documents.
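As part of reproducing this, it may be worth confirming that the Ollama server actually serves both models before importing documents. A sketch against Ollama's `GET /api/tags` endpoint; the helper names and required-model list are my assumptions:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same value as in the .env above


def parse_model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_models(base_url: str) -> list[str]:
    """Ask the Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return parse_model_names(json.load(resp))


if __name__ == "__main__":
    try:
        names = list_models(OLLAMA_URL)
        print("available models:", names)
        for required in ("llama3.1:latest", "mxbai-embed-large:latest"):
            print(required, "OK" if required in names else "MISSING")
    except OSError:
        print("Ollama is not reachable at", OLLAMA_URL)
```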
Additional context
Screenshot of a chat example: (note that there are probably a dozen references to "310 Second Street" throughout the PDFs)