Debugging a critical failure
Before building quivr out further, I would like to look into this problem: https://twitter.com/bitfalls/status/1657736877007405056?s=20
Specifically, I fed it around 30 markdown documents, and it does not seem to follow context; it has no real grasp of the corpus as a whole. Every article has a "status" field at the top. When I ask it to list all active ones, it returns a seemingly random subset, yet when I ask for the status of a specific document I know is active, it correctly says it is active. So it can answer a targeted status check, but it cannot reason over the whole corpus of information. I'd like to understand why this happens and how to debug it, because until this is resolved, Quivr is dangerously unreliable to use.
@StanGirard any ideas on what to do to test this and come to a satisfactory output?
I can see two potential issues:
- The prompt: the retrieval step somehow failed to locate all the relevant document embeddings for that query.
- The document splitting: the retrieved chunk doesn't contain all the information needed to answer the question.
I think one thing we could do is add a debug mode that shows which documents were retrieved and actually used to answer the question. It would be super helpful for debugging :D
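A rough illustration of what that debug mode could look like (a hedged sketch, not Quivr's actual code: `return_source_documents` is real LangChain API, but the wiring is assumed, and `retriever` stands in for whatever vector store retriever Quivr uses):

```python
# Hedged sketch of a "debug mode": surface the retrieved chunks
# alongside the answer so we can see what the model actually saw.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    retriever=retriever,  # assumed: Quivr's vector store retriever
    return_source_documents=True,
)
result = qa({"query": "Which companies have status: active?"})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata.get("source"), "->", doc.page_content[:80])
```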
@Swader thx for submitting the issue 👍
Hey @Swader, thanks for the issue.
The issue might come from an omission on my part: when I created Quivr, I didn't update the default settings of the VectorStore retriever.
What does that mean? Even if 20 of your files contain the information, the retriever might only fetch the top 5.
We need to implement a way of increasing the retriever's reach, but we face some issues:
- Users can choose their own chunk size, so if we set the retriever to return 100 chunks we can exceed the model's max token limit.
- The default chunk size is probably suboptimal for a second brain.
More info on retrievers here https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/vectorstore-retriever.html
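For illustration, a minimal sketch of raising the retriever's top-k in LangChain (FAISS and the toy documents here are stand-ins, not Quivr's actual stack):

```python
# Hedged sketch: LangChain retrievers default to returning only the
# top few chunks, so corpus-wide questions never see most documents.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

texts = [f"status: active\n# Company {i} notes" for i in range(30)]
store = FAISS.from_texts(texts, OpenAIEmbeddings())

# Raise k so a query spanning the whole corpus can reach every document;
# beware that k * chunk_size must still fit in the model's context window.
retriever = store.as_retriever(search_kwargs={"k": 30})
docs = retriever.get_relevant_documents("Which companies are active?")
```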
I see. So what is the best approach to debugging this? I am hesitant to integrate Quivr any more deeply into my workflow until I can be sure it is aware of the whole context. Would it help if I generated a mock BD database similar to mine, with fake companies and statuses, that we could run evaluations against?
> We need to implement a way of increasing the retriever's reach, but we face some issues:
> - Users can choose their own chunk size, so if we set the retriever to return 100 chunks we can exceed the model's max token limit.
> - The default chunk size is probably suboptimal for a second brain.
How about solving this with Whisper.cpp or LLaMA.cpp models?
@StanGirard how do we re-test for full context awareness?
It has been another month, @StanGirard. Can we get this fully tested end to end?
We have started implementing tests with Pytest for the backend and Vitest for the frontend :)
Feel free to add your own tests.
How can I use tests to make sure it knows the knowledge I expect it to know? Is there a simple guide? I'd love to contribute cases.
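Something like this is roughly what I have in mind (a hedged sketch; the `brain` fixture, its `ask` method, and `make_test_brain` are hypothetical placeholders, not Quivr's actual test API):

```python
# Hedged sketch of a knowledge test: seed a brain with known docs,
# then assert that a corpus-wide question returns the expected set.
import pytest

@pytest.fixture
def brain(tmp_path):
    docs = {
        "acme.md": "status: active\n# Acme",
        "globex.md": "status: inactive\n# Globex",
        "initech.md": "status: active\n# Initech",
    }
    for name, body in docs.items():
        (tmp_path / name).write_text(body)
    return make_test_brain(tmp_path)  # hypothetical helper

def test_lists_all_active_companies(brain):
    answer = brain.ask("List every company with status: active")
    assert "Acme" in answer and "Initech" in answer
    assert "Globex" not in answer
```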
Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.
It is stale, but remains unresolved.
Too many things to do 😁
Seems critical though? End-to-end tests (generate an example DB using ChatGPT, then run E2E tests against it that must pass on every release) could be pretty valuable, no? This failure is in Quivr's core functionality, so I feel it should be addressed, and I want to help, especially now that OpenAI is also pushing trainable data.
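For example, the fake corpus could be generated with a tiny script like this (hedged sketch; the file layout and status values are invented to mirror my real docs):

```python
# Hedged sketch: generate 30 fake markdown docs with a "status" header,
# plus a ground-truth file the E2E assertions can check against.
import random
from pathlib import Path

STATUSES = ["active", "inactive", "paused"]
out = Path("mock_brain")
out.mkdir(exist_ok=True)

active = []
for i in range(30):
    name = f"Company{i:02d}"
    status = random.choice(STATUSES)
    if status == "active":
        active.append(name)
    (out / f"{name.lower()}.md").write_text(
        f"status: {status}\n\n# {name}\n\nFake BD notes for {name}.\n"
    )

# The answer to "list all active companies" must match this set exactly.
(out / "expected_active.txt").write_text("\n".join(sorted(active)))
```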
You could try using langsmith. They have some pretty cool features.
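A minimal sketch of turning tracing on (the env var names below are LangSmith's documented ones; the project name is arbitrary):

```python
# Hedged sketch: with these set, every LangChain run (retrieval, prompt,
# completion) is traced in the LangSmith UI for inspection.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "quivr-debug"  # arbitrary project name
```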
For what exactly, do you mean for the testing? I don't think it's necessary; I think just generating a mock DB that we can test on, and then running simple JS tests for input/output, would be quite adequate.
Hi @Swader, thanks for the suggestion and your interest in Quivr. We're working on a UX change and a backend refactoring at the moment, so we haven't been able to prioritize e2e tests. Feel free to create an issue and a PR for the e2e tests; we will follow it closely, and it would help tremendously!!
Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.