
GPT4All not using local documents

Open Rockavik opened this issue 10 months ago • 25 comments

System Info

GPT4All v2.4.19, Anaconda3, Python 3.10.9

Information

  • [X] The official example notebooks/scripts
  • [ ] My own modified scripts

Related Components

  • [ ] backend
  • [ ] bindings
  • [ ] python-bindings
  • [X] chat-ui
  • [ ] models
  • [ ] circleci
  • [ ] docker
  • [ ] api

Reproduction

  1. Installed GPT4All.
  2. Downloaded the GPT4All Falcon model.
  3. Set up a directory called Local_Docs.
  4. Created CharacterProfile.txt with information about a character and saved it in the Local_Docs folder.
  5. In GPT4All, clicked Settings > Plugins > LocalDocs Plugin.
  6. Added the folder path.
  7. Created a collection named Local_Docs.
  8. Clicked Add.
  9. Clicked the collections icon on the main screen, next to the Wi-Fi icon.
  10. Ticked Local_Docs.
  11. Talked to GPT4All about the material in Local_Docs.

Result: GPT4All does not respond with any material from, or reference to, Local_Docs > CharacterProfile.txt.

Expected behavior

GPT4All should respond with references to the information inside the Local_Docs > CharacterProfile.txt file.

Rockavik avatar Sep 23 '23 12:09 Rockavik

I am having the same issue

Frub3L avatar Sep 25 '23 00:09 Frub3L

Same here. From the documentation: Enabling LocalDocs

  1. Install the latest version of GPT4All Chat from [GPT4All Website](https://gpt4all.io/).
  2. Go to Settings > LocalDocs tab.
  3. Configure a collection (folder) on your computer that contains the files your LLM should have access to. You can alter the contents of the folder/directory at any time. As you add more files to your collection, your LLM will dynamically be able to access them.
  4. Spin up a chat session with any LLM (including external ones like ChatGPT, but be warned that data will leave your machine!)
  5. At the top right, click the database icon and select which collection you want your LLM to know about during your chat session.

That's what I did, but it won't answer with the knowledge I gave it (it generally answers, but I don't know if it hallucinates those answers or what, they're just not what's in LocalDocs at all). What am I missing?

Yemet1001 avatar Sep 28 '23 16:09 Yemet1001

Same here. I tried macOS and Windows :-(

demsarinic avatar Sep 30 '23 19:09 demsarinic

Same here, on the Windows version. I got one seemingly random answer from my docs once, and then GPT4All stopped using the documents.

amichelis avatar Oct 02 '23 12:10 amichelis

gpt4all 2.4.19 could be better, but works for me.

Used the Mini Orca (small) language model. Had two documents in my LocalDocs. The first document was my curriculum vitae. The second document was a job offer. The model was able to use text from these documents as context and write a cover letter for a job application. This was the first time I tried this model and I am not an expert at using the proper syntax, so it took me two hours (!) conversing for it to finally come up with a cover letter that could be a crude template for my real cover letter. I found that my instructions had to be very precise, otherwise it would only give very generic responses that had nothing to do with what I asked it to and very often came up with excuses asking me for more context and details. It helped to rhetorically "attach" the CV to the sender (applicant) and the job offer to the receiver (company).

Here is my final question (real names redacted):

"Write a cover letter one DINA4 page long in the name of sender to receiver. This cover letter is part of a job application for the position of XXX. For the cover letter refer to experiences mentioned in sender's CV. Keep in mind YYY."

ThiloteE avatar Oct 02 '23 16:10 ThiloteE

I had this issue on MacOS. Ultimately I uninstalled the program (using AppCleaner) and removed the files in /Library/Application Support/nomic.ai (but not the model files) to give me a clean slate. Now it's working again.

PLK2 avatar Oct 04 '23 18:10 PLK2

I am seeing this as well, but am unsure if the indexing takes time, or is silent. Is there a way to test this by eliminating all knowledge other than local docs (sort of a crude RAG)?

jafrank88 avatar Oct 18 '23 23:10 jafrank88

I experimented some more:

From a user point of view, these are the things that help at least a tiny little bit:

  1. Use a model with a large context size, as it can handle more input tokens. The more documents you have in LocalDocs, the larger the context that will be added to your prompt, and every model has a maximum context size it was trained for. One such model would be LlongOrca-7b-16k, which can process a context of over 16,000 tokens, whereas more standard models can only process a context of 4096. See https://www.techtarget.com/whatis/definition/context-window to find out more about context window size. Edit: Unfortunately, for various reasons, gpt4all limits the context size to 2048 in its current version (v. 2.5.2).
  2. Maybe add sentences like this into the system prompt: "Use the provided context when answering questions"
  3. As always, have a great prompt. E.g. use step-back-prompting technique, chain-of-thought prompting technique and so on. The wording of your prompt is very important.
  4. Include words and snippets from your local documents to make it more likely this particular context will be prioritized. E.g. if you have a .txt document with character info about Tom, include the name Tom in your prompt.
  5. Increase the values for "Document snippets per prompt" and "Document snippet size (Characters)" under LocalDocs advanced settings. Edit: The document snippet size depends on the maximum the embedding model is designed for, so it is much better to simply increase the number of snippets per prompt, as there is less risk of choosing the wrong value. In 2.5.4, you cannot go above the hardcoded max context size of the non-embedding model (which is currently 2048), as that will severely decrease the quality of responses (and trigger an error message in GPT4All). Edit²: roughly speaking, Max Context Size = User prompt + Response by Model + Snippet Size * Number of Snippets, or something like that. Just do a little bit of experimentation, and if you hit bad output or exceed the max context size, go a little bit lower. While you can go a little higher, the default settings are a good start.
  6. Finally, in 2.5.0 pre-release1, the index apparently only gets created once, namely when you add the collection in the preferences. So avoid adding or deleting a file from the collection folder afterwards, as the index will not automatically be updated. At least, that's my suspicion after some trial and error. Maybe this is fixed in newer versions, but if you want to make sure you have the correct index after adding or removing documents, you have to create a new collection. Edit: in 2.5.4, the index gets automatically updated and there is a progress bar while the index is being built.
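
To make the rule of thumb from tip 5 a bit more concrete, here is a rough sketch of the arithmetic. It is only an estimate under stated assumptions: snippet size is configured in characters while the context limit is in tokens, so the code assumes about 4 characters per token (a common approximation, not something GPT4All documents). The function names and the example numbers are made up for illustration; only the 2048 limit comes from the comment above.

```python
import math

# Context limit mentioned above for GPT4All 2.5.x.
MAX_CONTEXT_TOKENS = 2048


def estimate_context_tokens(prompt_tokens, response_tokens,
                            snippet_chars, num_snippets,
                            chars_per_token=4.0):
    """Prompt + response + snippets, with snippets converted from
    characters to tokens using an assumed chars-per-token ratio."""
    snippet_tokens = (snippet_chars / chars_per_token) * num_snippets
    return prompt_tokens + response_tokens + math.ceil(snippet_tokens)


def fits_context(prompt_tokens, response_tokens, snippet_chars,
                 num_snippets, max_context=MAX_CONTEXT_TOKENS):
    """True if the estimated total stays within the context limit."""
    total = estimate_context_tokens(prompt_tokens, response_tokens,
                                    snippet_chars, num_snippets)
    return total <= max_context


def max_snippets(prompt_tokens, response_tokens, snippet_chars,
                 max_context=MAX_CONTEXT_TOKENS, chars_per_token=4.0):
    """Largest snippet count that still fits, by the same estimate."""
    free_tokens = max_context - prompt_tokens - response_tokens
    tokens_per_snippet = snippet_chars / chars_per_token
    return max(0, int(free_tokens // tokens_per_snippet))


# Example values are arbitrary, not GPT4All defaults.
print(estimate_context_tokens(100, 500, 512, 3))
print(max_snippets(100, 500, 512))
```

If the estimate lands close to the limit, treat it as a hint to lower the snippet count rather than as an exact bound, since real tokenizers vary.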

ThiloteE avatar Oct 19 '23 13:10 ThiloteE

If the following comment from June 2023 on Discord (https://discord.com/channels/1076964370942267462/1096503917262667827/1119879022290141224) is still correct, then content from LocalDocs will only be used if the snippet has at least three words in common with the prompt. o.O

Localdocs parameters

document-snippet-size N: Make every snippet N characters in length.

Document-snippet-size makes all snippets extracted from the documents N characters long. Setting it to something large (for example, 512) increases the likelihood that a snippet will have at least 3 words in common with the prompt, which leads to it being used as context provided to the model during generation. But it will also massively slow down generation, as the model will need to chew through many more tokens of context before it can start to generate.

document-snippets-to-process-per-prompt N: Use only the N best snippets as the context provided to the model.

Document-snippets-to-process-per-prompt determines how many of the best-matching snippets to add as context. Snippets are ranked by how many exact-word matches they have with the prompt (not case sensitive: "now" and "NoW" are a match, but "Dog" and "Dogs" are not). If no snippet has at least 3 matching words with the prompt (the threshold is hard-coded into the app), the prompt is sent straight to the model without any snippets as context.
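
For anyone trying to picture the matching described above, here is a minimal sketch of that kind of ranking. It is not GPT4All's code, just an illustration of the description in the Discord comment. Assumptions: words are split on non-word characters, each distinct word counts once, and ties keep their original order. All function names are hypothetical.

```python
import re

# Threshold quoted in the Discord comment (said to be hard-coded).
MIN_MATCHING_WORDS = 3


def word_set(text):
    """Lower-cased distinct words, so "now" matches "NoW"
    but "Dog" does not match "Dogs"."""
    return set(re.findall(r"\w+", text.lower()))


def rank_snippets(prompt, snippets, top_n):
    """Return up to top_n snippets sharing at least
    MIN_MATCHING_WORDS exact words with the prompt, best first."""
    prompt_words = word_set(prompt)
    scored = []
    for snippet in snippets:
        score = len(prompt_words & word_set(snippet))
        if score >= MIN_MATCHING_WORDS:
            scored.append((score, snippet))
    # Stable sort: equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:top_n]]


snippets = [
    "Tom is a grey cat",
    "Dogs bark loudly",
    "The cat named Tom is the oldest",
]
print(rank_snippets("What color is Tom the cat", snippets, top_n=2))
```

An empty result corresponds to the case where the prompt goes to the model without any context, which would explain answers that ignore the documents when the prompt shares few words with them.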

@Rockavik can you still reproduce this if you create a character sheet that mentions every piece of information three times?

ThiloteE avatar Nov 14 '23 15:11 ThiloteE

First, thanks to ThiloteE for the great tips. Second, I believe most of the "it's not working for me" comments above fall under the "wait for indexing to finish" category. Third, and this is my issue: how long does indexing take for you guys? My folder has 100+ GB of PDFs. Overnight, i.e. in about 12 hours, it got to less than 10%, but it was using 20% of my MacBook Pro's M2 Pro the entire time. I am now trying with a subset of about 10 GB. I am wondering if this is expected behaviour (it says somewhere to wait a minute or two for indexing)? Should I try a clean install?

Thanks in advance for your input, Ben

BenjaminStular avatar Dec 03 '23 07:12 BenjaminStular

This is expected. Adding a large collection to the index will take a long time. I would suggest experimenting with a single-digit number of documents first, before you add a larger collection.

ThiloteE avatar Dec 03 '23 10:12 ThiloteE

Just thought I'd mention a few things re: indexing large data folders:

  1. dividing folders into 1 GB chunks = approx. 1.5x speed
  2. extracting txt from PDFs = approx. 2x speed
  3. throwing it at a Mac Studio w/ M2 Max instead of a MacBook Pro w/ M2 Pro = approx. 5x speed

Bottom line: with a good desktop, 1000 txt files (125 MB) will index in 2-3 days.

Cheers, Ben
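
For the first point, a small sketch of how a large folder could be planned into chunks of roughly 1 GB before adding each chunk as its own collection. This is just one way to do it, assuming a simple greedy grouping in file order; the function names are made up, and the code only plans the groups without moving any files.

```python
from pathlib import Path


def plan_batches(file_sizes, max_batch_bytes=1_000_000_000):
    """Greedily group (name, size_in_bytes) pairs, in the given order,
    into batches of at most max_batch_bytes. A single file larger
    than the limit ends up alone in its own batch."""
    batches = []
    current, current_size = [], 0
    for name, size in file_sizes:
        # Start a new batch when adding this file would exceed the limit.
        if current and current_size + size > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


def file_sizes_in(folder):
    """List (path, size) for every regular file under folder."""
    return [(str(p), p.stat().st_size)
            for p in sorted(Path(folder).rglob("*")) if p.is_file()]


# Tiny example with made-up sizes and a limit of 1000 bytes.
example = [("a.txt", 600), ("b.txt", 500), ("c.txt", 300), ("d.txt", 1200)]
print(plan_batches(example, max_batch_bytes=1000))
```

In practice you would feed it `file_sizes_in("/path/to/your/docs")` (a placeholder path) and then copy each batch into its own folder.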

BenjaminStular avatar Dec 04 '23 22:12 BenjaminStular

There is definitely room for improving the indexing speeds - if it could run without a model loaded we would be able to use the GPU and greatly increase performance. Though I don't think our Vulkan backend currently supports Bert, so there is some work to be done there.

cebtenzzre avatar Dec 05 '23 18:12 cebtenzzre

The database wouldn't be created in my case. I had the directory chosen in local documents setting. I tried removing the file part at the beginning of the directory link also. I tried a single text file with a single line in the directory.

Nothing makes anything show up in the Local Documents database. Restarting GPT4All resets the LocalDocs collections in settings to blank.

GPT4All 2.5.4

Kuiriel avatar Dec 16 '23 04:12 Kuiriel

> Nothing makes anything show up in the Local Documents database.

Try asking on our Discord. You may be missing a step.

cebtenzzre avatar Dec 16 '23 04:12 cebtenzzre

I did miss a step. A really freaking obvious one too, ha.

So I configured the collection, but I never hit the button ADD on the side - because in my brain I was only testing with that one collection. Click ADD and now it's in there.

It's indexing it now.

Very much a PBKAC issue.

Kuiriel avatar Dec 16 '23 05:12 Kuiriel

Maybe GPT4All has difficulty analyzing PDF files when using LocalDocs. Here is the test file: https://arxiv.org/abs/2005.11401

Base model: Mistral OpenOrca. Question: What is the meaning of RAG?

Exp1: (Without Using Local File)

Response1:

The term "RAG" typically refers to a system used for project management, called RAG Status. It stands for Red, Amber, Green (sometimes also known as Traffic Light System). This method uses colors to represent different levels of progress or status in a project, with each color representing a specific meaning:

1. Red (R): Indicates that the task is not on track and requires immediate attention or action. It may be due to delays, issues, or other problems that need resolution.
2. Amber (A): Represents tasks that are currently at risk or potentially off-track but still within a manageable range. These tasks require monitoring and might need additional resources or adjustments in the project plan.
3. Green (G): Signifies tasks that are on track, meeting expectations, and progressing as planned. This color indicates that no immediate action is needed for these tasks, but they should continue to be monitored.

The RAG Status system helps project managers and team members quickly understand the status of various tasks or projects at a glance, enabling better decision-making and more effective resource allocation.

Exp2: (Using Local File)

Response2:

RAG stands for Reasoning with Adaptive Knowledge Graph. It's an AI model designed to reason about information using a knowledge graph and adapt it based on new data. This allows it to generate answers or responses that are more contextually relevant, as opposed to simply searching through a predefined set of documents like traditional search engines do.

---
1. Date: 2023, December 17. In Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf. Page 2. Context
2. Date: 2023, December 17. In Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf. Page 5. Context
3. Date: 2023, December 17. In Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf. Page 10. Context

Apparently, GPT4All didn't get the meaning of RAG from this file.

Exp3: (Using Local File + .txt file)

In the local folder, I added another file, RAG.txt, with the following contents:

RAG denotes: Retrival-Augmented Generation

Then I asked the same question and got the following answer:

RAG stands for "Retrieval-Augmented Generation." It's a model that combines retrieval and generation capabilities, allowing it to access external knowledge from text sources while generating responses. This approach provides more control and interpretability, making it suitable for various scenarios with direct benefits to society.

---
1. Date: 2023, December 17. In Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf. Page 2. Context
2. Date: 2023, December 17. In Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf. Page 10. Context
3. Date: 2023, December 17. In RAG.txt. Context

So maybe there is something wrong with how PDF files are analyzed. Hope the above experiments help 😄

pprp avatar Dec 17 '23 15:12 pprp

Can you specify where the LocalDocs db is created, so that it can live in a separate directory from the LLM gguf file that was used?

kcmclau21122 avatar Jan 14 '24 22:01 kcmclau21122

Had the same problem before. I realized that the download path folder also contains other data, including the cache. NOTE: This was on my Windows PC.

Deleted all files, including embeddings_v0.dat, which solved the indexing and embedding issue. Perhaps you can just delete the embeddings_vX.dat file, which should solve it.

Would recommend adding an embeddings deletion function that forces the current embeddings file to be deleted.

indrawibawa85 avatar Feb 15 '24 11:02 indrawibawa85

Part of the issue might be a change in gpt4all behavior. In the past, when you created a localdocs db, it showed the db and had a bar that showed embedding progress. That bar is gone and the directory does not appear. Is this a bug, a change in behavior, or something else?

jafrank88 avatar Feb 16 '24 00:02 jafrank88

@jafrank88 Have you clicked the "add" button? In GPT4All 2.7.0, the progress bar (indexing) now shows, if you click on the local documents button in the chat gui. I don't think you see it in the settings gui.

ThiloteE avatar Feb 16 '24 01:02 ThiloteE

It worked once I deleted the localdocs_v1.db as well as the embeddings_v0.dat
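
In case it helps others, here is a small sketch for finding those two kinds of files before deleting them by hand. The file name patterns come from this thread (embeddings_v0.dat, localdocs_v1.db); the folder to search is an assumption you have to supply yourself, since it depends on your download path setting. The function name is hypothetical, and the script only lists files, it does not delete anything.

```python
from pathlib import Path

# Name patterns based on the files mentioned in this thread.
INDEX_PATTERNS = ("embeddings_v*.dat", "localdocs_v*.db")


def find_index_files(data_dir):
    """Return LocalDocs index/embedding files directly inside data_dir.
    Only lists files, so you can review them before removing any."""
    root = Path(data_dir)
    found = []
    for pattern in INDEX_PATTERNS:
        found.extend(root.glob(pattern))
    return sorted(found)


# Usage (replace the placeholder with your own download path):
# for path in find_index_files("/path/to/gpt4all/download/folder"):
#     print(path)
```

It is probably safest to quit GPT4All first and move the listed files to a backup folder instead of deleting them outright.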

jafrank88 avatar Feb 16 '24 01:02 jafrank88

Also relevant: https://github.com/nomic-ai/gpt4all/issues/1958

ThiloteE avatar Feb 16 '24 02:02 ThiloteE

Throwing my hat into the ring. With gpt4all 2.7.1, LocalDocs gives hallucinated responses when using GPT4All Falcon. The same prompt works fine with mistral-7b-openorca.Q4_0.

spacklewoof avatar Feb 23 '24 21:02 spacklewoof

This issue is no longer relevant given the old context of a very old version of localdocs that has since been replaced by v3.0

manyoso avatar Jul 11 '24 13:07 manyoso