
[Feature] Setting to turn off automatic reindexing of localDocs collections

Open 3Simplex opened this issue 1 year ago • 4 comments

After long debate I think we've settled on a simple option in localdocs that will turn off all automatic reindexing of localdocs collections.


OLDER ORIGINAL REQUEST

Feature Request

The option to "Lock" a localDocs collection to prevent reindex would be useful to ensure that an important collection remains unchanged. (Larger collections take hours to index and embed.)

[Image: Screenshot 2024-07-11 163757]

  • A "Lock" button which will disable the "remove" and "rebuild" options from the collection.
  • When "Locked", the collection will not be automatically changed for any reason.

[Images: lockCollection1, UnlockCollection1]

3Simplex avatar Jul 11 '24 20:07 3Simplex

I would much rather add an "Are you sure?" dialog to both buttons, and add the "Update" button that we have been lacking for a while, which is like Rebuild but non-destructive. I cannot think of any reason to intentionally have a LocalDocs collection be inconsistent with what is actually on disk, which also outweighs the confusion that would likely be caused by such a situation (since you can't actually inspect the collection to see which files are and aren't in it).

For example, if you are worried that your OneDrive might disconnect and the files will disappear temporarily, you should make a copy of the files instead. There are all manner of sync programs you can use to maintain a copy of a set of files. But trying to build this kind of sync functionality into GPT4All itself seems like unnecessary complexity.

If your use case is suited by e.g. leaving embeddings in cache for some duration in case files are moved or deleted but then restored in short order, I would also prefer that. They could even be cached indefinitely in the collection until you clear the cache. But I don't think GPT4All should ever reference files that currently do not exist at the specified path.
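The caching idea above could be sketched roughly as follows. This is a minimal Python illustration, not GPT4All's actual implementation: `EmbeddingCache`, `sync_collection`, and the `embed` callable are hypothetical names, and keying the cache on a SHA-256 hash of the file contents is an assumption made for the example.

```python
from __future__ import annotations

import hashlib
from typing import Callable


class EmbeddingCache:
    """Keeps embeddings keyed by content hash until explicitly cleared (hypothetical)."""

    def __init__(self, embed: Callable[[bytes], list[float]]) -> None:
        self._embed = embed          # stand-in for the expensive embedding step
        self._by_hash: dict[str, list[float]] = {}
        self.embed_calls = 0         # counts how often the expensive step actually ran

    def get(self, content: bytes) -> list[float]:
        key = hashlib.sha256(content).hexdigest()
        if key not in self._by_hash:
            self.embed_calls += 1
            self._by_hash[key] = self._embed(content)
        return self._by_hash[key]

    def clear(self) -> None:
        self._by_hash.clear()


def sync_collection(
    files_on_disk: dict[str, bytes], cache: EmbeddingCache
) -> dict[str, list[float]]:
    """Build the index from files that exist right now.

    Files missing from disk are left out of the index, but their
    embeddings stay in the cache, so restoring them costs nothing.
    """
    return {path: cache.get(data) for path, data in files_on_disk.items()}
```

With this shape, a file that disappears and later comes back unchanged is re-added to the index without being embedded again, while the index itself never references a path that does not currently exist.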

cebtenzzre avatar Jul 15 '24 15:07 cebtenzzre

From what I can see, the DB stores all the data that the files provide, so it does not rely on the files existing in order to function. It is the program itself that requires the files to exist, and that requirement triggers actions the user may not want. That is, each collection gets an "update db" whenever the files or directory structure within the collection change, or whenever the settings that govern collections change.

[Image: Screenshot 2024-07-15 124033]

I want to choose when my collection is updated. I don't want to rebuild all of my collections because I chose to add a new file type in the settings. I don't want to rebuild all of my collections because one of them needs a larger chunk size and fewer chunks. I don't want to rebuild when I make one small change to a volatile directory that is otherwise fine.

If I have taken the time to embed for several hours, I want the result protected now that it is done.

3Simplex avatar Jul 15 '24 17:07 3Simplex

@cebtenzzre I think in the end this is about having a setting that turns off automatic re-indexing when we discover a change through QFileSystemWatcher. Some users want to manually control re-indexing. Having that setting (not per collection) plus an 'ARE YOU SURE' dialog would, I think, get @3Simplex what he's after.
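A rough sketch of how such a global setting could gate re-indexing, written in Python for brevity rather than in GPT4All's own C++/Qt. Everything named here (`LocalDocsSettings`, `auto_reindex`, `Collection`, `on_directory_changed`, `manual_update`) is hypothetical and only illustrates the control flow; it is not the project's actual code or option name.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LocalDocsSettings:
    # Hypothetical global (not per-collection) switch.
    auto_reindex: bool = True


@dataclass
class Collection:
    name: str
    pending_changes: set[str] = field(default_factory=set)
    reindex_count: int = 0

    def reindex(self) -> None:
        # Stand-in for the expensive chunk-and-embed pass.
        self.reindex_count += 1
        self.pending_changes.clear()


def on_directory_changed(
    settings: LocalDocsSettings, collection: Collection, path: str
) -> bool:
    """Handle a change notification, e.g. one delivered by a file system watcher.

    The change is always recorded; the re-index only runs when the
    setting allows it. Returns True if a re-index ran.
    """
    collection.pending_changes.add(path)
    if not settings.auto_reindex:
        return False
    collection.reindex()
    return True


def manual_update(collection: Collection, confirmed: bool) -> bool:
    """User-triggered update, guarded by an 'are you sure' confirmation."""
    if not confirmed or not collection.pending_changes:
        return False
    collection.reindex()
    return True
```

With `auto_reindex` off, changes accumulate in `pending_changes` and nothing is rebuilt until the user confirms a manual update.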

manyoso avatar Jul 16 '24 13:07 manyoso

I agree with this feature, as I have just been experiencing this myself.

I have a Synology NAS with several collections, and I find that this feature would open GPT4All up to a large number of users, both domestic and commercial. I am testing it for the construction industry (training), which makes sense as I have tier 1 company files on the NAS that are industry standards (engineering reports and Australian standards). Of course, the end game is to train an LLM on these, but I can't recommend gpt4all to architects/builders running with a NAS. This re-indexing on every startup is a show-stopper. So it is a +1 from me on this one.

Edit: I am testing Windows mapped drives to avoid the re-indexing; if that proves stable, it is a workaround for files on a NAS.

dgcruzing avatar Jul 26 '24 13:07 dgcruzing