LocalDocs indexing of 10 small PDFs generates 500 MB/s of writes for 30 minutes
System Info
GPT4All v2.5.4, Windows 11, Mistral OpenOrca model
Reproduction
Start indexing a handful of PDFs (11 PDFs, 60 MB in total).
The resulting localdocs_v1.db file, once indexing is done, is only about 170 MB.
However, during indexing, the system keeps rewriting the embeddings_v0.dat file over and over: it grows to about 100 MB, resets to 0, and starts again (managed by https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/embeddings.cpp).
This generates over 500 MB/s of writes for over 30 minutes of indexing (500 MB/s × 1800 s ≈ 900 GB). That is almost a terabyte of data written to index 60 MB of PDFs, even though the resulting LocalDocs index file is only 170 MB.
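To make the suspected pattern concrete, here is a minimal C++ sketch, assuming the indexer flushes the entire embeddings buffer to disk after every new chunk. The Embedding layout, the dimension, and the saveAll helper are hypothetical and not the actual embeddings.cpp API; this is just an illustration of how that write pattern produces quadratic disk traffic:

```cpp
// Hypothetical sketch of the write pattern the disk activity suggests:
// if the full embeddings buffer is flushed to embeddings_v0.dat after
// every chunk, total bytes written grow quadratically with chunk count.
#include <cstdio>
#include <vector>

struct Embedding { float values[384]; }; // dimension is an assumption

// Suspected pattern: truncate and rewrite the whole file on every insert.
void saveAll(const std::vector<Embedding>& all, const char* path) {
    std::FILE* f = std::fopen(path, "wb"); // truncates, rewrites everything
    if (!f) return;
    std::fwrite(all.data(), sizeof(Embedding), all.size(), f);
    std::fclose(f);
}

int main() {
    std::vector<Embedding> embeddings;
    long long bytesWritten = 0;
    for (int chunk = 0; chunk < 1000; ++chunk) {
        embeddings.push_back(Embedding{});
        saveAll(embeddings, "embeddings_v0.dat"); // n-th call writes n records
        bytesWritten += static_cast<long long>(embeddings.size()) * sizeof(Embedding);
    }
    // Final file holds n records, but roughly n*n/2 records were written.
    std::printf("final file: %zu records, total written: %lld bytes\n",
                embeddings.size(), bytesWritten);
    return 0;
}
```

If something like this is happening, the writes scale as roughly n²/2 records for n chunks, which would explain nearly a terabyte of traffic for a 170 MB result.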
Expected behavior
There must be a bug in the indexing: it is both too slow and generates an unreasonable amount of disk writes.
A good starting point would be to investigate why embeddings_v0.dat is being thrashed so many times.
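For contrast, a minimal sketch of the write pattern one would expect during indexing, where each new embedding is appended exactly once so total bytes written match the final file size (appendOne and the record layout are again hypothetical, not the project's actual code):

```cpp
// Hypothetical append-only alternative: open in append mode and write only
// the new record, leaving previously written embeddings untouched on disk.
#include <cstdio>

struct Embedding { float values[384]; }; // dimension is an assumption

void appendOne(const Embedding& e, const char* path) {
    std::FILE* f = std::fopen(path, "ab"); // append; no rewrite of old data
    if (!f) return;
    std::fwrite(&e, sizeof(Embedding), 1, f);
    std::fclose(f);
}
```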