
LocalDocs indexing of 10 small PDFs generates 500 MB/s of writes for 30 minutes

adam-ah opened this issue on Jan 7, 2024 · 8 comments

System Info

v2.5.4, Windows 11, Mistral OpenOrca

Reproduction

Start indexing a handful of PDFs (11 PDFs, 60 MB in total). The resulting localdocs_v1.db file, once indexing is done, is only about 170 MB.

However, during indexing, the system keeps rewriting embeddings_v0.dat over and over: the file grows to about 100 MB, then resets to 0 bytes, and the cycle starts again (managed by https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/embeddings.cpp).
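
For reference, here is a minimal sketch (hypothetical, not the actual gpt4all code) of the write pattern this behavior suggests: truncating and rewriting the whole embeddings file after every new chunk, so total bytes written grow quadratically with the number of chunks. The dimension and chunk count below are assumptions for illustration only:

```cpp
#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    const size_t kDim = 384;       // assumed embedding dimension
    const size_t kChunks = 1000;   // assumed number of text chunks
    std::vector<std::vector<float>> all;

    long long bytesWritten = 0;
    for (size_t i = 0; i < kChunks; ++i) {
        all.emplace_back(kDim, 0.0f);  // stand-in for a real embedding

        // Suspected bug pattern: full truncate-and-rewrite on every save.
        // The file's observed size rises, drops to 0, and rises again.
        std::ofstream out("embeddings_v0.dat",
                          std::ios::binary | std::ios::trunc);
        for (const auto& v : all)
            out.write(reinterpret_cast<const char*>(v.data()),
                      v.size() * sizeof(float));
        bytesWritten += static_cast<long long>(all.size()) * kDim * sizeof(float);
    }
    // 1,000 chunks of ~1.5 KB each, rewritten quadratically, already
    // totals ~770 MB; at tens of thousands of chunks this reaches
    // hundreds of gigabytes.
    std::printf("total bytes written: %lld\n", bytesWritten);
    return 0;
}
```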

This generates over 500 MB/s of writes for the 30+ minutes of indexing. That is almost a terabyte of data written to index 60 MB of PDFs, even though the resulting LocalDocs index file is only 170 MB.
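
For scale, using the figures above: 500 MB/s × 30 min × 60 s/min = 900,000 MB ≈ 0.9 TB written, against a 170 MB final index, i.e. roughly 5,000× write amplification.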


Expected behavior

There must be a bug in the indexing: it is both too slow and generates an unbelievable amount of writes. A good place to start: why is embeddings_v0.dat being rewritten so many times?
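
One plausible direction (a sketch under assumptions, not a patch against the actual embeddings.cpp or its storage format): open the file in append mode and write only the new vectors, so each embedding is persisted exactly once and total I/O is O(N) instead of O(N²):

```cpp
#include <fstream>
#include <vector>

// Hypothetical append-only save. The filename and the flat float
// layout are assumptions, not the actual gpt4all on-disk format.
void appendEmbedding(const std::vector<float>& vec) {
    std::ofstream out("embeddings_v0.dat",
                      std::ios::binary | std::ios::app);
    out.write(reinterpret_cast<const char*>(vec.data()),
              vec.size() * sizeof(float));
}
```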

adam-ah · Jan 07 '24