Parallel document loading with multiprocessing Pool

Open Fabio3rs opened this issue 2 years ago • 2 comments

Speeds up loading of many documents by loading several of them in parallel using a thread pool.

Fabio3rs avatar May 17 '23 15:05 Fabio3rs

Everything looks great!

Thanks

It was missing an int cast when the env file is used for the data. Can you re-review? Thanks.

Fabio3rs avatar May 17 '23 17:05 Fabio3rs

You may find that any performance improvement depends heavily on the type of media where the document files are stored. Reading multiple documents at the same time from rotating media tends to be much slower than reading them one at a time, because moving from one file (disk location) to another is very expensive. Most SSDs, while much faster, have a similar issue to a lesser extent, because they have a limited number of buffers to handle concurrent I/Os. The part that can be profitably parallelized is the chunking, which is a pure memory operation without a latency gotcha.
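As a rough illustration of that split (sequential reads, parallel chunking), here is a minimal sketch. It is not the code in this PR: `chunk_text`, `load_and_chunk`, and the chunk sizes are hypothetical names and values, and it assumes plain UTF-8 text files rather than the project's actual document loaders.

```python
from multiprocessing import Pool
from pathlib import Path


def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping fixed-size chunks (pure in-memory work)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def load_and_chunk(paths, workers=4):
    """Hypothetical helper: read files one by one, then chunk them in parallel."""
    # Read sequentially so the storage device is never asked for
    # several files at the same time.
    texts = [Path(p).read_text(encoding="utf-8") for p in paths]
    # Chunking only touches memory, so hand it to a pool of worker processes.
    with Pool(processes=workers) as pool:
        return pool.map(chunk_text, texts)
```

`load_and_chunk(["a.txt", "b.txt"])` would return one list of chunks per file. On platforms that spawn rather than fork worker processes, the call needs to sit under an `if __name__ == "__main__":` guard.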

johnbrisbin avatar May 18 '23 22:05 johnbrisbin

https://github.com/imartinez/privateGPT/pull/292 is merged in master so this PR is now obsolete. One thing we can take from this for the future is a user option (for both CLI and .env) to select the number of threads.
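A possible shape for such an option, as a sketch only: the flag name `--workers` and the variable name `INGEST_WORKERS` are made up for illustration and are not existing privateGPT settings. The CLI value wins over the .env value, and the .env value is cast to int because environment values are always strings.

```python
import argparse
import os


def resolve_workers(argv=None, env=None):
    """Pick the worker count: CLI flag first, then env var, then CPU count."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # Hypothetical flag name; argparse does the int conversion for the CLI value.
    parser.add_argument("--workers", type=int, default=None)
    args = parser.parse_args(argv)

    if args.workers is not None:
        workers = args.workers
    elif env.get("INGEST_WORKERS"):
        # Hypothetical variable name; env values are strings, so cast explicitly.
        workers = int(env["INGEST_WORKERS"])
    else:
        workers = os.cpu_count() or 1

    if workers < 1:
        raise ValueError("worker count must be at least 1")
    return workers
```

With this, `resolve_workers(["--workers", "3"])` ignores the environment, while `resolve_workers([])` falls back to `INGEST_WORKERS` and then to the CPU count.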

PulpCattel avatar May 20 '23 10:05 PulpCattel

#292 is merged in master so this PR is now obsolete. One thing we can take from this for the future is a user option (for both CLI and .env) to select the number of threads.

Thanks. I think it's better to open a new PR with the user option feature.

Fabio3rs avatar May 20 '23 13:05 Fabio3rs