private-gpt Parallel document loading with multiprocessing Pool

Speeds up loading of many documents by loading several of them in parallel using a thread pool.

May 17 '23 15:05 Fabio3rs

Everything looks great!

Thanks

It was missing an int cast when the env file is used for the data. Can you re-review? Thanks.

May 17 '23 17:05 Fabio3rs

You may find that any performance improvement you see depends heavily on the type of media where the document files are stored. Reading multiple documents at the same time from rotating media will tend to be much slower than reading them one at a time, because moving from one file (disk location) to another is very expensive on rotating media. Most SSD devices, while much faster, have a similar issue to a lesser extent, because they have a limited number of buffers to handle concurrent I/Os. The part that can be profitably parallelized is the chunking, which is a pure memory operation without a latency gotcha.
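
A minimal sketch of that split, not the code in this PR: files are read one at a time, and only the chunking is fanned out to worker processes. The names `chunk_text` and `load_and_chunk`, the fixed-size chunking scheme, and the default sizes are all assumptions made for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks (pure in-memory work).

    Hypothetical helper: the real project may chunk documents differently.
    """
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def load_and_chunk(paths: list[Path], workers: int = 4) -> list[str]:
    """Read files sequentially, then chunk their contents in parallel."""
    # One file at a time, so the storage device sees sequential access.
    texts = [p.read_text(encoding="utf-8", errors="ignore") for p in paths]
    # The CPU-bound chunking is spread across worker processes.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        chunked = pool.map(chunk_text, texts)
        # Flatten the per-document chunk lists, preserving document order.
        return [chunk for chunks in chunked for chunk in chunks]
```

On platforms that start worker processes by spawning, `load_and_chunk` should be called from under an `if __name__ == "__main__":` guard. Whether the process overhead pays off depends on how large the documents are; for small files the sequential path may well be faster.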

May 18 '23 22:05 johnbrisbin

https://github.com/imartinez/privateGPT/pull/292 is merged in master so this PR is now obsolete. One thing we can take from this for the future is a user option (for both CLI and .env) to select the number of threads.
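
A rough sketch of what such an option could look like, assuming the .env file has already been loaded into the process environment (for example by python-dotenv). The flag `--workers` and the variable `INGEST_WORKERS` are hypothetical names, not existing privateGPT settings. Note the explicit int cast on the environment value, since values read from the environment are always strings.

```python
import argparse
import os
from typing import Optional, Sequence


def resolve_workers(argv: Optional[Sequence[str]] = None) -> int:
    """Return the worker count: CLI flag first, then env var, then CPU count."""
    parser = argparse.ArgumentParser()
    # --workers is a hypothetical flag name used only for this sketch.
    parser.add_argument("--workers", type=int, default=None)
    args = parser.parse_args(argv)

    if args.workers is not None:
        workers = args.workers
    else:
        # INGEST_WORKERS is a hypothetical variable name.
        # Environment values are strings, so cast to int explicitly.
        workers = int(os.environ.get("INGEST_WORKERS", os.cpu_count() or 1))

    if workers < 1:
        raise ValueError("worker count must be at least 1")
    return workers
```

Letting the CLI flag override the environment keeps one-off runs easy while the .env file holds the default.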

May 20 '23 10:05 PulpCattel

Thanks. I think it's better to open a new PR with the user option feature.

May 20 '23 13:05 Fabio3rs