Parallel document loading with multiprocessing Pool
Speeds up loading of many documents by loading several of them in parallel using a thread pool
Everything looks great!
Thanks
It was missing an int cast for the case where the value comes from the env file. Can you re-review? Thanks.
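For anyone following along, here is a minimal sketch of the kind of cast involved. Values read from the environment are always strings, so a worker count has to be converted before it is handed to a pool. The variable name `INGEST_WORKERS` and the default of 4 are hypothetical and are not taken from this PR.

```python
import os


def read_worker_count(env=None, name="INGEST_WORKERS", default=4):
    """Return the worker count as an int.

    `name` and `default` are hypothetical, chosen only for illustration.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    # Environment values are always str; without this cast a pool
    # constructor would receive "8" instead of 8.
    return int(raw)
```

A non-numeric value raises `ValueError` here, which is probably preferable to silently falling back to the default.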
Any performance improvement you see will depend heavily on the type of media where the document files are stored. Reading multiple documents at the same time from rotating media will tend to be much slower than reading them one at a time, because moving from one file (disk location) to another is very expensive. Most SSDs, while much faster, have a similar issue to a lesser extent because they have a limited number of buffers for handling concurrent I/O. The part that can be profitably parallelized is the chunking, which is a pure in-memory operation with no latency gotcha.
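A rough sketch of that split, assuming plain-text files and a naive fixed-size chunker: the files are read one at a time, and only the chunking is fanned out to a `multiprocessing.Pool`. The names `chunk_text` and `load_and_chunk` are hypothetical and are not this project's actual loader or splitter.

```python
from functools import partial
from multiprocessing import Pool
from pathlib import Path


def chunk_text(text, chunk_size=500):
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def load_and_chunk(paths, chunk_size=500, workers=1):
    """Read files sequentially, then chunk them (optionally in parallel)."""
    # Read one file at a time so the storage device never has to
    # service competing requests.
    texts = [Path(p).read_text(encoding="utf-8") for p in paths]
    split = partial(chunk_text, chunk_size=chunk_size)
    if workers <= 1:
        return [split(t) for t in texts]
    # Chunking is pure in-memory work, so it can be spread across processes.
    with Pool(processes=workers) as pool:
        return pool.map(split, texts)
```

On platforms that start worker processes with `spawn` (Windows, and macOS by default), the call with `workers > 1` has to sit under an `if __name__ == "__main__":` guard. Whether the parallel path is actually faster depends on how expensive the real chunker is compared with the cost of sending the text to the workers.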
https://github.com/imartinez/privateGPT/pull/292 is merged in master so this PR is now obsolete. One thing we can take from this for the future is a user option (for both CLI and .env) to select the number of threads.
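One possible shape for such an option, as a sketch only: a CLI flag takes precedence over the environment value, which takes precedence over a built-in default. The flag `--workers`, the variable `INGEST_WORKERS`, and the default of 4 are all hypothetical. The sketch also assumes the `.env` file has already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`).

```python
import argparse
import os


def resolve_workers(argv=None, env=None, default=4):
    """Pick the worker count: CLI flag, then environment, then default.

    `--workers` and `INGEST_WORKERS` are hypothetical names.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--workers", type=int, default=None,
                        help="number of parallel workers")
    args = parser.parse_args(argv)
    if args.workers is not None:
        return max(1, args.workers)
    raw = env.get("INGEST_WORKERS")
    if raw:
        # Environment values are strings, so cast before use.
        return max(1, int(raw))
    return default
```

Clamping to at least 1 keeps a value like `0` from reaching the pool constructor, which would otherwise reject it.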
Thanks. I think it's better to open a new PR with the user option feature.