Parallel file loading with DirectoryLoader

Proposal to optionally parallelize file loading with DirectoryLoader.

This uses the process_map function from tqdm (tqdm.contrib.concurrent), which uses ProcessPoolExecutor underneath. Using process_map instead of the native ProcessPoolExecutor is the safest way to have the progress bar also work in parallel.

Regarding the tqdm dependency, I am not sure which way to go as there are two alternatives:

Make tqdm a required dependency and use the same code to handle both the parallel and non-parallel scenarios (the current state of this PR). This avoids code duplication.
Keep the non-parallel execution path separate and use tqdm only for parallel loading and/or showing a progress bar. This would result in some code duplication.
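
To make the second alternative concrete, here is a minimal sketch of what an optional tqdm dependency could look like. It is not the code in this PR: `load_file` and `load_all` are hypothetical names used only for illustration, and the fallback to a plain ProcessPoolExecutor when tqdm is missing is an assumption about how the duplication might be kept small.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import List

try:
    # Optional dependency: only needed for the progress bar.
    from tqdm.contrib.concurrent import process_map
except ImportError:
    process_map = None


def load_file(path: str) -> str:
    """Hypothetical per-file loader (not the DirectoryLoader internals).

    It is defined at module level so worker processes can pickle it.
    """
    return Path(path).read_text(encoding="utf-8")


def load_all(paths: List[str], parallel: bool = False, max_workers: int = 4) -> List[str]:
    """Load every file, optionally in parallel, preserving input order."""
    if not parallel:
        # Sequential path: no tqdm and no worker processes involved.
        return [load_file(p) for p in paths]
    if process_map is not None:
        # tqdm drives a ProcessPoolExecutor and updates one shared progress bar.
        return process_map(load_file, paths, max_workers=max_workers, chunksize=1)
    # tqdm is not installed: same parallelism, just without a progress bar.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_file, paths))
```

When calling `load_all(paths, parallel=True)` from a script, the call should sit under an `if __name__ == "__main__":` guard, since platforms that spawn worker processes re-import the main module.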

This PR makes more sense as a complement to #4481 which can result in substantial loading time speedups for big folders.

Looking for some early feedback before going further.

Some benchmarks:

Tested with 559 files (33 MB total) on an 8-core i7.

parallel=False,autodetect_encoding=True

%time docs = loader.load()
100%|##########| 559/559 [00:25<00:00, 21.82it/s]
100%|##########| 559/559 [00:25<00:00, 21.78it/s]
CPU times: user 438 ms, sys: 160 ms, total: 598 ms
Wall time: 26.1 s

parallel=True,autodetect_encoding=True

%time docs = loader.load()
100%|##########| 559/559 [00:06<00:00, 91.08it/s]
100%|##########| 559/559 [00:06<00:00, 89.05it/s]
CPU times: user 419 ms, sys: 252 ms, total: 672 ms
Wall time: 6.73 s
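
For a quick read of the two wall times above, the speedup and a rough per-core efficiency can be computed as below. This is only arithmetic on the reported numbers; treating the "8 core" figure as the number of usable workers is an assumption, and the helper names are made up for this sketch.

```python
def speedup(sequential_s: float, parallel_s: float) -> float:
    """Ratio of sequential wall time to parallel wall time."""
    return sequential_s / parallel_s


def parallel_efficiency(sequential_s: float, parallel_s: float, cores: int) -> float:
    """Speedup divided by the number of cores (1.0 would be perfect scaling)."""
    return speedup(sequential_s, parallel_s) / cores


# Wall times reported in the benchmarks: 26.1 s sequential, 6.73 s parallel.
s = speedup(26.1, 6.73)
e = parallel_efficiency(26.1, 6.73, cores=8)  # assumes all 8 cores were usable
print(f"speedup: {s:.2f}x, efficiency on 8 cores: {e:.0%}")
# prints "speedup: 3.88x, efficiency on 8 cores: 48%"
```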

Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

@eyurtsev

May 12 '23 22:05 blob42

@blob42 regarding the tqdm question, you can add tests to the unit test folder and use the @pytest.mark.requires('tqdm') decorator. It'll skip the test when running core tests, but pick up the test for extended testing (which has tqdm as a dep).

May 13 '23 14:05 eyurtsev

Have to run right now, will try to finish later today or tomorrow! Thanks for the benchmarking!

May 13 '23 14:05 eyurtsev

#4650 resolves this PR

May 17 '23 15:05 blob42
