wip: Parallel file loading with DirectoryLoader
Proposal to optionally parallelize file loading in DirectoryLoader.
This uses tqdm's process_map function, which uses ProcessPoolExecutor underneath. Using process_map instead of ProcessPoolExecutor directly is the safest way to keep the progress bar working in parallel as well.
Regarding the tqdm dependency, I am not sure which way to go. There are two alternatives:
- Make tqdm a required dependency and use the same code to handle both the parallel and non-parallel scenarios (the current state of this PR). This avoids code duplication.
- Keep the non-parallel execution path separate and only use tqdm for parallel loading and/or showing a progress bar. This would result in some code duplication.
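To make the second alternative concrete, here is a minimal sketch of what a separate non-parallel path with a lazily imported tqdm could look like. It is not the code in this PR: `load_all`, `read_text`, and the `show_progress` flag are hypothetical names used only for illustration. The only tqdm calls assumed are `tqdm.tqdm` and `tqdm.contrib.concurrent.process_map`.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def read_text(path: str) -> str:
    """Hypothetical stand-in for the per-file loading step."""
    return Path(path).read_text(encoding="utf-8")


def load_all(
    paths: Iterable[str],
    load_one: Callable[[str], str] = read_text,
    parallel: bool = False,
    show_progress: bool = False,
) -> List[str]:
    """Load every path, importing tqdm only when it is actually needed."""
    paths = list(paths)

    if parallel:
        # Lazy import: tqdm is required only on the parallel path.
        try:
            from tqdm.contrib.concurrent import process_map
        except ImportError as exc:
            raise ImportError("parallel=True requires tqdm") from exc
        # process_map runs load_one in a ProcessPoolExecutor and draws a
        # single progress bar. load_one must be picklable, so it has to be
        # a top-level function rather than a lambda or closure.
        return process_map(load_one, paths, disable=not show_progress)

    if show_progress:
        try:
            from tqdm import tqdm
        except ImportError as exc:
            raise ImportError("show_progress=True requires tqdm") from exc
        return [load_one(p) for p in tqdm(paths)]

    # Plain sequential path: no tqdm import at all.
    return [load_one(p) for p in paths]
```

The cost of this layout is visible in the three return statements: the same mapping logic is repeated once per path, which is the code duplication the first alternative avoids.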
This PR makes more sense as a complement to #4481 which can result in substantial loading time speedups for big folders.
Looking for some early feedback before going further.
Some benchmarks:
Tested with ~559 files (33 MB total) on an 8-core i7.
parallel=False,autodetect_encoding=True
│ %time docs = loader.load()
100%|##########| 559/559 [00:25<00:00, 21.82it/s]
100%|##########| 559/559 [00:25<00:00, 21.78it/s]
CPU times: user 438 ms, sys: 160 ms, total: 598 ms
Wall time: 26.1 s
parallel=True,autodetect_encoding=True
│ %time docs = loader.load()
100%|##########| 559/559 [00:06<00:00, 91.08it/s]
100%|##########| 559/559 [00:06<00:00, 89.05it/s]
CPU times: user 419 ms, sys: 252 ms, total: 672 ms
Wall time: 6.73 s
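For a quick read of these numbers, the snippet below derives the speedup and throughput from the two wall times reported above. The efficiency figure is only a rough estimate: it assumes all 8 cores were used as workers, which the benchmark does not state.

```python
# Figures copied from the benchmark output above.
n_files = 559
wall_sequential_s = 26.1
wall_parallel_s = 6.73
cores = 8  # assumption: one worker per core

speedup = wall_sequential_s / wall_parallel_s
efficiency = speedup / cores

print(f"speedup: {speedup:.2f}x")
print(f"files/s: {n_files / wall_sequential_s:.1f} -> {n_files / wall_parallel_s:.1f}")
print(f"parallel efficiency on {cores} cores: {efficiency:.0%}")
```

This works out to a speedup of roughly 3.9x, so the gain is substantial but well short of linear in the number of cores.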
Who can review?
Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:
- @eyurtsev
@blob42 regarding the tqdm question: you can add tests to the unit test folder and use the @pytest.mark.requires('tqdm') decorator. It will skip the test when running core tests but pick it up for extended testing (which has tqdm as a dependency).
Have to run right now, will try to finish later today or tomorrow! Thanks for the benchmarking!
#4650 resolves this PR