docling
Add Parallelization Support to `convert_all()` Function with `num_worker` Parameter
Requested feature
I propose adding a parallelization option to the `convert_all()` function by introducing an additional parameter, such as `num_worker`. This feature would allow users to specify the number of workers to process conversions concurrently, significantly improving performance for large datasets.
Currently, the `convert_all()` function processes documents sequentially and returns an iterator over the results. This approach can be slow when dealing with a large number of documents. Parallelization would enable faster processing and better utilization of multi-core systems.
Proposed changes:
- Add a `num_worker` parameter to the `convert_all()` function.
- Modify the function to use a parallel execution library (e.g., `concurrent.futures` or `multiprocessing`) to handle multiple conversion tasks simultaneously.
Example usage:
results = converter.convert_all(source, num_worker=4)
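To make the proposal more concrete, here is a rough sketch of how such a parameter could behave, built on `concurrent.futures`. This is not the library's implementation: `convert_all_parallel` and `convert_one` are hypothetical names, and using threads rather than processes is an assumption that only pays off if the per-document conversion releases the GIL or is I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

S = TypeVar("S")
R = TypeVar("R")


def convert_all_parallel(
    convert_one: Callable[[S], R],  # hypothetical per-document conversion callable
    sources: Iterable[S],
    num_worker: int = 1,
) -> Iterator[R]:
    """Yield conversion results in input order, using up to num_worker threads."""
    if num_worker < 1:
        raise ValueError("num_worker must be at least 1")

    if num_worker == 1:
        # Sequential path: one document at a time, lazily.
        for source in sources:
            yield convert_one(source)
        return

    # Parallel path: Executor.map submits all tasks up front and
    # yields results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=num_worker) as pool:
        yield from pool.map(convert_one, sources)
```

With `num_worker=1` the sketch keeps the lazy, one-at-a-time behavior of a plain iterator. With more workers, all sources are submitted eagerly, so memory use grows with the number of documents; a real implementation would probably want to bound the number of in-flight tasks.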
Alternatives
- Users can manually implement parallelization by creating a separate `DocumentConverter` instance for each worker and invoking `convert()` from custom multiprocessing code. However, this requires additional effort and knowledge, which could be avoided by integrating the feature directly into the library.
- Continue using the current sequential approach, which may be acceptable for small datasets but is inefficient for larger ones.