
Add Parallelization Support to `convert_all()` Function with `num_worker` Parameter

Open naufalso opened this issue 3 months ago • 2 comments

Requested feature

I propose adding a parallelization option to the convert_all() function by introducing an additional parameter, such as num_worker. This feature would let users specify the number of workers used to process conversions concurrently, significantly improving throughput for large datasets.

Currently, the convert_all() function processes documents sequentially and returns an iterator. This approach can be slow when dealing with a large number of documents. Parallelization would enable faster processing and better utilization of multi-core systems.

Proposed changes:

  • Add a num_worker parameter to the convert_all() function.
  • Modify the function to use a parallel execution library (e.g., concurrent.futures or multiprocessing) to handle multiple conversion tasks simultaneously.
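To make the proposal concrete, here is a minimal sketch of what a num_worker parameter could look like. The convert_all and _convert_one names mirror the request but are hypothetical; a stand-in function replaces docling's actual conversion, and a thread pool is used so the sketch stays self-contained (real CPU-bound conversion would likely favor a process pool, as noted in the comments):

```python
from concurrent.futures import ThreadPoolExecutor

def _convert_one(source):
    # Stand-in for the real per-document conversion
    # (in docling this would call the converter on `source`).
    return f"converted:{source}"

def convert_all(sources, num_worker=1):
    """Yield conversion results, optionally in parallel (hypothetical sketch)."""
    if num_worker <= 1:
        # Preserve the current sequential, lazy behaviour.
        for src in sources:
            yield _convert_one(src)
    else:
        # Fan the sources out across workers; pool.map preserves input order.
        # For CPU-bound conversion, ProcessPoolExecutor would be the better
        # fit, but the dispatch pattern is the same.
        with ThreadPoolExecutor(max_workers=num_worker) as pool:
            yield from pool.map(_convert_one, sources)
```

Keeping num_worker=1 as the default would leave existing callers' sequential, lazy behaviour unchanged.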

Example usage:

results = converter.convert_all(source, num_worker=4)

Alternatives

  1. Users can implement parallelization manually by creating a separate DocumentConverter instance per worker and invoking convert() from custom multiprocessing code. However, this requires extra effort and know-how that could be avoided by integrating the feature directly into the library.
  2. Continue using the current sequential approach, which may be acceptable for small datasets but is inefficient for larger ones.
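The manual workaround in alternative 1 can be sketched as follows. FakeConverter stands in for docling's DocumentConverter, and convert_in_parallel is a hypothetical helper name; a thread pool keeps the sketch runnable without docling installed, though for CPU-bound conversion a process pool would avoid the GIL:

```python
from concurrent.futures import ThreadPoolExecutor

class FakeConverter:
    """Stand-in for docling's DocumentConverter (hypothetical)."""
    def convert(self, source):
        return f"converted:{source}"

def _convert_chunk(chunk):
    # One converter instance per worker, as described in alternative 1.
    converter = FakeConverter()
    return [converter.convert(src) for src in chunk]

def convert_in_parallel(sources, num_worker=4):
    # Split the sources round-robin into one chunk per worker.
    chunks = [sources[i::num_worker] for i in range(num_worker)]
    results = []
    with ThreadPoolExecutor(max_workers=num_worker) as pool:
        for chunk_result in pool.map(_convert_chunk, chunks):
            results.extend(chunk_result)
    return results
```

This is exactly the boilerplate the proposed num_worker parameter would absorb into the library.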

naufalso · Nov 19 '24 03:11