diart
Multithreading in `diart.benchmark`
Problem
Running a benchmark on a huge dataset can take a lot of time. One of the main bottlenecks is that files are processed sequentially.
Idea
Make `diart.benchmark` (and hence `diart.tune`) run concurrently on many files at once with a predefined number of workers.
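To make the idea concrete, here is a minimal sketch of a fixed-size worker pool using the standard library. `run_concurrently` and `process_file` are hypothetical names for illustration, not part of diart's API, and threads are only one possible choice of worker.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Iterable


def run_concurrently(
    files: Iterable[str],
    process_file: Callable[[str], Any],
    num_workers: int = 4,
) -> Dict[str, Any]:
    """Apply `process_file` to every file using at most `num_workers` threads."""
    results: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Map each future back to the file it was submitted for.
        futures = {pool.submit(process_file, f): f for f in files}
        # Collect results as workers finish, in completion order.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` keeps the same structure, provided `process_file` and its arguments can be pickled.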
It would be great if progress bars could be kept. Otherwise, we need to find a good way to show progress.
Another potential problem is having `N` segmentation and embedding model copies in memory, but since they're stateless there should be a workaround to share them. However, I would accept a first version with `N` models in RAM anyway and think about potential improvements afterwards.
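As a rough illustration of the sharing idea, the sketch below passes one model object to all worker threads. `StatelessModel` is a hypothetical stand-in, not a diart or pyannote class, and the sketch assumes the model never mutates its own state when called.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


class StatelessModel:
    """Hypothetical stand-in for a segmentation or embedding model."""

    def __init__(self, scale: float):
        self.scale = scale  # fixed "weights", never changed after loading

    def __call__(self, chunk: List[float]) -> List[float]:
        # Reads `self.scale` only, so concurrent calls do not interfere.
        return [self.scale * x for x in chunk]


def run_shared(chunks: List[List[float]], model: StatelessModel, num_workers: int = 4):
    """Run one shared model over all chunks from several threads."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Every thread calls the same object: one copy in memory, not N.
        return list(pool.map(model, chunks))
```

This only covers threads. Separate processes would typically each hold their own copy of the model, which is the `N` models in RAM case described above.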
See RxPY concurrency
For progress bars, see p_tqdm, tqdm with locks
Alternative: rich
There are two options for progress bars:
- A single bar where 1 iteration = 1 file (p_tqdm, rich)
- Multiple bars where 1 bar = 1 file, and 1 iteration = 1 chunk/batch (tqdm with locks)
I would accept both but strongly prefer the second. I'm sure there's also a workaround for rich.
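The two layouts can be sketched with plain `tqdm` and threads as follows. The function names and the `files` mapping (file name to number of chunks) are assumptions for illustration, and the loop body stands in for real chunk processing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional

from tqdm import tqdm


def process_file(num_chunks: int, bar: Optional[tqdm] = None) -> int:
    """Hypothetical per-file job: tick the bar once per chunk, if one is given."""
    for _ in range(num_chunks):
        if bar is not None:
            bar.update(1)
    return num_chunks


def single_bar(files: Dict[str, int], num_workers: int = 2) -> List[int]:
    """Option 1: a single bar where 1 iteration = 1 file."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(process_file, n) for n in files.values()]
        # The bar advances each time the next file's result is available.
        return [f.result() for f in tqdm(futures, desc="files")]


def bar_per_file(files: Dict[str, int], num_workers: int = 2) -> List[int]:
    """Option 2: 1 bar = 1 file, and 1 iteration = 1 chunk."""

    def job(position: int, name: str, num_chunks: int) -> int:
        # `position` pins each bar to its own terminal row.
        with tqdm(total=num_chunks, desc=name, position=position) as bar:
            return process_file(num_chunks, bar)

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [
            pool.submit(job, i, name, n) for i, (name, n) in enumerate(files.items())
        ]
        return [f.result() for f in futures]
```

Both functions return the per-file chunk counts in submission order, so they can be swapped without changing the caller.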
I've been working on this lately.
Rich works well with multithreading, but for some reason it's extremely slow to spawn new workers (maybe because of the GIL?).
When moving to multiprocessing, Rich does not work anymore with multiple bars because the instance of `Progress` can't be shared between processes. The only solution that I found for this was to use tqdm with locks.
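For reference, the tqdm-with-locks pattern looks roughly like the sketch below, which follows the multiprocessing recipe from the tqdm documentation. `process_file` and `run_all` are hypothetical names, and this is not necessarily how #124 implements it.

```python
from multiprocessing import Pool, RLock
from typing import List, Tuple

from tqdm import tqdm

Job = Tuple[int, str, int]  # (bar position, file name, number of chunks)


def process_file(job: Job) -> Tuple[str, int]:
    """Hypothetical worker: one bar per file, one tick per chunk."""
    position, name, num_chunks = job
    with tqdm(total=num_chunks, desc=name, position=position) as bar:
        for _ in range(num_chunks):
            bar.update(1)  # stand-in for processing one chunk
    return name, num_chunks


def run_all(jobs: List[Job], num_workers: int = 2) -> List[Tuple[str, int]]:
    """Process all jobs in a pool whose workers share one tqdm write lock."""
    # A multiprocessing lock keeps bars from overwriting each other's rows.
    tqdm.set_lock(RLock())
    with Pool(
        num_workers,
        initializer=tqdm.set_lock,  # install the shared lock in each worker
        initargs=(tqdm.get_lock(),),
    ) as pool:
        return pool.map(process_file, jobs)
```

When used from a script, `run_all` should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.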
Whenever multiprocessing is not needed, rich is used by default. I'm also implementing it in a way that users can manually choose the progress bar they want.
Implemented in #124