Multithreading in `diart.benchmark`

Open juanmc2005 opened this issue 2 years ago • 3 comments

Problem

Running a benchmark on a huge dataset can take a lot of time. One of the main bottlenecks is that files are processed sequentially.

Idea

Make `diart.benchmark` (and hence `diart.tune`) process many files concurrently with a predefined number of workers. It would be great to keep the progress bars; otherwise we need to find another good way to show progress.

Another potential problem is having N copies of the segmentation and embedding models in memory, but since the models are stateless there should be a workaround to share them. However, I would accept a first version with N models in RAM anyway and think about potential improvements afterwards.

See RxPY concurrency
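
As a rough illustration of the idea, here is a minimal sketch that uses the standard library's `concurrent.futures` rather than RxPY. `benchmark_file`, `run_benchmark`, and the `model` callable are hypothetical placeholders, not diart APIs. A single stateless model object is shared by all worker threads, and progress is reported as one line per finished file.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def benchmark_file(path, model):
    """Hypothetical stand-in for running the pipeline on a single file."""
    return path, model(path)


def run_benchmark(paths, model, num_workers=4):
    """Process files concurrently with a fixed number of worker threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # The same stateless model object is passed to every task.
        futures = [pool.submit(benchmark_file, path, model) for path in paths]
        # Report progress as files finish, in completion order.
        for done, future in enumerate(as_completed(futures), start=1):
            path, score = future.result()
            results[path] = score
            print(f"[{done}/{len(futures)}] {path}")
    return results


if __name__ == "__main__":
    # `len` plays the role of a model here, purely for demonstration.
    print(run_benchmark(["a.wav", "bb.wav", "ccc.wav"], len, num_workers=2))
```

Threads make sharing one model trivial, but CPU-bound Python code will still contend for the GIL. A process pool avoids that at the cost of one model copy per worker.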

juanmc2005 avatar Aug 31 '22 09:08 juanmc2005

For progress bars, see p_tqdm, tqdm with locks

juanmc2005 avatar Sep 13 '22 09:09 juanmc2005

Alternative: rich

hbredin avatar Sep 13 '22 11:09 hbredin

There are two options for progress bars:

  1. A single bar where 1 iteration = 1 file (p_tqdm, rich)
  2. Multiple bars where 1 bar = 1 file, and 1 iteration = 1 chunk/batch (tqdm with locks)

I would accept either but strongly prefer the second. I'm sure there's also a workaround for rich.
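
A minimal sketch of the second option, assuming `tqdm` is installed and using threads for simplicity. `process_file` and `run_with_bars` are hypothetical names, and the `time.sleep` call stands in for processing one chunk. Each file gets its own bar at a fixed `position`, and a shared lock keeps concurrent bars from writing over each other.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

from tqdm import tqdm


def process_file(position, name, num_chunks):
    """Hypothetical per-file job: one bar per file, one tick per chunk."""
    with tqdm(total=num_chunks, desc=name, position=position, leave=False) as bar:
        for _ in range(num_chunks):
            time.sleep(0.001)  # placeholder for processing one chunk
            bar.update(1)
    return name, num_chunks


def run_with_bars(files, num_workers=2):
    """Run one job per file, each drawing its own progress bar."""
    # Install a single lock shared by all bars.
    tqdm.set_lock(threading.RLock())
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [
            pool.submit(process_file, position, name, num_chunks)
            for position, (name, num_chunks) in enumerate(files.items())
        ]
        return dict(future.result() for future in futures)


if __name__ == "__main__":
    print(run_with_bars({"a.wav": 30, "b.wav": 50, "c.wav": 20}))
```

With a process pool instead of threads, the tqdm documentation passes the lock to each worker through the pool initializer (`initializer=tqdm.set_lock, initargs=(tqdm.get_lock(),)`).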

juanmc2005 avatar Sep 13 '22 12:09 juanmc2005

I've been working on this lately.

Rich works well with multithreading, but for some reason it's extremely slow to spawn new workers (maybe because of the GIL?). With multiprocessing, Rich no longer works with multiple bars because the `Progress` instance can't be shared between processes. The only solution I found for this was to use tqdm with locks.

Whenever multiprocessing is not needed, rich is used by default. I'm also implementing it so that users can manually choose the progress bar they want.

juanmc2005 avatar Mar 09 '23 14:03 juanmc2005

Implemented in #124

juanmc2005 avatar Mar 10 '23 16:03 juanmc2005