
adding prog bar to slide dataset

Open · surya-narayanan opened this issue 3 years ago · 3 comments

surya-narayanan · Dec 07 '21, 16:12

Codecov Report

Merging #248 (2d8ca8d) into dev (8298747) will increase coverage by 0.00%. The diff coverage is 100.00%.


@@           Coverage Diff           @@
##              dev     #248   +/-   ##
=======================================
  Coverage   84.46%   84.47%           
=======================================
  Files          25       25           
  Lines        2408     2409    +1     
=======================================
+ Hits         2034     2035    +1     
  Misses        374      374           
Impacted Files                 Coverage Δ
pathml/core/slide_dataset.py   71.11% <100.00%> (+0.65%) ↑

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 8298747...2d8ca8d.

codecov-commenter · Dec 08 '21, 18:12

Thanks for adding this, Surya! In general, this implementation of a tqdm progress bar won't work well with dask.distributed: the for loop that submits the jobs runs quickly (submitting each job is cheap), but the program then has to wait for the jobs to complete, which can take substantially longer. Here's an example:

import dask.distributed
from tqdm import tqdm
from time import sleep

def fun():
    # stand-in for an expensive job
    sleep(10)
    return 20

if __name__ == "__main__":

    futures = []
    results = []

    cluster = dask.distributed.LocalCluster(n_workers=1)
    client = dask.distributed.Client(cluster)

    # submitting is cheap, so this tqdm bar finishes almost immediately
    # (pure=False so each submit is a distinct task rather than being
    # collapsed into one by dask's deduplication of identical calls)
    for i in tqdm(range(20)):
        fut = client.submit(fun, pure=False)
        futures.append(fut)

    # the real waiting happens here, after the progress bar is already done
    for future, result in dask.distributed.as_completed(
            futures, with_results=True
    ):
        results.append(result)

    print(results is not None)

    client.shutdown()

If you run the above, you will see that the progress bar finishes quickly, but we still have to wait for the jobs to complete.

The progress bar in this PR does seem to work, because the for loop waits for each slide to be fully processed before moving on to the next one, but I think that blocking is actually an inefficiency in our code. It would probably be better to loop through all the slides in the dataset and submit all the jobs first, then loop back through them to finish up adding the processed tiles to h5, etc. We want to get jobs onto the cluster as soon as possible, rather than having the cluster sit idle waiting for new work.
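As a rough illustration of that submit-first pattern (a minimal sketch, not PathML code; fun here is a hypothetical stand-in for processing one slide), the progress bar can be wrapped around as_completed so it tracks job completion rather than job submission:

import dask.distributed
from tqdm import tqdm
from time import sleep

def fun(i):
    # hypothetical stand-in for processing one slide
    sleep(10)
    return i

if __name__ == "__main__":
    cluster = dask.distributed.LocalCluster(n_workers=1)
    client = dask.distributed.Client(cluster)

    # submit everything up front so the cluster is never idle
    futures = [client.submit(fun, i) for i in range(20)]

    # the bar advances as jobs actually finish, not as they are submitted
    results = []
    for future, result in tqdm(
        dask.distributed.as_completed(futures, with_results=True),
        total=len(futures),
    ):
        results.append(result)

    client.shutdown()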

I'm not sure if we should merge this without some type of test, since we may end up with a broken progress bar in the future if we refactor the code to remove the blocking in this for loop.
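For what it's worth, one possible shape for such a check (a hedged sketch only, not tied to the actual SlideDataset API; the test name and dummy iterable are hypothetical): tqdm exposes its counter as .n, so a test could drive a bar over a known number of items and assert the counter reached the expected total.

import io
from tqdm import tqdm

def test_progress_bar_reaches_total():
    # hypothetical test: drive a tqdm bar over a dummy iterable, writing
    # to an in-memory stream so nothing is printed to the terminal
    stream = io.StringIO()
    items = list(range(5))
    bar = tqdm(items, file=stream, ncols=80)
    for _ in bar:
        pass
    # the counter should match the number of items iterated
    assert bar.n == len(items)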

Thoughts? @surya-narayanan @ryanccarelli

jacob-rosenthal · Dec 16 '21, 16:12

This is above my pay grade, folks; I don't know enough about dask.distributed to comment helpfully. I leave it to you.

surya-narayanan · Dec 16 '21, 18:12