Adding progress bar to slide dataset
Codecov Report
Merging #248 (2d8ca8d) into dev (8298747) will increase coverage by 0.00%. The diff coverage is 100.00%.
```diff
@@           Coverage Diff           @@
##              dev     #248   +/-   ##
=======================================
  Coverage   84.46%   84.47%
=======================================
  Files          25       25
  Lines        2408     2409      +1
=======================================
+ Hits         2034     2035      +1
  Misses        374      374
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| pathml/core/slide_dataset.py | 71.11% <100.00%> (+0.65%) | :arrow_up: |
Thanks for adding this Surya! So, in general this implementation of a tqdm progress bar won't work well with dask.distributed, because the for loop that submits the jobs runs quickly (the cost of submitting each job is low), but the program then has to wait for the jobs to complete, which can take substantially longer. Here's an example:
```python
import dask.distributed
from tqdm import tqdm
from time import sleep


def fun(i):
    # Simulate an expensive job.
    sleep(10)
    return 20


if __name__ == "__main__":
    futures = []
    results = []
    cluster = dask.distributed.LocalCluster(n_workers=1)
    client = dask.distributed.Client(cluster)

    # The bar only tracks this submission loop, which is cheap. Note that
    # each task needs a distinct argument: submitting a no-arg function
    # repeatedly would let dask (which treats functions as pure by default)
    # deduplicate all 20 submissions into a single task.
    for i in tqdm(range(20)):
        fut = client.submit(fun, i)
        futures.append(fut)

    # The real work happens here, long after the bar hits 100%.
    for future, result in dask.distributed.as_completed(
        futures, with_results=True
    ):
        results.append(result)
    print(results is not None)
    client.shutdown()
```
If you run the above, you will see that the progress bar finishes quickly, but we still have to wait for the jobs to complete.
The progress bar in this PR does seem to work, because the for loop waits for each slide to be fully processed before moving on. But I think that's actually an inefficiency in our code: it would probably be better to loop through all the slides in the dataset and submit all the jobs first, then loop back through them to finish up adding the processed tiles to h5, etc. (see the sketch below). We want to get jobs onto the cluster ASAP; we don't want the cluster sitting idle waiting for new jobs.
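As a rough sketch of that pattern (not a drop-in for the PR; `process_slide` and `finalize_tiles` are hypothetical stand-ins for the real per-slide processing and the h5 bookkeeping), the key idea is to submit everything up front and drive the progress bar from `as_completed`, so the bar tracks completion rather than submission:

```python
import dask.distributed
from tqdm import tqdm
from time import sleep


def process_slide(slide):
    # Hypothetical stand-in for the expensive per-slide tile processing.
    sleep(1)
    return slide


def finalize_tiles(result):
    # Hypothetical stand-in for writing processed tiles back to h5.
    pass


if __name__ == "__main__":
    cluster = dask.distributed.LocalCluster(n_workers=4)
    client = dask.distributed.Client(cluster)
    slides = list(range(16))

    # Submit every job first so the cluster never sits idle.
    futures = [client.submit(process_slide, s) for s in slides]

    # The bar now advances as jobs *finish*, not as they are submitted,
    # so it stays accurate even with non-blocking submission.
    for future, result in tqdm(
        dask.distributed.as_completed(futures, with_results=True),
        total=len(futures),
    ):
        finalize_tiles(result)

    client.shutdown()
```

A nice side effect: because `as_completed` yields results in whatever order they finish, the client-side h5 writes happen as workers free up, rather than serializing submit-wait-write for each slide.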
I'm not sure we should merge this without some kind of test, since we may end up with a broken progress bar in the future if we refactor the code to remove the blocking in this for loop.
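If we do want a cheap regression test, one option (purely a sketch; the assertion relies on tqdm's default bar format, which includes an `n/total` counter) is to capture the bar's output by passing a `file=` object and assert it actually reached its total:

```python
import io
from tqdm import tqdm


def test_progress_bar_completes():
    # Redirect tqdm's output from stderr into a buffer we can inspect.
    buf = io.StringIO()
    for _ in tqdm(range(5), total=5, file=buf):
        pass
    # A bar that ran to completion reports "5/5" in its status line.
    assert "5/5" in buf.getvalue()
```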
Thoughts? @surya-narayanan @ryanccarelli
This is above my pay grade, folks. I don't know enough about dask.distributed to comment helpfully, so I leave it to you.