[multiresolutionimageinterface] Speed up patch loading
Hi, I am attempting to write an as-fast-as-possible (TensorFlow/Python) dataloader for WSI patches. I searched the issues for keywords like "fast", "speed", and "accelerate", but did not find any best practices.
This is what I have tried for the CAMELYON16 dataset. Maybe the maintainers/community can provide some insights?
# Import the ASAP lib first!
import sys
sys.path.append('C:\\Program Files\\ASAP 2.1\\bin')
import multiresolutionimageinterface as mir
import numpy as np

reader = mir.MultiResolutionImageReader()

# Step 1 - Loop over random anchor points "pre-selected" from whole-slide images
# res = {patient_key1: {KEY_POINTS: [[x1, y1], [x2, y2], ...]}}
patch_width = ...
patch_height = ...
patient_level = ...
for patient_key in res:
    path_img = ...
    path_mask = ...
    wsi_img = reader.open(str(path_img))
    wsi_mask = reader.open(str(path_mask))
    ds_factor = wsi_mask.getLevelDownsample(patient_level)
    # Step 2 - Loop over points for a particular patient
    for point in res[patient_key][KEY_POINTS]:
        wsi_patch_mask = np.array(wsi_mask.getUCharPatch(
            int(point[0] * ds_factor), int(point[1] * ds_factor),
            patch_width, patch_height, patient_level))
        wsi_patch_img = np.array(wsi_img.getUCharPatch(
            int(point[0] * ds_factor), int(point[1] * ds_factor),
            patch_width, patch_height, patient_level))
        yield (wsi_patch_img, wsi_patch_mask)
Full code can be found here: https://gist.github.com/prerakmody/9237b618c804ca9b99c1fd21e30de496
My concern is that I am loading many patches from the same patient (with some randomization), and once a fixed set of N patches has been loaded from a patient, I move on to the next patient. Is it possible to speed up patch loading for a single patient? Or should I load the whole image at once, even though that may lead to memory overflow?
There are a couple of things you can try:
- Use multiprocessing and get patches from several images at once.
- Sample all patches once and write them to disk in a fast format for your DL library of choice (e.g. TFRecords for TensorFlow)
- Try to prevent reading across tile boundaries: the underlying TIFF files are tiled. If you request a region that is the same size as the tile size but starts at the center of a tile, four tiles have to be read to construct the requested region. This is not always possible and depends on your use case, of course.
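For the first suggestion, a minimal sketch of fanning patch reads out over workers with the standard library (shown with threads via concurrent.futures; `load_patch` is a hypothetical stand-in for the actual `getUCharPatch` call, and in a real pipeline each worker should use its own reader handle):

```python
from concurrent.futures import ThreadPoolExecutor

def load_patch(coord):
    # Hypothetical stand-in for a reader call such as
    # wsi_img.getUCharPatch(x, y, w, h, level); returns a dummy patch.
    x, y, w, h = coord
    return [[0] * w for _ in range(h)]

def load_patches_parallel(coords, workers=4):
    # Fan the patch reads out over a pool of worker threads and
    # collect the results in the original request order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_patch, coords))

patches = load_patches_parallel([(0, 0, 8, 8), (512, 512, 8, 8)])
print(len(patches))  # 2
```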
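And for the last suggestion, a small sketch of snapping a requested origin down to the tile grid so that a tile-sized read stays within one stored tile (the tile size of 512 here is an assumption; check the actual tile size of your TIFFs):

```python
def snap_to_tile_grid(x, y, tile_size=512):
    # Round the patch origin down to the nearest tile boundary so that a
    # tile_size x tile_size read touches one stored tile, not four.
    return (x // tile_size) * tile_size, (y // tile_size) * tile_size

print(snap_to_tile_grid(700, 300))  # (512, 0)
```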
Thanks for the suggestion!
I attempted option 1 as it is feasible for my pipeline. But since .getUCharPatch() is already fast (less than 0.1 s per access (test code)), I did not obtain any significant improvements. Note that I used the tf.data.Dataset API. It looks like the overhead of multiprocessing adds more time than it saves.
Below is a histogram of 2000 patch accesses using .getUCharPatch() (at different (x, y) coordinates and on different WSIs). X-axis = time (s)
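For reference, the per-access timings behind a histogram like this can be collected with a helper along these lines (a sketch; the lambda is a dummy workload standing in for the actual `getUCharPatch` call):

```python
import time

def time_calls(fn, n=2000):
    # Record the wall-clock duration of each of n calls to fn.
    durations = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - t0)
    return durations

# Dummy workload; in the benchmark above fn would be something like
# lambda: wsi_img.getUCharPatch(x, y, patch_width, patch_height, level)
durations = time_calls(lambda: sum(range(1000)), n=50)
print(len(durations))  # 50
```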
