
[FEA] Make CuImage pickleable

luzuku opened this issue 2 years ago • 9 comments

Is your feature request related to a problem? Please describe. CuImage is currently incompatible with PyTorch's built-in multiprocessing (specifically with DistributedDataParallel) because it cannot be pickled.

Describe the solution you'd like I'd like CuImage to be pickleable out of the box. Do you think this is possible?

Describe alternatives you've considered The only real alternative is to use PyTorch Lightning's own implementation of DDP which does not have this pickle requirement. In my case this would mean a complete re-implementation of foreign code where a lot could go wrong and I'd rather avoid that if possible.

luzuku avatar Jul 04 '22 09:07 luzuku

Hi @luzuku, thanks for your interest and question!

A CuImage object serves two purposes:

  1. A data-loader object that holds internal state such as the file name to load and an internal TIFF loader object, which is specific to the system.
  2. An array data object that holds image data loaded by the read_region() method.

In the following example, the img variable corresponds to 1) and the patch variable to 2). Calling the read_region() method on patch does not load from the file; it just returns a cropped image from the image data already in patch (variable patch2).

It is possible for CuImage to support pickle by serializing/deserializing the metadata and image data only (case 2: the patch or patch2 variables).

However, if you don't need such metadata, I think it is better to convert a CuImage object to a NumPy/CuPy array and pass that to DDP, as shown in the following example. (Converting CuImage data on the CPU (the default) to a NumPy array does not copy memory; it is zero-copy, passing only a pointer to the memory block along with the array information.)

Could you please let us know whether converting to NumPy/CuPy would solve the issue? If not, please give us a minimal concrete example or a detailed use case that prevents you from using DDP with cuCIM. I would like to know, if we are to support pickle for the CuImage class, whether serializing/deserializing the image data is enough or whether we need more than that (e.g., recreating a CuImage object with its TIFF loader object from the provided file path on a distributed system).

Thank you!

from cucim import CuImage
import cupy as cp
import numpy as np
import pickle

img: CuImage = CuImage("./notebooks/input/image.tif")
patch: CuImage = img.read_region((0, 0), (256, 256))

patch2 = patch.read_region((0, 0), (128, 128))

patch_arr1 = np.asarray(patch)
patch_arr2 = cp.asarray(patch)

with open('test.pickle', 'wb') as out_file:
    pickle.dump(patch_arr1, out_file)  # or patch_arr2

print(patch.shape)
print(patch2.shape)
# [256, 256, 3]
# [128, 128, 3]
>>> patch2.metadata
{'cucim': {'associated_images': [],
  'channel_names': ['R', 'G', 'B'],
  'coord_sys': 'LPS',
  'dims': 'YXC',
  'direction': [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
  'dtype': {'bits': 8, 'code': 1, 'lanes': 1},
  'ndim': 3,
  'origin': [0.0, 0.0, 0.0],
  'path': './notebooks/input/image.tif',
  'resolutions': {'level_count': 1,
   'level_dimensions': [[128, 128]],
   'level_downsamples': [1.0],
   'level_tile_sizes': [[128, 128]]},
  'shape': [128, 128, 3],
  'spacing': [1.0, 1.0, 1.0],
  'spacing_units': ['micrometer', 'micrometer', 'color'],
  'typestr': '|u1'}}
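As a rough illustration of the pickle support described above (case 2), the pixel data and the metadata dict could be serialized together. This is only a sketch, not an existing cuCIM API; the function names and the dict layout are assumptions:

```python
import pickle

import numpy as np


def pickle_patch(patch_arr, metadata, path):
    """Serialize a patch's pixel data plus its metadata dict to disk.

    patch_arr would come from np.asarray(patch) and metadata from
    patch.metadata; both helpers here are illustrative, not cuCIM API.
    """
    with open(path, "wb") as f:
        pickle.dump({"data": patch_arr, "metadata": metadata}, f)


def unpickle_patch(path):
    """Load the pixel data and metadata back from disk."""
    with open(path, "rb") as f:
        payload = pickle.load(f)
    return payload["data"], payload["metadata"]
```

Built-in pickle support on CuImage would essentially bundle these two pieces into `__getstate__`/`__setstate__` so the round trip happens transparently.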

gigony avatar Jul 06 '22 20:07 gigony

Hi @gigony, thanks for the quick reply!

My use case is basically using cuCIM/CuImage to generate random patches on the fly from ~10,000 TIFF files.

Usually one would pre-generate those patches, save them as JPEGs, and then just loop over a list of them during training. I want to skip this entirely and have a dataset like this:

import cucim
import numpy as np
import torch


class Slide:
    def __init__(self, path, region_size):
        self.image = cucim.CuImage(path)
        self.regions = self.find_all_viable_regions(region_size)
        # self.regions = [
        #    [location, size],
        #    ...
        # ]
        
    def read_region(self, *args, **kwargs):
        return np.asarray(self.image.read_region(*args, **kwargs))
        

class SlideDataset(torch.utils.data.Dataset):
    def __init__(self, slide_paths, region_size):
        self.slides = [Slide(path, region_size) for path in slide_paths]
        self.samples = self.make_samples(...)
        # self.samples = [
        #    [slide_idx, region_idx],
        #    ...
        # ]
        
    def __len__(self):
        return len(self.samples)
        
    def __getitem__(self, idx):
        slide_idx, region_idx = self.samples[idx]
        slide = self.slides[slide_idx]
        location, size = slide.regions[region_idx]
        patch = slide.read_region(location, size)
        return patch

When trying to use this with DDP I just get TypeError: cannot pickle 'cucim.clara._cucim.CuImage' object.

If you need more detail about the use case just let me know!

luzuku avatar Jul 07 '22 12:07 luzuku

IDK if you have already seen this blog post, but maybe that would be helpful here?

jakirkham avatar Jul 07 '22 16:07 jakirkham

Thanks @jakirkham! I will have a look at dask_image. But as you seem to be one of the main authors, you might already know the answers to my questions:

  1. Can it handle .svs files with any encoding?
  2. Is the JPEG2000 decoder as fast as the one from cuCIM? Because this is my biggest bottleneck.

luzuku avatar Jul 08 '22 08:07 luzuku

Ah, actually I wasn't too focused on using dask-image for this specifically (though that could be one path forward); rather, I was suggesting borrowing the technique from the blog post to simplify management of all of these TIFF files. It could be used with cuCIM, for example. I know folks who have used this strategy in fields outside of microscopy as well.

jakirkham avatar Jul 08 '22 09:07 jakirkham

Interesting, thanks for your input! I'll have to familiarize myself with dask then and see how it could possibly improve/speed up my approach. But so far I was actually pretty content with my naive handling and really just a little unsatisfied with the pickle situation. I found a workaround for my case using LightningLite, which supports DDP without the pickle requirement, but it feels messy. If it is not too much work to make CuImage pickleable, which I can't really assess, then why not do it? :shrug:

luzuku avatar Jul 08 '22 13:07 luzuku

Thanks @luzuku for sharing the example! It's a clever approach (loading patches directly, instead of saving them to the file system in an intermediate file format), and that is exactly the kind of use case cuCIM can help with!

It seems that you want to keep a CuImage object that holds the file path information and use it to load partial images (patches) of a particular whole-slide image.

Since DDP needs to send objects to other nodes via serialization/deserialization, the object being sent cannot initialize or store data that cannot be pickled. Instead, we can either

  1. store pre-processed data that is picklable, or
  2. store only the parameters for the data and do a lazy initialization when the data is used from the distributed nodes.

Since 1) cannot be applied to your use case (the data size is large), we can try the second approach; the following change is what you could try:

class Slide:
    def __init__(self, path, region_size):
-        self.image = cucim.CuImage(path)
+        self.path = path
+        self.image = None
        self.regions = self.find_all_viable_regions(region_size)
        # self.regions = [
        #    [location, size],
        #    ...
        # ]

    def read_region(self, *args, **kwargs):
        return np.asarray(self.image.read_region(*args, **kwargs))


class SlideDataset(torch.utils.data.Dataset):
    def __init__(self, slide_paths, region_size):
        self.slides = [Slide(path, region_size) for path in slide_paths]
        self.samples = self.make_samples(...)
        # self.samples = [
        #    [slide_idx, region_idx],
        #    ...
        # ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        slide_idx, region_idx = self.samples[idx]
        slide = self.slides[slide_idx]
+        if slide.image is None:
+            slide.image = cucim.CuImage(slide.path)
        location, size = slide.regions[region_idx]
        patch = slide.read_region(location, size)
        return patch
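A hypothetical variant of the same lazy-initialization idea is to keep the logic inside Slide itself via `__getstate__`, so the unpicklable handle is dropped automatically whenever the object is pickled (e.g., when DDP ships the dataset to workers). This is only a sketch; the `object()` call below is a stand-in for the real `cucim.CuImage(self.path)`:

```python
import pickle


class Slide:
    """Sketch: drop the unpicklable image handle before pickling and
    reopen it lazily on first use after unpickling."""

    def __init__(self, path):
        self.path = path
        self._image = None  # opened lazily; a CuImage handle cannot be pickled

    @property
    def image(self):
        if self._image is None:
            # In real code this would be: self._image = cucim.CuImage(self.path)
            self._image = object()  # placeholder for the unpicklable handle
        return self._image

    def __getstate__(self):
        # Exclude the handle from the pickled state; everything else survives.
        state = self.__dict__.copy()
        state["_image"] = None
        return state


clone = pickle.loads(pickle.dumps(Slide("slide.tif")))
```

With this shape, `__getitem__` never needs to check for a missing handle; accessing `slide.image` reopens the file transparently on each worker.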

This blog post (https://junyonglee.me/research/pytorch/How-to-use-LMDB-with-PyTorch-DataLoader) helped me come up with the suggestion above.

Please let me know if the above approach would work.

gigony avatar Jul 08 '22 17:07 gigony

Lazy loading is such a simple and great idea, and the overhead is minimal! Thanks, I will try it out :) Should I close this issue then, or do you still think it is worth your while to make CuImage pickleable?

luzuku avatar Jul 11 '22 09:07 luzuku

It might also be worth checking out the h5pickle project, which has some cleverness around minimizing the number of file handles that are open at a time.


Pickling CuImage objects could make sense, and I understand this could be useful in a variety of scenarios, though we would need to think about it a bit. A few tricky points that come to mind are:

  1. Pickles can be saved to disk (what does that mean, especially if the original file is removed?).
  2. Without a shared filesystem, pickling may not work between workers (the file may be missing).

We would need to figure out how we want to address those issues. I should add this isn't unique to cuCIM; other file libraries grapple with similar issues.
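One way to prototype path-based pickling (and to see exactly where the two points above bite) is a wrapper whose pickle payload is just its constructor arguments. This is a hypothetical sketch, not a proposed cuCIM API; it assumes the path resolves on the receiving side (point 2) and that the file still exists when the pickle is loaded (point 1):

```python
import pickle


class ImageRef:
    """Hypothetical wrapper that pickles as (class, (path,)), so unpickling
    re-runs __init__ on the worker instead of serializing the handle."""

    def __init__(self, path):
        self.path = path
        self._handle = None  # e.g. cucim.CuImage(path); never pickled

    def __reduce__(self):
        # The pickle stream contains only the path; the image is reopened
        # from the filesystem on the unpickling side.
        return (self.__class__, (self.path,))


clone = pickle.loads(pickle.dumps(ImageRef("image.tif")))
```

If the file has been deleted, or the workers do not share a filesystem, reopening fails at unpickle/first-use time rather than at pickle time, which is exactly the failure mode the two points describe.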

jakirkham avatar Jul 11 '22 16:07 jakirkham