Image Functions
As discussed with @LucaMarconato @melonora and @josenimo, I think it would be great to extend the functionality and ease of use of spatialdata by adding a few functions. When I first started using spatialdata I ran into a few issues, such as napari crashing due to image size, difficult image loading, and overall laborious image handling. The functions I suggest implementing are as follows:
- Image loader for universal file formats such as JPG, PNG, TIFF and OME-TIFF -> from my testing, skimage.imread is able to handle all four of these formats; maybe a simple addition to `Image2DModel`, or another simple function that loads any image into a dask array (a sketch follows below this list)?
- Image size/available GPU memory check -> warn the user if the image size exceeds VRAM, since big images may cause napari to stutter/crash -> @melonora suggested the `vispy.gloo.gl` package
- A check/hook to verify that the image has been written to Zarr before opening it in napari
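A minimal sketch of what such a loader could look like, assuming `scikit-image`, `dask` and `spatialdata` are installed. The function name `read_generic_image`, the chunk size, and the axis handling are illustrative assumptions, not an existing API:

```python
import dask.array as da
from skimage.io import imread
from spatialdata.models import Image2DModel


def read_generic_image(path: str, chunk_size: int = 4096):
    """Read a JPG/PNG/TIFF/OME-TIFF image and parse it into a SpatialData image model."""
    # skimage.io.imread loads the full array into memory; dask_image.imread could be
    # substituted for lazier loading of large TIFF stacks
    img = imread(path)
    if img.ndim == 2:
        img = img[None, :, :]  # add a channel axis -> (c, y, x)
    else:
        img = img.transpose(2, 0, 1)  # assume (y, x, c) -> (c, y, x)
    darr = da.from_array(img, chunks=(img.shape[0], chunk_size, chunk_size))
    return Image2DModel.parse(darr, dims=("c", "y", "x"))
```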
Example of a CPU/RAM (and GPU memory) check:
```python
import dask_image.imread
import psutil


def estimate_memory_requirements(dask_array):
    """Estimate the memory needed to fully materialize a dask array, in bytes."""
    # total number of elements times the size of each element in bytes
    return dask_array.size * dask_array.dtype.itemsize


def check_system_resources(memory_required):
    """Compare the required memory against available RAM and, if possible, GPU memory."""
    # check available RAM
    available_ram = psutil.virtual_memory().available
    # assume approximately the same amount of GPU memory is needed
    try:
        import pynvml

        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        available_gpu_memory = pynvml.nvmlDeviceGetMemoryInfo(handle).free
        pynvml.nvmlShutdown()
    except Exception:
        available_gpu_memory = None  # pynvml is not installed or no NVIDIA GPU is available

    ram_sufficient = memory_required <= available_ram
    # if GPU memory cannot be queried, report None instead of guessing
    gpu_sufficient = None if available_gpu_memory is None else memory_required <= available_gpu_memory
    return ram_sufficient, gpu_sufficient, available_ram, available_gpu_memory


def load_and_check_image(image_path):
    dask_array = dask_image.imread.imread(image_path)
    memory_required = estimate_memory_requirements(dask_array)
    ram_sufficient, gpu_sufficient, available_ram, available_gpu_memory = check_system_resources(memory_required)

    if not ram_sufficient:
        print(
            f"\U00002757 Warning: Not enough RAM. Required: {memory_required / (1024**3):.2f} GB, "
            f"Available: {available_ram / (1024**3):.2f} GB"
        )
    if gpu_sufficient is False:
        print(
            f"\U00002757 Warning: Not enough GPU memory. Required: {memory_required / (1024**3):.2f} GB, "
            f"Available: {available_gpu_memory / (1024**3):.2f} GB"
        )
    if ram_sufficient and (gpu_sufficient is None or gpu_sufficient):
        print("\U00002705 System resources are sufficient to handle the image load.")
    else:
        print("\U0000274C System resources are insufficient to handle the image load. Downscaling recommended.")
    return dask_array
```
Hi, thanks for tracking the discussion in a GitHub issue.
> Image Loader for universal file formats such as JPG, PNG, TIFF, Ome-TIFF -> from my testing skimage.imread is able to handle all 4 of these formats, maybe a simple addition to Image2DModel or another simple function that loads any image into a dask array?
Yes, that would be very convenient. But I think it is better to have this in `spatialdata-io` so we keep the models minimalistic. Currently we make an exception to this rule and allow parsing geojson files with the `ShapesModel`; I'd also move this to `spatialdata-io`, so that `spatialdata` only reads `.zarr` files and all the other extensions are handled by `spatialdata-io`. The rationale is to avoid a maintenance burden due to edge cases in different file extensions. On the other hand, the file extensions that you mentioned are very universal, so they could fit the image models. @giovp, comments on this?
Summary:
- [ ] (proposing to) add IO convenience functions for common extensions in `spatialdata-io`
- [ ] (consider to) move the `.geojson` parser to `spatialdata-io` so that `spatialdata` only deals with `.zarr` files
> A Check/hook to verify that the image has been written to zarr before opening in napari

I wonder where we could put this check, because it doesn't just involve `napari-spatialdata`: every operation would be slow (`spatialdata-plot`, query operations, etc.) if large image data is not saved (unless the user really wants that for their specific use case). Maybe we could add this check as a private API in `spatialdata` and then have `napari-spatialdata`, `spatialdata-plot` and some `spatialdata` APIs operate on that. Alternatively, we could call this function and warn the user when `print(sdata)` is called; maybe that is better.
Summary:
- [ ] add an internal API that warns the user that the data is not read from disk
- [ ] consider either calling this API when `print(sdata)` is called, or in `spatialdata-plot` and `napari-spatialdata`. Probably the first is better.
The function would do the following:

- check the image and raster data;
- the warning would be displayed only if an image/labels element:
  - is too big (checking `.shape`; here `.chunks` would not be important),
  - AND is not backed by a Zarr store (this information can be obtained by calling `get_dask_backing_files` and checking whether the backing files are a valid Zarr store);
- the user can disable the warning via a global flag (see the sketch below).
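A hedged sketch of what this internal check could look like. `LARGE_IMAGE_SIZE_THRESHOLD`, `WARN_ON_UNBACKED_DATA` and `_warn_if_unbacked_and_large` are hypothetical names, the Zarr detection is a simplified heuristic, the code assumes a single-scale element exposing `.shape`, and the exact import path of `get_dask_backing_files` may differ:

```python
import warnings

import numpy as np
from spatialdata import get_dask_backing_files  # import path assumed

# hypothetical, user-overridable constant defining what counts as "big" (number of elements)
LARGE_IMAGE_SIZE_THRESHOLD = 10_000 * 10_000
# hypothetical global flag so the user can silence the warning
WARN_ON_UNBACKED_DATA = True


def _warn_if_unbacked_and_large(element) -> None:
    """Warn when a raster element is large and not backed by a Zarr store."""
    if not WARN_ON_UNBACKED_DATA:
        return
    n_elements = int(np.prod(element.shape))
    if n_elements <= LARGE_IMAGE_SIZE_THRESHOLD:
        return
    # get_dask_backing_files() returns the files backing the dask graph; here we only
    # check, heuristically, whether they point into a .zarr store
    backing_files = get_dask_backing_files(element)
    backed_by_zarr = any(".zarr" in str(f) for f in backing_files)
    if not backed_by_zarr:
        warnings.warn(
            "This image/labels element is large and not backed by a Zarr store; "
            "napari-spatialdata, spatialdata-plot and query operations may be slow. "
            "Consider writing the SpatialData object to disk first.",
            UserWarning,
        )
```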
> Image size/available GPU memory check -> warn the user if the image size exceeds VRAM since big images may cause napari to stutter/crash -> @melonora suggested the vispy.gloo.gl package

For the point above we need a number to know when an image/labels element is too big. Here I would recommend keeping things simple and just having a reasonable constant that the user can choose, rather than an automatic way to infer this number. The rationale is as follows:
- not every machine has an NVIDIA graphics card;
- even if one has an NVIDIA graphics card, one may be using another graphics card, may not be interested in visualization, or may actually have less memory available because, for instance, most of the VRAM is taken up by another job;
- even if a machine has little RAM, one may want to compensate for this using swap memory or another strategy.

Also, I imagine the code would not be portable to the new Apple Silicon architecture, where there is no NVIDIA graphics card and the RAM/VRAM is unified memory handled by the OS.
Summary:
- [ ] better to start by keeping things simple and have a single constant that determines what is "big"; in the future, consider some code to automatically infer this.
Final comment:
- [ ] we should add a warning in the `.parse()` of the raster models when the raster data has a large size but the user didn't specify the `scale_factors` argument or has large values for `.chunks`. That is likely a mistake and would lead to slow performance even if the data is written to disk.
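A hedged sketch of the intended behaviour (this is not the actual `Image2DModel.parse()` implementation; the wrapper name and the threshold constant are illustrative):

```python
import warnings

import dask.array as da
from spatialdata.models import Image2DModel

LARGE_IMAGE_SIZE_THRESHOLD = 10_000 * 10_000  # hypothetical constant, as above


def parse_with_size_warning(data, dims=("c", "y", "x"), scale_factors=None, **kwargs):
    """Illustration: warn when a large image is parsed without scale_factors."""
    n_elements = int(da.asarray(data).size)
    if n_elements > LARGE_IMAGE_SIZE_THRESHOLD and scale_factors is None:
        warnings.warn(
            "Parsing a large image without `scale_factors`; a single-scale image of this "
            "size will be slow to visualize. Consider e.g. scale_factors=[2, 2, 2].",
            UserWarning,
        )
    return Image2DModel.parse(data, dims=dims, scale_factors=scale_factors, **kwargs)
```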
I will nudge this, @LucaMarconato, since I still really think that this is a bottleneck for large images and for machines that are not workstations. Also, I think that an image downscaling function would be quite handy. It seems that your tutorials have very nifty low-res and mid-res versions of the raw data; being able to downscale is critical for other images, especially whole-slide imaging. I highly suggest this is added. Happy to discuss.
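As a side note on downscaling: a multiscale (pyramidal) version of a raster element can already be produced via the `scale_factors` argument of the raster models, which a downscaling convenience function could build on. A small illustrative example (the array and its size are made up):

```python
import dask.array as da
from spatialdata.models import Image2DModel

# stand-in for a large whole-slide image, shape (c, y, x)
full_res = da.random.random((3, 40_000, 40_000), chunks=(3, 4096, 4096))
# each scale factor halves the previous level: full, 1/2, 1/4, 1/8 resolution
multiscale_image = Image2DModel.parse(full_res, dims=("c", "y", "x"), scale_factors=[2, 2, 2])
```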