
PersistentDataset not usable anymore (v1.5.1)?

Open SebGoll opened this issue 2 months ago • 2 comments

Since the modifications made to PersistentDataset in version 1.5.1, the persistent dataset no longer saves and loads metadata. This means we cannot invert the initial transforms or even recover the original filename (and probably a lot more).

To Reproduce
Steps to reproduce the behavior:

  1. Install MONAI version 1.5.1
  2. Load and transform an image through the PersistentDataset
  3. Re-load the image from cache
  4. Try to invert and save the image

Expected behavior
The original image and the output image should be the same.

Code used

from monai.data import PersistentDataset
from monai.transforms import Compose, Invert, LoadImage, Rotate90, SaveImage

data = ["./image.nrrd"]
transforms = Compose([LoadImage(ensure_channel_first=True), Rotate90()])
ds = PersistentDataset(data, transform=transforms, cache_dir="./cache")
_ = ds[0]  # first access computes the transform and populates the cache
reloaded_data = ds[0]  # second access reads from the cache
inverter = Invert(transforms)
inverted = inverter(reloaded_data)
saver = SaveImage("./out", output_ext=".nrrd")
saver(inverted)

Let me know if there is a way to bypass this behavior, thank you.

SebGoll avatar Oct 22 '25 08:10 SebGoll

#8566 fixed the vulnerabilities associated with the pickle module and torch.load(..., weights_only=False) by defensively serialising PyTorch tensors only. It looks like MetaTensors didn't make the cut.

https://github.com/Project-MONAI/MONAI/blob/9c6d819f97e37f36c72f3bdfad676b455bd2fa0d/monai/data/dataset.py#L211-L213

Objects are saved to cache after being converted to PyTorch tensors with the default argument of track_meta=False

https://github.com/Project-MONAI/MONAI/blob/9c6d819f97e37f36c72f3bdfad676b455bd2fa0d/monai/data/dataset.py#L401

and loaded in weights-only mode

https://github.com/Project-MONAI/MONAI/blob/9c6d819f97e37f36c72f3bdfad676b455bd2fa0d/monai/data/dataset.py#L380
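
To illustrate the constraint with a torch-only sketch (independent of MONAI): weights-only loading round-trips plain tensors fine, but rejects arbitrary pickled objects, which is why anything beyond plain tensors in a cache entry fails to load. The `PurePosixPath` below is just a stand-in for any non-allowlisted metadata object.

```python
import io
import pathlib

import torch

# Plain tensors round-trip fine under weights-only loading.
buf = io.BytesIO()
torch.save({"image": torch.zeros(3, 4)}, buf)
buf.seek(0)
print(torch.load(buf, weights_only=True)["image"].shape)  # torch.Size([3, 4])

# Arbitrary pickled objects are rejected by the restricted unpickler.
buf = io.BytesIO()
torch.save({"path": pathlib.PurePosixPath("image.nrrd")}, buf)
buf.seek(0)
try:
    torch.load(buf, weights_only=True)
except Exception as exc:
    print("blocked:", type(exc).__name__)
```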

I've tried converting with track_meta=True but that sets this off

https://github.com/Project-MONAI/MONAI/blob/9c6d819f97e37f36c72f3bdfad676b455bd2fa0d/monai/data/dataset.py#L384-L387

and recomputes the tensor every time, effectively bypassing the main benefit of caching.


While I appreciate that the security issues addressed by #8566 are severe, I feel the breaking nature of this change wasn't properly advertised. The pull request marked itself as a non-breaking change, its pre-merge check even identified this problem, and the only notice is tucked at the end of a class docstring:

https://github.com/Project-MONAI/MONAI/blob/9c6d819f97e37f36c72f3bdfad676b455bd2fa0d/monai/data/dataset.py#L214

This warranted a deprecation path in 1.5.x or, at the very least per semantic versioning conventions, a version bump to 1.6.0. This is a breaking change that fundamentally alters PersistentDataset behaviour for anyone using transforms that produce MetaTensors, i.e. most non-trivial preprocessing pipelines.

iyassou avatar Oct 22 '25 10:10 iyassou

Environment:

  • macOS 15.6.1 (24G90)
  • sys.version == '3.12.11 (main, Jul 11 2025, 22:26:01) [Clang 20.1.4 ]'
  • uv -v: uv 0.7.21 (Homebrew 2025-07-14)
  • uv add monai[all]==1.5.1

After some more digging it seems like MetaTensors are supported, but NumPy arrays aren't.

@SebGoll I've recreated your specific example using this sample .nrrd file:

from monai.data.dataset import PersistentDataset
from monai.transforms import LoadImage, Rotate90, Compose, Invert, SaveImage
from pathlib import Path

img = Path("./BallBinary30x30x30.nrrd")
transforms = Compose(
    [
        LoadImage(ensure_channel_first=True),
        Rotate90(),
    ]
)
ds = PersistentDataset([img], cache_dir=".", transform=transforms)
_ = ds[0]  # populate the cache
rotated_from_cache = ds[0]  # read back from the cache
inverter = Invert(transforms)
inverted = inverter(rotated_from_cache)
saver = SaveImage("./out", output_ext=img.suffix)
saver(inverted)

Running this example as is yields:

Corrupt cache file detected: 384d57fe2ef3e1e8844bd384282a9808.pt. Deleting and recomputing.
2025-10-22 19:43:47,591 INFO image_writer.py:197 - writing: out/BallBinary30x30x30/BallBinary30x30x30_trans.nrrd

I was able to get the cache file read by:

  1. modifying the MONAI source to add track_meta=True at this line https://github.com/Project-MONAI/MONAI/blob/9c6d819f97e37f36c72f3bdfad676b455bd2fa0d/monai/data/dataset.py#L401
  2. registering these NumPy types and TraceKeys as safe globals before using PersistentDataset:
import monai.utils
import numpy as np
import torch

torch.serialization.add_safe_globals([
    np._core.multiarray._reconstruct,
    np.ndarray,
    np.dtype,
    np.dtypes.Int64DType,
    np.dtypes.Float64DType,
    monai.utils.enums.TraceKeys,
])

I think the broader issue remains: this is a breaking change to core functionality that was introduced in a patch release without deprecation warnings or migration guidance. Users upgrading from 1.5.0 to 1.5.1 will find their caches invalidated with no clear path forward beyond manually clearing and rebuilding them.

Note: I also observed differences between the input and output .nrrd files when comparing them directly (54 KB vs 108 KB), but I think that's because the input file stores shorts while LoadImage converts to float32, with Rotate90 introducing minor floating-point precision effects in the spatial metadata.
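
The roughly 2x size difference is consistent with the dtype change alone: a 30x30x30 volume needs twice as many bytes in float32 as in int16. A quick back-of-the-envelope check (ignoring NRRD headers and any compression):

```python
import numpy as np

voxels = 30 * 30 * 30  # BallBinary30x30x30.nrrd
print(voxels * np.dtype(np.int16).itemsize)    # 54000 bytes, ~54 KB as shorts
print(voxels * np.dtype(np.float32).itemsize)  # 108000 bytes, ~108 KB as float32
```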

iyassou avatar Oct 22 '25 19:10 iyassou

Hello @SebGoll and @iyassou,

I had noticed similar issues with PersistentDataset no longer supporting MetaTensor objects, so I have submitted this PR with my solution. PersistentDataset now accepts track_meta and weights_only directly, allowing MetaTensors to be cached and read with track_meta=True and weights_only=False. The default arguments preserve the library's current behaviour, so the PR does not resolve the backwards-compatibility issue you mentioned.

mccle avatar Nov 12 '25 05:11 mccle