
Possible RandCropByPosNegLabeld state corruption with persistent workers

Open aarpon opened this issue 6 months ago • 1 comment

I have 512x512 image/label pairs from which I extract one 256x256 sample using RandCropByPosNegLabeld. The first epoch runs fine: I get a few warnings because some of the labels contain no foreground classes (a small fraction of the data) and pos_ratio is set to 0 before sampling is retried, but the resulting data looks correct.
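For reference, a minimal sketch of the cropping setup (the key names and pos/neg weights here are placeholders, not my exact values):

```python
from monai.transforms import RandCropByPosNegLabeld

# Placeholder setup: one 256x256 crop per 512x512 image/label pair.
crop = RandCropByPosNegLabeld(
    keys=["image", "label"],
    label_key="label",
    spatial_size=(256, 256),
    pos=1.0,  # relative weight of foreground-centered crops
    neg=1.0,  # relative weight of background-centered crops
    num_samples=1,
)
```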

From epoch 2 onwards, however, every image fails because neither positive nor negative sampling locations are found, and the transform raises a ValueError("No sampling location available.").

Now, if I set persistent_workers=False in my monai.data.DataLoaders, the issue disappears: every epoch still shows a few UserWarning: Num foregrounds 0, Num backgrounds 213220, unable to generate class balanced samples, setting 'pos_ratio' to 0., but everything works fine.
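The loader configuration looks roughly like this (a sketch; batch size and worker count are illustrative, `data_dicts` stands for my list of image/label dictionaries, and `crop` is the transform from the sketch above):

```python
from monai.data import Dataset, DataLoader

# `data_dicts` is the list of {"image": ..., "label": ...} dictionaries.
ds = Dataset(data=data_dicts, transform=crop)
loader = DataLoader(
    ds,
    batch_size=4,
    num_workers=4,
    persistent_workers=False,  # True triggers the failure from epoch 2
)
```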

It looks like the internal state of the transform is somehow corrupted when the workers are persistent.

With this in mind, I tried wrapping RandCropByPosNegLabeld in a custom MapTransform that recreates the RandCropByPosNegLabeld instance in its __call__() method before passing the data to it, but, strangely, to no avail.
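The workaround looked roughly like this (a sketch; the class name is made up):

```python
from monai.transforms import MapTransform, RandCropByPosNegLabeld

class FreshRandCropd(MapTransform):
    """Sketch of the attempted workaround (hypothetical name):
    rebuild the random crop on every call so that no internal
    state can survive across epochs. It did not help."""

    def __init__(self, keys, label_key, spatial_size, **kwargs):
        super().__init__(keys)
        self._kwargs = dict(label_key=label_key, spatial_size=spatial_size, **kwargs)

    def __call__(self, data):
        # A brand-new transform instance per sample.
        crop = RandCropByPosNegLabeld(keys=self.keys, **self._kwargs)
        return crop(data)
```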

I never saw this when I was sampling from larger images (2048x2048 or 4096x4096).

Environment

python -c "import monai; monai.config.print_debug_info()"


================================
Printing MONAI config...
================================
MONAI version: 1.4.0
Numpy version: 1.26.4
Pytorch version: 2.6.0+cu124
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: 46a5272196a6c2590ca2589029eed8e4d56ff008
MONAI __file__: /SSD/<username>/Devel/Business/<username>ponti.ch/clients/acurastem/somanet-train/.venv/lib/python3.12/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
ITK version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 5.3.2
scikit-image version: 0.25.2
scipy version: 1.15.3
Pillow version: 11.2.1
Tensorboard version: 2.19.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.21.0+cu124
tqdm version: 4.67.1
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 7.0.0
pandas version: 2.2.3
einops version: 0.8.1
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: 1.1.3
clearml version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies


================================
Printing system config...
================================
System: Linux
Linux version: Fedora Linux 42 (KDE Plasma Desktop Edition)
Platform: Linux-6.15.3-200.fc42.x86_64-x86_64-with-glibc2.41
Processor: 
Machine: x86_64
Python version: 3.12.7
Process name: python3
Command: ['/SSD/aaron/Devel/Business/aaronponti.ch/clients/acurastem/somanet-train/.venv/bin/python3', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 12
Num logical CPUs: 24
Num usable CPUs: 24
CPU usage (%): [33.5, 6.1, 4.0, 3.9, 5.6, 4.5, 3.9, 3.9, 3.9, 4.5, 3.9, 3.9, 70.8, 4.5, 4.5, 4.5, 3.9, 3.9, 3.4, 3.9, 3.4, 3.9, 3.4, 3.9]
CPU freq. (MHz): 2430
Load avg. in last 1, 5, 15 mins (%): [6.6, 7.5, 7.3]
Disk usage (%): 92.6
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 94.2
Available memory (GB): 76.6
Used memory (GB): 15.4

================================
Printing GPU config...
================================
Num GPUs: 1
Has CUDA: True
CUDA version: 12.4
cuDNN enabled: True
NVIDIA_TF32_OVERRIDE: None
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE: None
cuDNN version: 90100
Current device: 0
Library compiled for CUDA architectures: ['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
GPU 0 Name: NVIDIA GeForce RTX 3060
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 28
GPU 0 Total memory (GB): 11.6
GPU 0 CUDA capability (maj.min): 8.6

Any suggestions?

aarpon · Jun 28 '25, 18:06

@aarpon Hi, I'm not entirely clear on what you mean by pos_ratio being set to 0. Isn't it fixed before the data is transformed (via the pos and neg parameters)? Furthermore, I tried to simulate your experiment by generating random 512x512 tensors (with some being sparse or all zeros) but could not reproduce your ValueError("No sampling location available.") with persistent workers after multiple epochs.
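A rough reconstruction of that attempt (dataset size, sparsity, worker counts, and epoch count are all guesses):

```python
import torch
from monai.data import Dataset, DataLoader
from monai.transforms import RandCropByPosNegLabeld

# Random 512x512 pairs; every fourth label is all zeros (no foreground).
data = []
for i in range(8):
    label = torch.randint(0, 2, (1, 512, 512)).float()
    if i % 4 == 0:
        label = torch.zeros_like(label)
    data.append({"image": torch.rand(1, 512, 512), "label": label})

crop = RandCropByPosNegLabeld(
    keys=["image", "label"], label_key="label",
    spatial_size=(256, 256), num_samples=1,
)
ds = Dataset(data=data, transform=crop)
loader = DataLoader(ds, batch_size=2, num_workers=2, persistent_workers=True)

for epoch in range(3):
    for batch in loader:
        pass  # only the "setting 'pos_ratio' to 0" warnings appeared, no ValueError
```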

If you are still experiencing the issue, it would be great if you could paste a code sample that reproduces it.

25benjaminli · Sep 15 '25, 15:09