Training PermissionError When Accessing dataset_fingerprint.json Created by Another User

Open vmiller987 opened this issue 4 months ago • 0 comments

Hello all! I hope you've been well.

I am training someone to assist me with my ML work, and we've run into a permission error on /path/to/datasets/results/Dataset003_Liver/nnUNetTrainer__nnUNetResEncUNetMPlans__3d_fullres/dataset_fingerprint.json. If another user starts a training run and generates dataset_fingerprint.json in the results folder, I am unable to continue that training or to start training other folds.

Machine:

CPU: AMD Ryzen™ Threadripper™ PRO 7975WX 32-Core, 64-Thread Processor
Mobo: Pro WS WRX90E-SAGE SE
RAM: ~750 GB
GPUs: 7x RTX 4090, 1x RTX 5090, 2x RTX Pro 6000
OS: RHEL 9.5

(TEST) [vmiller@gluskap TEST]$ CUDA_VISIBLE_DEVICES=6 nnUNet_compile=False nnUNetv2_train 3 3d_fullres all -p nnUNetResEncUNetMPlans
Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2025-07-31 14:35:22.245488: do_dummy_2d_data_aug: False
using pin_memory on device 0
using pin_memory on device 0

This is the configuration used by this training:
Configuration name: 3d_fullres
 {'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [128, 128, 128], 'median_image_size_in_voxels': [480.0, 512.0, 512.0], 'spacing': [1.0, 0.7685546875, 0.7685546875], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.ResidualEncoderUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 320, 320], 'conv_op': 'torch.nn.modules.conv.Conv3d', 'kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'strides': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'n_blocks_per_stage': [1, 3, 4, 6, 6, 6], 'n_conv_per_stage_decoder': [1, 1, 1, 1, 1], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm3d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': True} 

These are the global plan.json settings:
 {'dataset_name': 'Dataset003_Liver', 'plans_name': 'nnUNetResEncUNetMPlans', 'original_median_spacing_after_transp': [1.0, 0.7685546875, 0.7685546875], 'original_median_shape_after_transp': [396, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'nnUNetPlannerResEncM', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 3941.0, 'mean': 100.4883041381836, 'median': 102.0, 'min': -986.0, 'percentile_00_5': -16.0, 'percentile_99_5': 198.0, 'std': 36.560123443603516}}} 

Traceback (most recent call last):
  File "/home/vmiller/work/TEST/.venv/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/nnunetv2/run/run_training.py", line 266, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/nnunetv2/run/run_training.py", line 207, in run_training
    nnunet_trainer.run_training()
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1363, in run_training
    self.on_train_start()
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 926, in on_train_start
    shutil.copy(join(self.preprocessed_dataset_folder_base, 'dataset_fingerprint.json'),
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/shutil.py", line 436, in copy
    copymode(src, dst, follow_symlinks=follow_symlinks)
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/shutil.py", line 317, in copymode
    chmod_func(dst, stat.S_IMODE(st.st_mode))
PermissionError: [Errno 1] Operation not permitted: '/opt/datasets/FCT/results/Dataset003_Liver/nnUNetTrainer__nnUNetResEncUNetMPlans__3d_fullres/dataset_fingerprint.json'
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 108, in results_loop
    item = in_queue.get()
           ^^^^^^^^^^^^^^
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_sharer.py", line 58, in detach
Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    return reduction.recv_handle(conn)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/reduction.py", line 189, in recv_handle
    self.run()
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012, in run
    return recvfds(s, 1)[0]
           ^^^^^^^^^^^^^
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/reduction.py", line 157, in recvfds
    self._target(*self._args, **self._kwargs)
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_SPACE(bytes_size))
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer
    raise e
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 108, in results_loop
    item = in_queue.get()
           ^^^^^^^^^^^^^^
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/connection.py", line 525, in Client
(TEST) [vmiller@gluskap TEST]$

Initially I tried chmod 777 on dataset_fingerprint.json to see whether that would fix it, but it did not work.

(TEST) [vmiller@gluskap nnUNetTrainer__nnUNetResEncUNetMPlans__3d_fullres]$ ls -l
total 100
-rw-rw-r--+ 1 sgoodridge opt 19806 Jul 31 14:35 dataset_fingerprint.json
-rwxrwxrwx+ 1 sgoodridge opt   403 Jul 31 14:48 dataset.json
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 30 16:32 fold_0
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 30 16:32 fold_1
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 30 16:33 fold_2
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 30 16:33 fold_3
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 30 16:34 fold_4
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 31 14:48 fold_all
-rwxrwxrwx+ 1 sgoodridge opt 16343 Jul 31 14:48 plans.json
(TEST) [vmiller@gluskap nnUNetTrainer__nnUNetResEncUNetMPlans__3d_fullres]$ sudo chmod 777 dataset_fingerprint.json 
[sudo] password for vmiller: 
(TEST) [vmiller@gluskap nnUNetTrainer__nnUNetResEncUNetMPlans__3d_fullres]$ ls -l
total 100
-rwxrwxrwx+ 1 sgoodridge opt 19806 Jul 31 14:35 dataset_fingerprint.json
-rwxrwxrwx+ 1 sgoodridge opt   403 Jul 31 14:48 dataset.json
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 30 16:32 fold_0
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 30 16:32 fold_1
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 30 16:33 fold_2
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 30 16:33 fold_3
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 30 16:34 fold_4
drwxrwsrwx+ 2 sgoodridge opt  4096 Jul 31 14:48 fold_all
-rwxrwxrwx+ 1 sgoodridge opt 16343 Jul 31 14:48 plans.json
(TEST) [vmiller@gluskap nnUNetTrainer__nnUNetResEncUNetMPlans__3d_fullres]$ CUDA_VISIBLE_DEVICES=6 nnUNet_compile=False nnUNetv2_train 3 3d_fullres all -p nnUNetResEncUNetMPlans
Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2025-07-31 14:50:38.716076: do_dummy_2d_data_aug: False
using pin_memory on device 0
using pin_memory on device 0

This is the configuration used by this training:
Configuration name: 3d_fullres
 {'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [128, 128, 128], 'median_image_size_in_voxels': [480.0, 512.0, 512.0], 'spacing': [1.0, 0.7685546875, 0.7685546875], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.ResidualEncoderUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 320, 320], 'conv_op': 'torch.nn.modules.conv.Conv3d', 'kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'strides': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'n_blocks_per_stage': [1, 3, 4, 6, 6, 6], 'n_conv_per_stage_decoder': [1, 1, 1, 1, 1], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm3d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': True} 

These are the global plan.json settings:
 {'dataset_name': 'Dataset003_Liver', 'plans_name': 'nnUNetResEncUNetMPlans', 'original_median_spacing_after_transp': [1.0, 0.7685546875, 0.7685546875], 'original_median_shape_after_transp': [396, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'nnUNetPlannerResEncM', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 3941.0, 'mean': 100.4883041381836, 'median': 102.0, 'min': -986.0, 'percentile_00_5': -16.0, 'percentile_99_5': 198.0, 'std': 36.560123443603516}}} 

Traceback (most recent call last):
  File "/home/vmiller/work/TEST/.venv/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/nnunetv2/run/run_training.py", line 266, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/nnunetv2/run/run_training.py", line 207, in run_training
    nnunet_trainer.run_training()
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1363, in run_training
    self.on_train_start()
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 926, in on_train_start
    shutil.copy(join(self.preprocessed_dataset_folder_base, 'dataset_fingerprint.json'),
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/shutil.py", line 436, in copy
    copymode(src, dst, follow_symlinks=follow_symlinks)
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/shutil.py", line 317, in copymode
    chmod_func(dst, stat.S_IMODE(st.st_mode))
PermissionError: [Errno 1] Operation not permitted: '/opt/datasets/FCT/results/Dataset003_Liver/nnUNetTrainer__nnUNetResEncUNetMPlans__3d_fullres/dataset_fingerprint.json'
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 108, in results_loop
    item = in_queue.get()
           ^^^^^^^^^^^^^^
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vmiller/work/TEST/.venv/lib/python3.12/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/home/vmiller/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
(TEST) [vmiller@gluskap nnUNetTrainer__nnUNetResEncUNetMPlans__3d_fullres]$
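
A likely explanation, judging from the traceback: the error is raised inside shutil.copymode, i.e. in the chmod call on the destination file, not while writing the file contents. On Linux, changing a file's mode is only permitted for the file's owner (or root), no matter how permissive the mode bits are, which would explain why chmod 777 does not help. A minimal sketch for checking this on a given file follows; describe_access is a hypothetical helper for illustration, not part of nnU-Net.

```python
import os


def describe_access(path: str) -> dict:
    """Report who owns a file and whether the current user may write to it.

    Hypothetical diagnostic helper (not part of nnU-Net). Being able to write
    to a file does not imply being allowed to chmod it: chmod needs ownership.
    """
    st = os.stat(path)
    return {
        "owner_uid": st.st_uid,
        # True only if the effective user of this process owns the file
        "is_owner": st.st_uid == os.geteuid(),
        # True if mode bits (and ACLs, where supported) grant write access
        "can_write": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    import sys

    # Usage: python describe_access.py /path/to/dataset_fingerprint.json
    for p in sys.argv[1:]:
        print(p, describe_access(p))
```

If this reports can_write as True but is_owner as False for dataset_fingerprint.json, the data copy itself can succeed and only the subsequent chmod is rejected.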

However, if I change the owner, it does work. Also, if I delete dataset_fingerprint.json from the results folder, my training regenerates the file with me as the owner, but then the other user cannot train without doing the same.

(TEST) [vmiller@gluskap nnUNetTrainer__nnUNetResEncUNetMPlans__3d_fullres]$ sudo chown vmiller:opt dataset_fingerprint.json
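
One possible workaround on the copying side, as a sketch only: shutil.copy is shutil.copyfile followed by shutil.copymode, and only the second step needs ownership of the destination. The hypothetical helper below (the name is made up, and this is not nnU-Net's code) copies the contents and treats a failed mode copy as non-fatal. It assumes the destination is already writable by the current user, for example through group permissions or ACLs.

```python
import shutil


def copy_tolerating_foreign_owner(src: str, dst: str) -> str:
    """Copy file contents; do not fail if the mode bits cannot be copied.

    Hypothetical helper for illustration, not part of nnU-Net.
    """
    # copyfile only writes data, so write access to dst is sufficient
    shutil.copyfile(src, dst)
    try:
        # copymode calls chmod on dst, which only the owner (or root) may do
        shutil.copymode(src, dst)
    except PermissionError:
        # dst belongs to another user: keep its existing permission bits
        pass
    return dst
```

With a helper like this, a second user with write access could overwrite an existing dataset_fingerprint.json without needing chown or deleting the file first.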

This isn't a critical issue, mainly a minor annoyance. It would be nice if we could train as multiple users without having to deal with file ownership issues.

Jul 31 '25 19:07 vmiller987