SystemError: null argument to internal routine
Hello,

In "Region-based training", I encountered the error below when running `TORCHDYNAMO_DISABLE=1 OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 200 3d_fullres 0 -tr nnUNetTrainer_250epochs`.
```
############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################
Using device: cuda:0
/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:164: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
2024-08-02 13:17:33.164039: do_dummy_2d_data_aug: True
2024-08-02 13:17:33.165081: Creating new 5-fold cross-validation split...
2024-08-02 13:17:33.169010: Desired fold for training: 0
2024-08-02 13:17:33.169080: This split has 42 training and 11 validation cases.
using pin_memory on device 0
using pin_memory on device 0
2024-08-02 13:17:40.763903: Using torch.compile...
/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
This is the configuration used by this training:
Configuration name: 3d_fullres
{'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [24, 256, 256], 'median_image_size_in_voxels': [40.0, 525.0, 512.0], 'spacing': [3.999644565582275, 0.390625, 0.390625], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.PlainConvUNet', 'arch_kwargs': {'n_stages': 7, 'features_per_stage': [32, 64, 128, 256, 320, 320, 320], 'conv_op': 'torch.nn.modules.conv.Conv3d', 'kernel_sizes': [[1, 3, 3], [1, 3, 3], [1, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'strides': [[1, 1, 1], [1, 2, 2], [1, 2, 2], [1, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 2]], 'n_conv_per_stage': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm3d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}, 'deep_supervision': True}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': False}
These are the global plan.json settings:
{'dataset_name': 'Dataset200_SequentialBrainStroke', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [3.9999990463256836, 0.390625, 0.390625], 'original_median_shape_after_transp': [40, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 2.5233097076416016, 'mean': 0.3446110486984253, 'median': 0.32313862442970276, 'min': -0.7245669960975647, 'percentile_00_5': -0.3643455505371094, 'percentile_99_5': 1.333372712135315, 'std': 0.29900166392326355}}}
2024-08-02 13:17:42.169218: unpacking dataset...
2024-08-02 13:17:54.758014: unpacking done...
2024-08-02 13:17:54.761492: Unable to plot network architecture: nnUNet_compile is enabled!
2024-08-02 13:17:54.856328:
2024-08-02 13:17:54.856589: Epoch 0
2024-08-02 13:17:54.856812: Current learning rate: 0.01
2024-08-02 13:19:18.072555: train_loss 0.027
2024-08-02 13:19:18.072933: val_loss -0.0589
2024-08-02 13:19:18.073018: Pseudo dice [0.375, 0.0]
2024-08-02 13:19:18.073103: Epoch time: 83.22 s
2024-08-02 13:19:18.073167: Yayy! New best EMA pseudo Dice: 0.1875
2024-08-02 13:19:20.298824:
2024-08-02 13:19:20.299044: Epoch 1
2024-08-02 13:19:20.299189: Current learning rate: 0.00996
SystemError: null argument to internal routine
SystemError: null argument to internal routine
SystemError: null argument to internal routine
SystemError: null argument to internal routine
SystemError: null argument to internal routine
SystemError: null argument to internal routine
SystemError: null argument to internal routine
SystemError: null argument to internal routine
SystemError: null argument to internal routine
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 108, in results_loop
item = in_queue.get()
File "/usr/lib64/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/torch/multiprocessing/reductions.py", line 496, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib64/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/usr/lib64/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 512, in Client
answer_challenge(c, authkey)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 761, in answer_challenge
response = connection.recv_bytes(256) # reject large message
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 220, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 418, in _recv_bytes
buf = self._recv(4)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 383, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
2024-08-02 13:20:37.668093: train_loss -0.1005
2024-08-02 13:20:37.668483: val_loss -0.1152
2024-08-02 13:20:37.668637: Pseudo dice [0.4531, 0.0]
2024-08-02 13:20:37.668776: Epoch time: 77.37 s
2024-08-02 13:20:37.668899: Yayy! New best EMA pseudo Dice: 0.1914
2024-08-02 13:20:39.946235:
2024-08-02 13:20:39.946696: Epoch 2
2024-08-02 13:20:39.946893: Current learning rate: 0.00993
Traceback (most recent call last):
File "/mnt/nvme0n1p1/scratch/env_nnunetv2/bin/nnUNetv2_train", line 33, in <module>
sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
File "/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
item = self.__get_next_item()
File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
```
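In case it helps narrow this down, this is how I would rerun with fewer background data-augmentation workers (assuming the `nnUNet_n_proc_DA` environment variable is still the right knob for that; please correct me if it is not):

```bash
# Hypothetical rerun with fewer augmentation workers; nnUNet_n_proc_DA is
# assumed to control the number of background worker processes.
nnUNet_n_proc_DA=4 TORCHDYNAMO_DISABLE=1 OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0 \
    nnUNetv2_train 200 3d_fullres 0 -tr nnUNetTrainer_250epochs
```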
Thanks in advance!