nnUNet

About multi-threading and single-threading

Open a-die opened this issue 8 months ago • 2 comments

Hello, I rewrote nnUNet into a multi-task learning framework. In short, in addition to the segmentation task, there is a second task that is similar to learning the connectivity between image pixels. I rewrote the dataset, dataloader, and trainer, mainly to read one additional label. That is the background.

The problem is that the multi-threaded dataloader seems to cause a memory leak. This is the code I use:

    mt_gen_train = LimitedLenWrapper(self.num_iterations_per_epoch, data_loader=dl_tr, transform=tr_transforms,
                                     num_processes=allowed_num_processes, num_cached=6, seeds=None,
                                     pin_memory=self.device.type == 'cuda', wait_time=0.02)

The error is as follows:

    Traceback (most recent call last):
      File "/home/i/miniconda3/envs/nnUNet/bin/nnUNetv2_train", line 8, in <module>
        sys.exit(run_training_entry())
      File "/home/i/ASOCA/nnUNet/nnunetv2/run/run_training.py", line 290, in run_training_entry
        run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
      File "/home/i/ASOCA/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
        nnunet_trainer.run_training()
      File "/home/i/ASOCA/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1276, in run_training
        train_outputs.append(self.train_step(next(self.dataloader_train)))
      File "/home/i/miniconda3/envs/nnUNet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in next
        item = self.__get_next_item()
      File "/home/i/miniconda3/envs/nnUNet/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
        raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
    RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

However, no specific error message is printed above this traceback.

There is no problem when I use SingleThreadedAugmenter(dl_tr, tr_transforms) instead, but training is then very slow. Can anyone give me some ideas for troubleshooting?
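One way to narrow this down, offered only as a sketch: pull a few batches in the main process, where a Python exception cannot get lost in a background worker, and record how much traced memory is held after each batch. `probe_loader` and `toy_loader` are hypothetical names introduced for illustration; they are not part of nnUNet or batchgenerators. `tracemalloc` only sees allocations made through Python's allocator hooks, so the numbers are a rough indicator at best.

```python
import traceback
import tracemalloc


def probe_loader(batch_iter, n_batches=20):
    """Pull up to n_batches from an iterator in the current process.

    Returns (batches_ok, error_text, traced_bytes), where error_text is the
    formatted traceback of the first exception (or None) and traced_bytes
    holds the traced memory in bytes after each successful batch.
    """
    tracemalloc.start()
    batches_ok, error_text, traced_bytes = 0, None, []
    try:
        for _ in range(n_batches):
            next(batch_iter)  # the batch is discarded so it can be freed
            batches_ok += 1
            traced_bytes.append(tracemalloc.get_traced_memory()[0])
    except StopIteration:
        pass  # the iterator simply ran out of batches
    except Exception:
        error_text = traceback.format_exc()
    finally:
        tracemalloc.stop()
    return batches_ok, error_text, traced_bytes


def toy_loader():
    # Hypothetical stand-in for a real loader: two good batches, then a
    # failure like a missing extra label.
    yield {"data": [0] * 10}
    yield {"data": [0] * 10}
    raise KeyError("connectivity_label")


ok, err, mem = probe_loader(toy_loader(), n_batches=5)
print(ok, "batches succeeded")
print(err)
```

In the trainer, the same helper could presumably be given the `SingleThreadedAugmenter(dl_tr, tr_transforms)` object from the post, assuming it can be advanced with `next()`. If no exception appears and the traced memory keeps growing, one possibility (not confirmed anywhere in this thread) is that the workers are being killed by the operating system for running out of memory, which would explain the missing Python error message.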

a-die • Apr 08 '25 01:04

For training, nnUNet_compile=False helps, but I also have to set nnUNet_n_proc_DA=0 every single time, or the background workers die during training. I cannot run it without this setting.
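For reference, a minimal sketch of applying these two settings from Python rather than typing them before every command. The variable names and values are taken from the comment above. `training_env` is a hypothetical helper, and the dataset, configuration, and fold in the commented-out launch line are placeholders.

```python
import os


def training_env(n_proc_da=0, compile_model=False, base_env=None):
    """Return a copy of the environment with the two nnUNet settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    # 0 data-augmentation processes, as in the comment above.
    env["nnUNet_n_proc_DA"] = str(n_proc_da)
    # "False" mirrors the value quoted in the comment; which strings nnUNet
    # accepts as true/false is an assumption here.
    env["nnUNet_compile"] = str(compile_model)
    return env


env = training_env()
print(env["nnUNet_n_proc_DA"], env["nnUNet_compile"])

# Hypothetical launch; replace the placeholders with your own values:
# import subprocess
# subprocess.run(["nnUNetv2_train", "DATASET_ID", "CONFIGURATION", "FOLD"], env=env, check=True)
```

Passing a copied environment to the subprocess keeps the settings local to that one training run instead of changing the whole shell session.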

vmiller987 • May 17 '25 12:05

Hello, I have recently been modifying the network architecture, mainly by adding an additional input and output, so it is now dual-input and dual-output. However, I ran into a problem with the data loader. Could we get in touch to discuss it? Thank you.

Yllynette • Oct 17 '25 10:10