batchgenerators
RuntimeError
Hi @FabianIsensee,
I'm using the example "multithreaded_with_batches.ipynb" to generate my own batch data; however, the following RuntimeError appeared.
Can you offer me a hint to solve this?
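For reference, the basic pattern that notebook demonstrates looks roughly like this (a toy sketch, not my actual loader; only the batchgenerators imports and class names are the real ones):

import numpy as np
from batchgenerators.dataloading.data_loader import SlimDataLoaderBase
from batchgenerators.dataloading.multi_threaded_augmenter import MultiThreadedAugmenter

class ToyLoader(SlimDataLoaderBase):
    # SlimDataLoaderBase only asks us to implement generate_train_batch;
    # self._data and self.batch_size are set by its __init__
    def generate_train_batch(self):
        idx = np.random.choice(len(self._data), self.batch_size)
        return {'data': self._data[idx].astype(np.float32)}

data = np.random.rand(100, 1, 64, 64)  # 100 single-channel 64x64 toy images
loader = ToyLoader(data, batch_size=4, number_of_threads_in_multithreaded=2)
mt_gen = MultiThreadedAugmenter(loader, transform=None, num_processes=2,
                                num_cached_per_queue=2, seeds=None, pin_memory=False)
batch = next(mt_gen)  # if a worker process dies, this is where the abort_event RuntimeError is raised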
Hi, there must be another error message somewhere in your output. Can you look for it?
Hi!
Apologies, Fabian and Chao, for invading this issue, but I am having a similar issue to yours, and maybe I can clarify the error Chao was getting.
For context, I was trying to run nnUNet with modified code. I changed max_num_epochs from 1000 to 400 and the lr threshold from 1e-6 to 5e-3. Furthermore, I was getting stuck at validation, so I implemented the change mentioned in https://github.com/MIC-DKFZ/nnUNet/issues/902. As you can see, I commented out the original code and substituted it with the following (line 662 of nnUNetTrainer):
# changed by vicent 09/03/22 to speed up validation according to github issue #902
# results.append(export_pool.starmap_async(save_segmentation_nifti_from_softmax,
#                                          ((softmax_pred, join(output_folder, fname + ".nii.gz"),
#                                            properties, interpolation_order, self.regions_class_order,
#                                            None, None,
#                                            softmax_fname, None, force_separate_z,
#                                            interpolation_order_z),
#                                           )
#                                          )
#                )
save_segmentation_nifti_from_softmax(softmax_pred, join(output_folder, fname + ".nii.gz"),
                                     properties, interpolation_order, self.regions_class_order,
                                     None, None,
                                     softmax_fname, None, force_separate_z,
                                     interpolation_order_z)
I don't believe this is the problem, though, since my error happens at the very beginning of training, and this change, as far as I can tell, mainly affects validation.
Anyway, this is the error message I got. As you can see, I don't think there is any useful message about what is going on, only that the exception is happening in thread 4.
loading dataset
loading all case properties
unpacking dataset
done
2023-03-09 21:32:24.056818: lr: 0.01
using pin_memory on device 0
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("Abort event was set. So someone died and we should end this madness. \nIMPORTANT: "
RuntimeError: Abort event was set. So someone died and we should end this madness.
IMPORTANT: This is not the actual error message! Look further up to see what caused the error. Please also check whether your RAM was full
Traceback (most recent call last):
File "/home/vcaselles/anaconda3/envs/dents/bin/nnUNet_train", line 8, in <module>
sys.exit(main())
File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/run/run_training.py", line 180, in main
trainer.run_training()
File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/training/network_training/nnUNetTrainerV2_epoch400_lr_thr_0005.py", line 441, in run_training
ret = super().run_training()
File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/training/network_training/nnUNetTrainer_modvalidation.py", line 317, in run_training
super(nnUNetTrainer_modvalidation, self).run_training()
File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/training/network_training/network_trainer.py", line 418, in run_training
_ = self.tr_gen.next()
File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 182, in next
return self.__next__()
File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 206, in __next__
item = self.__get_next_item()
File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 190, in __get_next_item
raise RuntimeError("MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of "
RuntimeError: MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of your workers crashed. This is not the actual error message! Look further up your stdout to see what caused the error. Please also check whether your RAM was full
Thank you very much for your attention, and apologies for the long and dense message; I hope I was clear enough.
Best regards,
Vicent Caselles
PS: To get run_training.py to work, I also had to change main() to accept my modified trainer in the workflow. I don't think that is the issue, though.
Is that all the text output you got? Can you please share everything? Usually there is an error message hidden somewhere.
Why not just use nnUNet_train?
Hi Fabian, thank you very much for your response. Regarding your questions:
- Yes, that was all the error output I got, unfortunately.
- I created a new custom nnUNet trainer class with my custom max_num_epochs and lr threshold, both defined in the __init__ of that class (roughly as in the sketch below). Did I make a mistake doing that?
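For clarity, the subclass looks roughly like this (a sketch only, based on nnU-Net v1 and simplified to inherit directly from nnUNetTrainerV2; max_num_epochs and lr_threshold are the attributes set in nnUNetTrainerV2 / NetworkTrainer):

from nnunet.training.network_training.nnUNetTrainerV2 import nnUNetTrainerV2

class nnUNetTrainerV2_epoch400_lr_thr_0005(nnUNetTrainerV2):
    def __init__(self, plans_file, fold, output_folder=None, dataset_directory=None,
                 batch_dice=True, stage=None, unpack_data=True, deterministic=True,
                 fp16=False):
        super().__init__(plans_file, fold, output_folder, dataset_directory,
                         batch_dice, stage, unpack_data, deterministic, fp16)
        self.max_num_epochs = 400  # default in nnUNetTrainerV2 is 1000
        self.lr_threshold = 5e-3   # default in NetworkTrainer is 1e-6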
I honestly think that the error was caused by a lack of RAM, since I was using a server with a great GPU but terrible RAM (~2 GB or so), so the odds are that that was the issue.
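In case it helps anyone hitting the same abort_event error, a quick way to check whether RAM is the bottleneck (psutil is not an nnU-Net dependency, just a convenient tool; nnUNet_n_proc_DA is the environment variable nnU-Net v1 reads for the number of augmentation workers, so double-check your installed version):

import psutil

# Print total and currently available system RAM. The background augmentation
# workers each keep batches in memory, so with very little available RAM they
# tend to get killed, which then surfaces as the abort_event RuntimeError above.
mem = psutil.virtual_memory()
print(f"total RAM: {mem.total / 1024**3:.1f} GiB, "
      f"available: {mem.available / 1024**3:.1f} GiB")

# If RAM is tight, running with fewer augmentation workers can help, e.g.
#   export nnUNet_n_proc_DA=4
# before calling nnUNet_train (assumption: available in nnU-Net v1; check your version).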
Thanks again for your time!!
Vicent Caselles
Yeah, sounds like it. Are you certain about 2 GB? That's year-2000 levels of RAM.
Yes, it was the cheapest AWS server with CUDA... It was 4 GB tops.
Vicent