DeepLearningExamples
[nnUNET/PyTorch] Training step running into "RuntimeError: Critical error in pipeline: Error when executing CPU operator readers__Numpy, instance name: "ReaderX", encountered: CUDA allocation failed Current pipeline object is no longer valid."
Related to nnUNet/PyTorch
I am trying to use the BraTS21.ipynb and BraTS22.ipynb notebooks to train the nnUNet model, but I constantly run into the following error: "RuntimeError: Critical error in pipeline: Error when executing CPU operator readers__Numpy, instance name: "ReaderX", encountered: CUDA allocation failed Current pipeline object is no longer valid."
Detailed description:
834 training, 417 validation, 1251 test examples
Provided checkpoint /mnt/e/Naveen/Datasets/BraTS2021/check_points/ is not a file. Starting training from scratch.
Filters: [64, 128, 256, 512, 768, 1024],
Kernels: [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]
Strides: [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]]
precision=16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=nccl All distributed processes registered. Starting with 1 processes
834 training, 417 validation, 1251 test examples
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
0 | model | UNet3D | 177 M
1 | model.input_block | InputBlock | 119 K
2 | model.downsamples | ModuleList | 40.5 M
3 | model.bottleneck | ConvBlock | 49.5 M
4 | model.upsamples | ModuleList | 87.2 M
5 | model.output_block | OutputBlock | 195
6 | model.deep_supervision_heads | ModuleList | 1.2 K
7 | loss | LossBraTS | 0
8 | loss.dice | DiceLoss | 0
9 | loss.ce | BCEWithLogitsLoss | 0
177 M Trainable params
0 Non-trainable params
177 M Total params
709.241 Total estimated model params size (MB)
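As a sanity check on the summary above, the 709.241 MB figure is consistent with Lightning's convention of estimating parameter size as parameter count × 4 bytes (FP32), with MB meaning 10^6 bytes; a minimal sketch:

```python
# Reported total from the Lightning model summary above.
reported_size_mb = 709.241  # "Total estimated model params size (MB)"

# Invert Lightning's estimate (num_params * 4 bytes, in units of 1e6 bytes)
# to recover the parameter count, which should be ~177 M as reported.
num_params = round(reported_size_mb * 1e6 / 4)
size_mb = num_params * 4 / 1e6

print(f"{num_params:,} params -> {size_mb:.3f} MB")
# -> 177,310,250 params -> 709.241 MB
```

This matches the "177 M Total params" line, so the model itself accounts for only ~0.7 GB of the 24 GB on an RTX 3090; the allocation failure is more likely driven by activations, the DALI pipeline's buffers, or WSL2 memory limits than by the weights.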
Epoch 0/9 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:14 • 0:00:00 1.36it/s 1.31it/s
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━ 3/4 0:00:07 • 0:00:01 1.58it/s
Traceback (most recent call last):
File "/home/navi/nnUNet/notebooks/../main.py", line 128, in
Steps to reproduce the behavior:
- Set up the data directories according to the notebook requirements and launch JupyterLab on the system.
- cd into nnUNet/notebooks and open either BraTS21.ipynb or BraTS22.ipynb in JupyterLab.
- Run the training step with the command: "!python ../main.py --brats --brats22_model --data /mnt/e/Naveen/Datasets/BraTS2021/11_3d/ --results /mnt/e/Naveen/Datasets/BraTS2021/ --ckpt_path /mnt/e/Naveen/Datasets/BraTS2021/check_points/ --ckpt_store_dir /mnt/e/Naveen/Datasets/BraTS2021/check_points/ --scheduler --learning_rate 0.05 --epochs 10 --fold 0 --gpus 1 --amp --task 11 --nfolds 5 --save_ckpt"
Expected behavior: the training step completes successfully and produces a trained model.
Environment
- Installed all requirements according to requirements.txt
- PyTorch: 2.2.0+cu121
- CUDA: Cuda compilation tools, release 12.1, V12.1.66; Build cuda_12.1.r12.1/compiler.32415258_0
- Platform: WSL2 on Windows
- GPU: NVIDIA RTX 3090
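Environment details like the above can be collected reproducibly with a small stdlib-only snippet; the package names in the loop are assumptions about what this setup installs, and anything missing is simply reported as such:

```python
import platform
import sys
from importlib import metadata

# Basic platform information (on WSL2 this reports a Linux kernel string).
print("Platform:", platform.platform())
print("Python:", sys.version.split()[0])

# Versions of the packages most relevant to this error, if installed.
for pkg in ("torch", "pytorch-lightning", "nvidia-dali-cuda120"):
    try:
        print(f"{pkg}:", metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")
```

Pasting this output into the issue (together with `nvidia-smi`) makes it easier to rule out a driver or DALI-build mismatch under WSL2.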