run_inference fails on CBS Basic with the --use-conda flag
- When the --use-singularity flag is specified on CBS Basic, run_inference works.
- Command run:
hippunfold ~/Desktop/lowresMRI/ /localscratch/hipp_output participant --use-conda --modality T1w --cores all
Error message:
/usr/bin/bash: line 1: 32702 Killed nnUNet_predict -i tempimg -o templbl -t Task101_hcp1200_T1w -chk model_best -tr nnUNetTrainerV2 --disable_tta &> logs/sub-01/sub-01_hemi-L_space-corobl_nnunet.txt
[Mon Mar 10 10:13:16 2025]
Error in rule run_inference:
jobid: 0
input: work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-preproc_T1w.nii.gz, /localscratch/.cache/hippunfold/model/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar
output: work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-nnunet_dseg.nii.gz
log: logs/sub-01/sub-01_hemi-L_space-corobl_nnunet.txt (check log file(s) for error details)
conda-env: /local/scratch/hipp_output/.snakemake/conda/3bf59d51ad8aa8841da91b382341fb82_
shell:
mkdir -p tempmodel tempimg templbl && cp work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-preproc_T1w.nii.gz tempimg/temp_0000.nii.gz && tar -xf /localscratch/.cache/hippunfold/model/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar -C tempmodel && export RESULTS_FOLDER=tempmodel && export nnUNet_n_proc_DA=4 && nnUNet_predict -i tempimg -o templbl -t Task101_hcp1200_T1w -chk model_best -tr nnUNetTrainerV2 --disable_tta &> logs/sub-01/sub-01_hemi-L_space-corobl_nnunet.txt && cp templbl/temp.nii.gz work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-nnunet_dseg.nii.gz
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Exiting because a job execution failed. Look above for error message
WorkflowError:
At least one job did not complete successfully.
log file:
nnUNet_raw_data_base is not defined and nnU-Net can only be used on data for which preprocessed files are already present on your system. nnU-Net cannot be used for experiment planning and preprocessing like this. If this is not intended, please read nnunet/paths.md for information on how to set this up properly.
nnUNet_preprocessed is not defined and nnU-Net can not be used for preprocessing or training. If this is not intended, please read nnunet/paths.md for information on how to set this up.
using model stored in tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1
This model expects 1 input modalities for each image
Found 1 unique case ids, here are some examples: ['temp']
If they don't look right, make sure to double check your filenames. They must end with _0000.nii.gz etc
number of cases: 1
number of cases that still need to be predicted: 1
emptying cuda cache
loading parameters for folds, None
folds is None so we will automatically look for output folders (not using 'all'!)
found the following folds: ['tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_0', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_1', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_2', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_3', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_4']
using the following model files: ['tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_0/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_1/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_2/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_3/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_4/model_best.model']
starting preprocessing generator
starting prediction...
preprocessing templbl/temp.nii.gz
using preprocessor GenericPreprocessor
before crop: (1, 128, 256, 128) after crop: (1, 128, 256, 128) spacing: [0.30000001 0.30000001 0.30000001]
no resampling necessary
no resampling necessary
before: {'spacing': array([0.30000001, 0.30000001, 0.30000001]), 'spacing_transposed': array([0.30000001, 0.30000001, 0.30000001]), 'data.shape (data is transposed)': (1, 128, 256, 128)}
after: {'spacing': array([0.30000001, 0.30000001, 0.30000001]), 'data.shape (data is resampled)': (1, 128, 256, 128)}
(1, 128, 256, 128)
This worker has ended successfully, no errors to report
/local/scratch/hipp_output/.snakemake/conda/3bf59d51ad8aa8841da91b382341fb82_/lib/python3.9/site-packages/torch/autocast_mode.py:141: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
It seems the job was killed due to an out-of-memory (OOM) error. Apparently, --use-singularity does a better job with resource management and allocation than --use-conda. Should we expect the end user to have enough memory and cores to run the hippunfold pipeline, or should we limit the resources for run_inference accordingly? What do you think @akhanf?
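To back up the OOM hypothesis, a quick check on the node right after a failure should show whether the kernel's OOM killer terminated nnUNet_predict. This is only a rough sketch using standard Linux tools, nothing hippunfold-specific:

```bash
# Look for kernel OOM-killer entries around the time of the failure
# (dmesg may need sudo on some hosts)
dmesg -T | grep -iE 'killed process|out of memory' | tail

# or, on systemd hosts:
journalctl -k --since "1 hour ago" | grep -i oom

# Check memory/swap headroom while run_inference is running
free -h
```

If memory pressure is confirmed, one option might be to run with fewer cores (e.g. --cores 4 instead of --cores all), or, assuming hippunfold forwards extra arguments through to Snakemake, to cap the rule explicitly with something like --set-threads run_inference=4. I haven't tested the latter, so treat it as a suggestion rather than a known fix.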
While investigating this issue, I noticed that the pipeline takes nearly four times longer to reach the run_inference rule with --use-conda than with --use-singularity. This slowdown didn't occur previously with --use-conda, and I suspect it's due to the switch from the official nnUNet conda package (v1.7.1) to our custom nnUNet package on Khanlab (v1.6.6).
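To confirm which nnUNet build the Snakemake-created conda env actually resolved, something like the following should work (the env prefix is taken from the traceback above; the hash will differ between installs):

```bash
# List the nnunet package pinned in the env Snakemake built for this run
conda list -p /local/scratch/hipp_output/.snakemake/conda/3bf59d51ad8aa8841da91b382341fb82_ nnunet

# Equivalent check through the env's own pip
/local/scratch/hipp_output/.snakemake/conda/3bf59d51ad8aa8841da91b382341fb82_/bin/pip show nnunet
```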
Did we confirm whether this was a resources issue? (e.g., extra conda envs taking up more space on local disk, leaving less space for swap?)
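For reference, the disk and swap side of that question can be checked with standard tools; a rough sketch (paths taken from the command above, so adjust them to wherever the output directory actually lives):

```bash
# Free space on the local scratch disk used for the run
df -h /localscratch

# How much space the Snakemake-created conda envs occupy
du -sh /localscratch/hipp_output/.snakemake/conda

# Current swap devices and usage
swapon --show
```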
Last time I tried working on this, I was unable to run the pipeline with the --report flag for some reason. I'll give this another spin!
I just ran an end-to-end pipeline successfully on CBS Basic! @mackenziesnyder, can you try running this on your end? It seems to be working fine now!