
run_inference fails on CBS Basic with the --use-conda flag

mackenziesnyder opened this issue 9 months ago

  • When the --use-singularity flag is specified on CBS Basic, run_inference works.
  • Command run: hippunfold ~/Desktop/lowresMRI/ /localscratch/hipp_output participant --use-conda --modality T1w --cores all
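
For comparison, a minimal sketch of the container-based invocation that reportedly works on the same data; this just swaps the backend flag in the command above and is otherwise an assumption, not a tested command:

    # Hypothetical comparison run: same arguments, Singularity backend instead of conda
    hippunfold ~/Desktop/lowresMRI/ /localscratch/hipp_output participant \
        --use-singularity --modality T1w --cores all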

Error message:

/usr/bin/bash: line 1: 32702 Killed                  nnUNet_predict -i tempimg -o templbl -t Task101_hcp1200_T1w -chk model_best -tr nnUNetTrainerV2 --disable_tta &> logs/sub-01/sub-01_hemi-L_space-corobl_nnunet.txt
[Mon Mar 10 10:13:16 2025]
Error in rule run_inference:
    jobid: 0
    input: work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-preproc_T1w.nii.gz, /localscratch/.cache/hippunfold/model/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar
    output: work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-nnunet_dseg.nii.gz
    log: logs/sub-01/sub-01_hemi-L_space-corobl_nnunet.txt (check log file(s) for error details)
    conda-env: /local/scratch/hipp_output/.snakemake/conda/3bf59d51ad8aa8841da91b382341fb82_
    shell:
        mkdir -p tempmodel tempimg templbl && cp work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-preproc_T1w.nii.gz tempimg/temp_0000.nii.gz && tar -xf /localscratch/.cache/hippunfold/model/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar -C tempmodel && export RESULTS_FOLDER=tempmodel && export nnUNet_n_proc_DA=4 && nnUNet_predict -i tempimg -o templbl -t Task101_hcp1200_T1w -chk model_best -tr nnUNetTrainerV2 --disable_tta &> logs/sub-01/sub-01_hemi-L_space-corobl_nnunet.txt && cp templbl/temp.nii.gz work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-nnunet_dseg.nii.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Exiting because a job execution failed. Look above for error message
WorkflowError:
At least one job did not complete successfully.

log file:

nnUNet_raw_data_base is not defined and nnU-Net can only be used on data for which preprocessed files are already present on your system. nnU-Net cannot be used for experiment planning and preprocessing like this. If this is not intended, please read nnunet/paths.md for information on how to set this up properly.
nnUNet_preprocessed is not defined and nnU-Net can not be used for preprocessing or training. If this is not intended, please read nnunet/pathy.md for information on how to set this up.
using model stored in  tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1
This model expects 1 input modalities for each image
Found 1 unique case ids, here are some examples: ['temp']
If they don't look right, make sure to double check your filenames. They must end with _0000.nii.gz etc
number of cases: 1
number of cases that still need to be predicted: 1
emptying cuda cache
loading parameters for folds, None
folds is None so we will automatically look for output folders (not using 'all'!)
found the following folds:  ['tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_0', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_1', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_2', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_3', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_4']
using the following model files:  ['tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_0/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_1/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_2/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_3/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_4/model_best.model']
starting preprocessing generator
starting prediction...
preprocessing templbl/temp.nii.gz
using preprocessor GenericPreprocessor
before crop: (1, 128, 256, 128) after crop: (1, 128, 256, 128) spacing: [0.30000001 0.30000001 0.30000001] 

no resampling necessary
no resampling necessary
before: {'spacing': array([0.30000001, 0.30000001, 0.30000001]), 'spacing_transposed': array([0.30000001, 0.30000001, 0.30000001]), 'data.shape (data is transposed)': (1, 128, 256, 128)} 
after:  {'spacing': array([0.30000001, 0.30000001, 0.30000001]), 'data.shape (data is resampled)': (1, 128, 256, 128)} 

(1, 128, 256, 128)
This worker has ended successfully, no errors to report
/local/scratch/hipp_output/.snakemake/conda/3bf59d51ad8aa8841da91b382341fb82_/lib/python3.9/site-packages/torch/autocast_mode.py:141: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')

mackenziesnyder, Mar 10 '25 16:03
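
The bare "Killed" from bash in the error above usually means the kernel's OOM killer ended the process. A quick, hedged way to confirm that on the node where the job died (standard Linux tools, nothing hippunfold-specific; may need root):

    # Look for an OOM kill in the kernel log around the time of the failure
    dmesg -T | grep -iE 'out of memory|killed process'
    # Or, on systemd machines, scoped to the time window of the run
    journalctl -k --since "2025-03-10 10:00" | grep -iE 'out of memory|killed process'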

It seems like the job was killed due to an out-of-memory (OOM) error. Apparently, --use-singularity does a better job of resource management and allocation than --use-conda. Should we expect the end user to have enough memory and cores to run the hippunfold pipeline, or should we limit the resources for run_inference accordingly? What do you think @akhanf?

Dhananjhay, Mar 11 '25 10:03
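
One possible way to limit resources, sketched under the assumption that hippunfold forwards unrecognized options on to Snakemake (the usual snakebids behaviour, not verified here) and that run_inference respects a mem_mb resource; the numbers are placeholders, not tested values:

    # Cap parallelism and per-rule memory so nnUNet_predict is less likely to be OOM-killed
    hippunfold ~/Desktop/lowresMRI/ /localscratch/hipp_output participant \
        --use-conda --modality T1w \
        --cores 8 \
        --resources mem_mb=16000 \
        --set-resources run_inference:mem_mb=16000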

While investigating this issue, I noticed that the pipeline takes nearly four times longer to reach the run_inference rule when using the --use-conda flag compared to --use-singularity. This slowdown didn't occur previously with --use-conda, and I suspect it's due to the switch from the official nnUNet Conda package (v1.7.1) to our custom nnUNet package on Khanlab (v1.6.6).

[two attached screenshots: runtime comparison]

Dhananjhay, Mar 18 '25 13:03
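
If the suspect is the nnUNet build that ends up in the conda env, one hedged way to check is to query the env Snakemake created (the hash-named path below is copied from the error output above and will differ between runs):

    # Show which nnunet package/version is installed in the run_inference env
    conda list -p /local/scratch/hipp_output/.snakemake/conda/3bf59d51ad8aa8841da91b382341fb82_ nnunet
    # or, with that env activated:
    pip show nnunet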

Did we confirm whether this was a resources issue? (e.g. extra conda envs taking up more space on local disk, leaving less space for swap?)

akhanf, Apr 17 '25 14:04
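
A quick way to check the disk and swap side of this question on the node itself (standard Linux tools; paths taken from the logs above):

    # Size of the extra conda envs on local scratch
    du -sh /local/scratch/hipp_output/.snakemake/conda/*
    # Free space on the scratch filesystem
    df -h /localscratch
    # Memory and swap headroom
    free -h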

The last time I tried working on this, I was unable to run the pipeline with the --report flag for some reason. I'll give this another spin!

Dhananjhay, Apr 17 '25 14:04

> Did we confirm whether this was a resources issue? (e.g. extra conda envs taking up more space on local disk, leaving less space for swap?)

I just ran an end-to-end pipeline successfully on CBS Basic! @mackenziesnyder can you try running this on your end? It seems to be working fine now!

Dhananjhay, May 16 '25 18:05