
Preprocessing stuck on hstack_confounds

pcamach2 opened this issue 1 year ago · 4 comments

I am testing out a new compute platform and I keep running into the same preprocessing error: the workflow stalls during hstack_confounds. The slurm job continues to run until the hard time limit, but no DWI outputs are produced in bids/derivatives/qsiprep, only anatomical preprocessing outputs.

This happens with Singularity containers for both QSIPrep v0.19.0 and v0.19.1, both using the same $SINGULARITY_TMPDIR as the working directory. The command is as follows:

SINGULARITY_CACHEDIR=${scachedir} SINGULARITY_TMPDIR=${stmpdir} singularity run --cleanenv --bind ${IMAGEDIR}:/imgdir,${stmpdir}:/sing_scratch,${projDir}:/data \
${IMAGEDIR}/qsiprep-v0.19.1.sif \
--fs-license-file /imgdir/license.txt /data/bids/sourcedata /data/bids/derivatives \
--output-resolution 2.5 --denoise-method patch2self --resource-monitor -w /sing_scratch \
--nthreads 24 --omp-nthreads 8 --low-mem --longitudinal --separate_all_dwis \
--bids-filter-file /data/bids/derivatives/qsiprep/config_A.json \
-vv --notrack --write-graph --pepolar-method TOPUP \
participant --participant-label ${subject}

The error is first thrown here:

[Node] Setting-up "qsiprep_wf.single_subject_SAY542_wf.dwi_preproc_ses_A_run_1_wf.pre_hmc_wf.merge_and_denoise_wf.dwi_denoise_ses_A_run_1_dwi_wf.hstack_confounds" in "/sing_scratch/qsiprep_wf/single_subje>
231006-09:52:35,859 nipype.workflow INFO:
         [Node] Executing "hstack_confounds" <qsiprep.interfaces.dwi_merge.StackConfounds>
231006-09:52:35,859 nipype.workflow WARNING:
         [Node] Error on "qsiprep_wf.single_subject_SAY542_wf.dwi_preproc_ses_A_run_1_wf.pre_hmc_wf.merge_and_denoise_wf.dwi_denoise_ses_A_run_1_dwi_wf.hstack_confounds" (/sing_scratch/qsiprep_wf/single_subject_SA>
exception calling callback for <Future at 0x2b916bc90a00 state=finished raised FileNotFoundError>
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/plugins/multiproc.py", line 67, in run_node
    result["result"] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 527, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 645, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 722, in _run_command
    result = self._interface.run(cwd=outdir, ignore_exception=True)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 388, in run
    self._check_mandatory_inputs()
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 275, in _check_mandatory_inputs
    raise ValueError(msg)
ValueError: StackConfounds requires a value for input 'in_files'. For a list of required inputs, see StackConfounds.help()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/plugins/multiproc.py", line 70, in run_node
    result["result"] = node.result
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 223, in result
    return _load_resultfile(
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/utils.py", line 291, in load_resultfile
    raise FileNotFoundError(results_file)
FileNotFoundError: /sing_scratch/qsiprep_wf/single_subject_SAY542_wf/dwi_preproc_ses_A_run_1_wf/pre_hmc_wf/merge_and_denoise_wf/dwi_denoise_ses_A_run_1_dwi_wf/hstack_confounds/result_hstack_confounds.pklz
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/concurrent/futures/_base.py", line 328, in _invoke_callbacks
    callback(self)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/plugins/multiproc.py", line 159, in _async_callback
    result = args.result()
  File "/usr/local/miniconda/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return _load_resultfile(
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/utils.py", line 291, in load_resultfile
    raise FileNotFoundError(results_file)
FileNotFoundError: /sing_scratch/qsiprep_wf/single_subject_SAY542_wf/dwi_preproc_ses_A_run_1_wf/pre_hmc_wf/merge_and_denoise_wf/dwi_denoise_ses_A_run_1_dwi_wf/hstack_confounds/result_hstack_confounds.pklz
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/concurrent/futures/_base.py", line 328, in _invoke_callbacks
    callback(self)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/plugins/multiproc.py", line 159, in _async_callback
    result = args.result()
  File "/usr/local/miniconda/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/miniconda/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
FileNotFoundError: /sing_scratch/qsiprep_wf/single_subject_SAY542_wf/dwi_preproc_ses_A_run_1_wf/pre_hmc_wf/merge_and_denoise_wf/dwi_denoise_ses_A_run_1_dwi_wf/hstack_confounds/result_hstack_confounds.pklz
231006-09:52:37,505 nipype.workflow INFO:
         [Job 25] Completed (qsiprep_wf.single_subject_SAY542_wf.dwi_preproc_ses_A_run_1_wf.pre_hmc_wf.merge_and_denoise_wf.conformed_images).
231006-09:52:37,506 nipype.workflow INFO:
         [MultiProc] Running 3 tasks, and 0 jobs ready. Free memory (GB): 452.62/453.22, Free processors: 7/24.
                     Currently running:
                       * qsiprep_wf.single_subject_SAY542_wf.dwi_preproc_ses_A_run_1_wf.pre_hmc_wf.merge_and_denoise_wf.dwi_denoise_ses_A_run_1_dwi_wf.hstack_confounds
                       * qsiprep_wf.single_subject_SAY542_wf.anat_preproc_wf.anat_normalization_wf.anat_nlin_normalization
                       * qsiprep_wf.single_subject_SAY542_wf.anat_preproc_wf.synthseg_anat_wf.synthseg
231006-09:57:19,190 nipype.workflow INFO:


Eventually hstack_confounds is the only process still running until the job times out. The directory /qsiprep_wf/single_subject_SAY542_wf/dwi_preproc_ses_A_run_1_wf/pre_hmc_wf/merge_and_denoise_wf/dwi_denoise_ses_A_run_1_dwi_wf/hstack_confounds/ exists, but it contains only the following:

$ ls qsiprep_wf/single_subject_SAY277_wf/dwi_preproc_ses_A_run_1_wf/pre_hmc_wf/merge_and_denoise_wf/dwi_denoise_ses_A_run_1_dwi_wf/hstack_confounds/            
_inputs.pklz  _node.pklz  _report

where the report.rst reads:

Node: single_subject_SAY277_wf (dwi_preproc_ses_A_run_1_wf (pre_hmc_wf (merge_and_denoise_wf (dwi_denoise_ses_A_run_1_dwi_wf (hstack_confounds (dwi_merge)
==========================================================================================================================================================


 Hierarchy : qsiprep_wf.single_subject_SAY277_wf.dwi_preproc_ses_A_run_1_wf.pre_hmc_wf.merge_and_denoise_wf.dwi_denoise_ses_A_run_1_dwi_wf.hstack_confounds
 Exec ID : hstack_confounds


Original Inputs
---------------


* axis : 1
* in_files : <undefined>
* out_file : <undefined>
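
For completeness, the undefined inputs can also be read back from the cached _inputs.pklz (a minimal check, assuming the same bind mounts as the run command so that nipype is importable inside the container):

# dump the cached inputs for the stalled hstack_confounds node
singularity exec --cleanenv --bind ${stmpdir}:/sing_scratch ${IMAGEDIR}/qsiprep-v0.19.1.sif \
    python -c "from nipype.utils.filemanip import loadpkl; print(loadpkl('/sing_scratch/qsiprep_wf/single_subject_SAY277_wf/dwi_preproc_ses_A_run_1_wf/pre_hmc_wf/merge_and_denoise_wf/dwi_denoise_ses_A_run_1_dwi_wf/hstack_confounds/_inputs.pklz'))"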

This occurs with or without --low-mem. Has anyone else run into this on an HPC?

pcamach2 commented Oct 06 '23 16:10

This looks like something that has been happening a lot recently: a subprocess is killed and the rest of the pipeline hangs indefinitely. How much memory are you allowed to access on the compute host? You need quite a bit for synthseg and synthstrip.
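
One quick way to check whether a worker got OOM-killed (assuming slurm accounting is enabled on your cluster) is something like:

# peak memory and final state for the job and its steps (<jobid> is your slurm job ID)
sacct -j <jobid> --format=JobID,JobName,State,MaxRSS,ReqMem,Elapsed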

mattcieslak commented Oct 06 '23 19:10

To get a faster node allocation, I try to stay at or below 256G, but I have tried this with up to 500G of memory requested for the slurm job. The same behavior occurs with 16, 24, or 32 cpu threads allocated. I think synthseg and synthstrip are able to complete because there are outputs for the anat workflow:

$ ls bids/derivatives/qsiprep/sub-SAY542/anat
sub-SAY542_desc-aseg_dseg.nii.gz   sub-SAY542_desc-preproc_T1w.nii.gz  sub-SAY542_from-MNI152NLin2009cAsym_to-T1w_mode-image_xfm.h5  sub-SAY542_from-T1wNative_to-T1wACPC_mode-image_xfm.mat
sub-SAY542_desc-brain_mask.nii.gz  sub-SAY542_dseg.nii.gz              sub-SAY542_from-T1wACPC_to-T1wNative_mode-image_xfm.mat       sub-SAY542_from-T1w_to-MNI152NLin2009cAsym_mode-image_xfm.h5

Here are the relevant contents of the working directory:

$ ls qsiprep_wf/*wf/anat*
acpc_aseg_to_dseg    anat_normalization_wf  anat_template_wf    output_grid_wf         rigid_acpc_resample_aseg   rigid_acpc_resample_head  seg2msks  synthseg_anat_wf
anat_derivatives_wf  anat_reports_wf        get_template_image  pad_anat_reference_wf  rigid_acpc_resample_brain  rigid_acpc_resample_mask  seg_rpt   synthstrip_anat_wf

$ ls qsiprep_wf/*wf/anat*/syn*
qsiprep_wf/single_subject_SAY542_wf/anat_preproc_wf/synthseg_anat_wf:
synthseg

qsiprep_wf/single_subject_SAY542_wf/anat_preproc_wf/synthstrip_anat_wf:
mask_brain  mask_to_original_grid  synthstrip

$ ls qsiprep_wf/*wf/anat*/syn*/*
qsiprep_wf/single_subject_SAY542_wf/anat_preproc_wf/synthseg_anat_wf/synthseg:
_0x5c74e38b43160dd058d8e04fcb3e75e9.json  _node.pklz  command.txt           sub-SAY542_ses-A_T1w_lps_trans_aseg.nii.gz  sub-SAY542_ses-A_T1w_lps_trans_qc.csv
_inputs.pklz                              _report     result_synthseg.pklz  sub-SAY542_ses-A_T1w_lps_trans_post.nii.gz

qsiprep_wf/single_subject_SAY542_wf/anat_preproc_wf/synthstrip_anat_wf/mask_brain:
_0xf9b00d8b4bee332bf0b3f022a4f02d7e.json  _inputs.pklz  _node.pklz  _report  command.txt  masked_brain.nii  result_mask_brain.pklz

qsiprep_wf/single_subject_SAY542_wf/anat_preproc_wf/synthstrip_anat_wf/mask_to_original_grid:
_0x320567bc954b0aa201fc9acfb1fa223c.json  _inputs.pklz  _node.pklz  _report  command.txt  result_mask_to_original_grid.pklz  sub-SAY542_ses-A_T1w_lps_trans_mask_trans.nii.gz

ses-A/qsiprep_wf/single_subject_SAY542_wf/anat_preproc_wf/synthstrip_anat_wf/synthstrip:
_0xdfcab5356c161dc2eafc87ec97d1da5b.json  _inputs.pklz  _node.pklz  _report  command.txt  result_synthstrip.pklz  sub-SAY542_ses-A_T1w_lps_trans_brain.nii.gz  sub-SAY542_ses-A_T1w_lps_trans_mask.nii.gz

The only errors I see in the terminal output are for hstack_confounds; does this node depend on synthseg or synthstrip outputs? I will try running with dwidenoise instead of patch2self to see if the workflow can get past dwi_merge. Any other suggestions would be greatly appreciated! Thanks!

pcamach2 commented Oct 06 '23 19:10

Using dwidenoise has resolved the stuck process in hstack_confounds:

         [Node] Setting-up "qsiprep_wf.single_subject_SAY277_wf.dwi_preproc_ses_A_run_1_wf.pre_hmc_wf.merge_and_denoise_wf.dwi_denoise_ses_A_run_1_dwi_wf.hstack_confounds" in "/sing_scratch/qsiprep_wf/single_subje>
231006-15:01:34,587 nipype.workflow INFO:
         [Node] Executing "hstack_confounds" <qsiprep.interfaces.dwi_merge.StackConfounds>
231006-15:01:34,594 nipype.workflow INFO:
         [Node] Finished "hstack_confounds", elapsed time 0.006104s.

The problem seems to be related to patch2self denoising, which appears to complete based on the terminal output but perhaps never connects its output to the in_files input of hstack_confounds?
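
For reference, the retry used the same command as above with only the denoise flag changed:

# identical call to the one above, swapping only the denoise method
--denoise-method dwidenoise    # was: --denoise-method patch2self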

pcamach2 commented Oct 06 '23 20:10

I totally missed that you're using patch2self. The wrapper for it in qsiprep could use a refactor; if you're using a lot of threads it can take up an incredible amount of memory.
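
If you need to keep patch2self in the meantime, a possible (untested) workaround is to give the run fewer threads so its memory footprint stays smaller, e.g.:

# untested sketch: lower the thread counts from 24/8 to limit patch2self memory use
--nthreads 8 --omp-nthreads 4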

mattcieslak commented Oct 06 '23 21:10