autorecon3 issue
Running fmriprep (v1.4.0) gives me random I/O-related autorecon3 errors in around half the subjects, similar to this.
Random, in the sense that it is not always the same subjects across different executions, and not always the same error. If I run without T2w, the T2.prenorm.mgz-related errors are replaced by others (for instance, during mri_segstats: "No such file or directory; ERROR: loading mri/wmparc.mgz", even though the file exists).
I am running one subject per instance with 64 GB of memory, and Docker has access to nearly all of it.
I thought this might be because freesurfer_dir is shared on an NFS, so I tried two approaches, neither of which changed anything: i) mounting a local, unshared directory; ii) using a non-mounted directory inside the Docker container. Hence, https://github.com/poldracklab/smriprep/issues/44 won't fix this.
However, FreeSurfer works fine when run either via bids/freesurfer:v6.0.1-5 or via fmriprep with 1 CPU, which leads me to believe this might be an fmriprep parallelization issue.
Since autorecon3 is potentially run for lh and rh simultaneously, might the problem be that both processes try to write non-hemisphere-specific files at the same time (e.g., during -T2pial or -wmparc)?
smriprep says that "The excluded steps in the second and third stages (-no<option>) are not fully hemisphere independent, and are therefore postponed to the final two stages." But if autorecon3 is run for lh and rh simultaneously, won't issues occur there?
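To make the hypothesis concrete, here is a minimal, hypothetical Python sketch (not smriprep code): two hemisphere workers each do a read-modify-write on a shared, non-hemisphere-specific stats file. If the critical section were unguarded, both workers could read the same initial contents and one update would be lost, or a reader could see a partial file; serializing with a lock avoids that.

```python
import os
import tempfile
import threading

def update_shared_stats(path, hemi, lock):
    """Read-modify-write a shared stats file; the lock serializes the two
    hemisphere workers so neither update is lost."""
    with lock:
        # Without the lock, both hemis could read the same contents here
        # and the second write would clobber the first one's line.
        contents = open(path).read() if os.path.exists(path) else ""
        contents += f"{hemi} done\n"
        with open(path, "w") as f:
            f.write(contents)

lock = threading.Lock()
path = os.path.join(tempfile.mkdtemp(), "stats.txt")
threads = [threading.Thread(target=update_shared_stats, args=(path, h, lock))
           for h in ("lh", "rh")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(open(path).read())
```

With the lock both hemisphere lines always survive; dropping it reintroduces the lost-update window that an intermittent "No such file or directory" error would be consistent with.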
Thanks for reporting this - we are extremely interested in getting to the bottom of this problem. And yes, this is an open issue that we haven't been able to replicate in-house.
@effigies, does Franz's hypothesis about parallelization sound plausible to you?
If doing it with a single job at a time works, then it's not a parallelism problem, but a concurrency problem. That is, the data dependencies are correct, but it's possible that there is a race condition where one hemisphere ends up modifying a file just as the other tries to read it. This seems strange, since recon-all -parallel does basically the same thing we do. The main difference is that, instead of letting each hemisphere progress as it can, recon-all -parallel runs each parallelizable job as two concurrent processes.
So it may be that we're getting out of lockstep: it's not a race condition where both are running -segstats, but one is running -segstats while the other is running something that fiddles with wmparc.mgz.
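The lockstep difference can be sketched in hypothetical Python (toy model, not how recon-all or smriprep is actually implemented): a barrier forces both hemisphere workers to finish step i before either begins step i+1, so they can never drift apart the way independently progressing workers can.

```python
import threading

def run_hemi(hemi, n_steps, barrier, log, log_lock):
    """Run numbered steps for one hemisphere, synchronizing at a barrier
    after every step (the lockstep schedule)."""
    for step in range(n_steps):
        # ... per-hemisphere work for this step would happen here ...
        barrier.wait()  # lockstep: wait until the other hemi finishes too
        with log_lock:
            log.append((hemi, step))

log, log_lock = [], threading.Lock()
barrier = threading.Barrier(2)  # reusable: resets after each release
threads = [threading.Thread(target=run_hemi, args=(h, 3, barrier, log, log_lock))
           for h in ("lh", "rh")]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the barrier, every step-i entry precedes every step-(i+1) entry,
# so the recorded step numbers come out in non-decreasing order.
print([s for _, s in log])
```

Remove the barrier and one hemisphere can race ahead, giving exactly the scenario above: one worker in -segstats while the other is rewriting wmparc.mgz.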
One question is: does this repeat when resuming? Our recon-all jobs will not try to re-run portions that have already completed, so race conditions should not reproduce consistently, as the timing window should be relatively narrow.
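To spell out why resuming sidesteps the race: resume logic of this kind checks for an existing output before re-running a step, so a step that completed before the crash is skipped on the second pass. A minimal sketch with hypothetical helper names (not FreeSurfer's actual implementation):

```python
import os
import tempfile

def run_step(output_file, work_fn):
    """Hypothetical resume logic: skip a step whose output already exists."""
    if os.path.exists(output_file):
        return "skipped"
    work_fn(output_file)
    return "ran"

def write_output(path):
    with open(path, "w") as f:
        f.write("stats")

workdir = tempfile.mkdtemp()
out = os.path.join(workdir, "wmparc.mgz")
first = run_step(out, write_output)   # output absent: step executes
second = run_step(out, write_output)  # output present: step is skipped
print(first, second)
```

Since only the steps that were mid-flight during the failure get re-run, the two hemispheres rarely line up the same way twice, which is why a genuine race should not reproduce on resume.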
I can confirm that resuming does not lead to this problem.