Multibody refinement exited with an MPI-related error.
Describe your problem
Multi-body refinement with 2 bodies. SigmaAngles and SigmaOffset were set to 0 for the smaller domain. If they are not set to 0, everything runs fine.
Environment:
- OS: Ubuntu 18.04
- MPI runtime: OpenMPI 2.1.1
- RELION version: RELION-3.1.0-commit-1349c5
- Memory: 64 GB
- GPU: 4 GeForce GTX TITAN X
Dataset:
- Box size: 256 px
- Pixel size: 1.33 Å/px
- Number of particles: 50,000
- Description: A tetrameric protein of about 500 kDa
Job options:
- Type of job: MultiBody
- Number of MPI processes: 5
- Number of threads: 2
- Full command (see note.txt in the job directory):
`which relion_refine_mpi` --continue Refine3D/job373/run_it018_optimiser.star --o MultiBody/job381/run --solvent_correct_fsc --multibody_masks 2-bodies-mask.star --oversampling 1 --healpix_order 4 --auto_local_healpix_order 4 --offset_range 3 --offset_step 1.5 --reconstruct_subtracted_bodies --dont_combine_weights_via_disc --pool 30 --pad 2 --skip_gridding --j 2 --gpu "" --pipeline_control MultiBody/job381/
`which relion_flex_analyse` --PCA_orient --model MultiBody/job381/run_model.star --data MultiBody/job381/run_data.star --bodies 2-bodies-mask.star --o MultiBody/job381/analyse --do_maps --k 3 --pipeline_control MultiBody/job381/
Error message:
Here is the end of run.out.
Auto-refine: Estimated accuracy angles= 0.604 degrees; offsets= 0.40698 Angstroms
Body: 0 with rotational accuracy of 1.162 will be kept fixed
Auto-refine: Angular step= 0.46875 degrees; local searches= true
Auto-refine: Offset search range= 1.08358 Angstroms; offset step= 0.305235 Angstroms
[Rodin:12779] *** Process received signal ***
[Rodin:12779] Signal: Aborted (6)
[Rodin:12779] Signal code: (-6)
[Rodin:12779] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7f57b2d4c8a0]
[Rodin:12779] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f57b1a0df47]
[Rodin:12779] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f57b1a0f8b1]
[Rodin:12779] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7f57b2631957]
[Rodin:12779] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ae6)[0x7f57b2637ae6]
[Rodin:12779] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92b21)[0x7f57b2637b21]
[Rodin:12779] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d54)[0x7f57b2637d54]
[Rodin:12779] [ 7] /home/zhu/relion-3.1/bin/relion_refine_mpi(_ZN7MpiNode16report_MPI_ERROREi+0x12a)[0x5612385ee7ca]
[Rodin:12779] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 0 on node Rodin exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Here is the run.err.
3: MPI_ERR_TRUNCATE: message truncated
3: MPI_ERR_TRUNCATE: message truncated
in: /home/zhu/relion-3.1/src/mpi.cpp, line 296
ERROR:
Encountered an MPI-related error, see above. Now exiting...
terminate called after throwing an instance of 'RelionError'
Thanks for your bug report. Until we fix it, please run with very tiny sigma offset and sigma angles (e.g. 0.01), which are effectively zero.
Thanks! The workaround is valid.
I cannot reproduce your problem. Can you show me your body STAR file? Does this happen in the first iteration, or later?
Here is my body STAR file.
data_
loop_
_rlnBodyMaskName
_rlnBodyRotateRelativeTo
_rlnBodySigmaAngles
_rlnBodySigmaOffset
_rlnBodyReferenceName
Mask-and-Ref/mask/IC_lp15_mask.mrc 2 15 3 PostProcess/job374/postprocess.mrc
Mask-and-Ref/mask/TM_lp15_mask.mrc 1 0 0 PostProcess/job374/postprocess.mrc
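For anyone applying the workaround suggested earlier in this thread (using a tiny value such as 0.01 instead of 0), here is a minimal sketch of how the zero sigmas in a body STAR file like the one above could be patched. The function name `patch_zero_sigmas` is hypothetical and not part of RELION, and the parser assumes a single simple loop with whitespace-separated columns and no spaces in file paths.

```python
def patch_zero_sigmas(star_text, tiny=0.01):
    """Replace zero sigma values in a body STAR loop with a tiny positive value.

    Assumption: one simple loop_ block, whitespace-separated columns,
    and no spaces inside file names.
    """
    sigma_labels = ("_rlnBodySigmaAngles", "_rlnBodySigmaOffset")
    labels = []
    out = []
    for line in star_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("_rln"):
            # Column label line; drop any trailing "#N" column index.
            labels.append(stripped.split()[0])
            out.append(line)
            continue
        fields = stripped.split()
        if labels and len(fields) == len(labels):
            # Data row: patch only the sigma columns that are exactly zero.
            for name in sigma_labels:
                if name in labels:
                    i = labels.index(name)
                    if float(fields[i]) == 0.0:
                        fields[i] = str(tiny)
            out.append(" ".join(fields))
        else:
            out.append(line)
    return "\n".join(out)


example = """data_
loop_
_rlnBodyMaskName
_rlnBodyRotateRelativeTo
_rlnBodySigmaAngles
_rlnBodySigmaOffset
_rlnBodyReferenceName
Mask-and-Ref/mask/IC_lp15_mask.mrc 2 15 3 PostProcess/job374/postprocess.mrc
Mask-and-Ref/mask/TM_lp15_mask.mrc 1 0 0 PostProcess/job374/postprocess.mrc
"""

patched = patch_zero_sigmas(example)
print(patched)
```

Only the second body's two sigma columns change; the first body's row is left as it was.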
I just tested another run on a different machine. It happened at iteration 8. Here is the run.err.
3: MPI_ERR_TRUNCATE: message truncated
3: MPI_ERR_TRUNCATE: message truncated
in: /scratch/local/nasapps/relion/src/mpi.cpp, line 296
ERROR:
Encountered an MPI-related error, see above. Now exiting...
=== Backtrace ===
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKSsS1_l+0x4c) [0x44e0fc]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN7MpiNode15relion_MPI_RecvEPvlP15ompi_datatype_tiiP19ompi_communicator_tR20ompi_status_public_t+0x2d2) [0x4ca2e2]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN14MlOptimiserMpi22combineAllWeightedSumsEv+0x37c) [0x4953dc]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x1ab) [0x4899cb]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(main+0x7d) [0x43a26d]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f5ae0c65555]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi() [0x43a129]
==================
ERROR:
Encountered an MPI-related error, see above. Now exiting...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Does setting "Combine iterations through disc?: Yes" in the Compute tab help?
When I set "Combine iterations through disc?" to Yes, the multibody refinement didn't crash, but it won't stop. Right now it is past 200 iterations. I remember this happening to me before: when multibody refinement crashed, I could start the process again with Continue. Sometimes I could repeat Continue a couple of times until the iteration count reached 999, and then the process crashed.
MPI error
Because I cannot reproduce your issue, I cannot help further. Recompiling with a newer version of OpenMPI might help.
No convergence
Look at these lines in run.out:
Auto-refine: Resolution
Auto-refine: Changes in angles
Auto-refine: Estimated accuracy angles=
Auto-refine: Angular step=
For convergence, the resolution and the changes in angles should stop improving, and the angular step must be less than 75 % of the estimated accuracy of the angles. If these keep fluctuating, you can stop the run and continue with --force_converge.
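As a reading aid, the angular-step part of this criterion can be checked with a short script. This is a rough sketch, not RELION's own convergence logic: the function name `angular_step_check` is hypothetical, it covers only the 75 % rule stated above, and the sample text is the two relevant lines from the run.out posted earlier in this thread.

```python
import re

# Patterns for the two run.out lines that carry the numbers we need.
ACCURACY_RE = re.compile(r"Auto-refine: Estimated accuracy angles=\s*([0-9.]+)")
STEP_RE = re.compile(r"Auto-refine: Angular step=\s*([0-9.]+)")


def angular_step_check(run_out_text, fraction=0.75):
    """Return (step, accuracy, ok) using the last values found in the text.

    ok is True when step < fraction * accuracy. Returns None if either
    value is missing. Resolution and angle changes are NOT checked here.
    """
    accuracy = None
    step = None
    for line in run_out_text.splitlines():
        match = ACCURACY_RE.search(line)
        if match:
            accuracy = float(match.group(1))
        match = STEP_RE.search(line)
        if match:
            step = float(match.group(1))
    if accuracy is None or step is None:
        return None
    return step, accuracy, step < fraction * accuracy


sample = """Auto-refine: Estimated accuracy angles= 0.604 degrees; offsets= 0.40698 Angstroms
Auto-refine: Angular step= 0.46875 degrees; local searches= true
"""

result = angular_step_check(sample)
print(result)
```

For the posted lines, 75 % of 0.604 degrees is 0.453 degrees, and the angular step of 0.46875 degrees is still above that, so the angular-step condition is not yet met at that iteration.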
2 body?
First of all, running 2-body refinement with one body fixed is the same as refinement with signal subtraction. There is no point in using MultiBody refinement for this.
I agree that it is the same as signal subtraction. But multibody refinement seems to be easier to set up.
But computationally more demanding.