Multibody refinement exited with an MPI-related error.

Open jianghaizhu opened this issue 3 years ago • 9 comments

Describe your problem

Multi-body refinement with 2 bodies. SigmaAngles and SigmaOffset were set to 0 for the smaller domain. If they are not set to 0, everything runs fine.


  • OS: Ubuntu 18.04
  • MPI runtime: OpenMPI 2.1.1
  • RELION version: RELION-3.1.0-commit-1349c5
  • Memory: 64 GB
  • GPU: 4 GeForce GTX TITAN X


  • Box size: 256 px
  • Pixel size: 1.33 Å/px
  • Number of particles: 50,000
  • Description: A tetrameric protein of about 500 kDa

Job options:

  • Type of job: MultiBody
  • Number of MPI processes: 5
  • Number of threads: 2
  • Full command (see note.txt in the job directory):
`which relion_refine_mpi` --continue Refine3D/job373/ --o MultiBody/job381/run --solvent_correct_fsc --multibody_masks --oversampling 1 --healpix_order 4 --auto_local_healpix_order 4 --offset_range 3 --offset_step 1.5 --reconstruct_subtracted_bodies  --dont_combine_weights_via_disc --pool 30 --pad 2  --skip_gridding  --j 2 --gpu ""  --pipeline_control MultiBody/job381/
`which relion_flex_analyse` --PCA_orient  --model MultiBody/job381/ --data MultiBody/job381/ --bodies --o MultiBody/job381/analyse --do_maps  --k 3 --pipeline_control MultiBody/job381/

Error message:

Here is the end of run.out.

 Auto-refine: Estimated accuracy angles= 0.604 degrees; offsets= 0.40698 Angstroms
 Body: 0 with rotational accuracy of 1.162 will be kept fixed 
 Auto-refine: Angular step= 0.46875 degrees; local searches= true
 Auto-refine: Offset search range= 1.08358 Angstroms; offset step= 0.305235 Angstroms
[Rodin:12779] *** Process received signal ***
[Rodin:12779] Signal: Aborted (6)
[Rodin:12779] Signal code:  (-6)
[Rodin:12779] [ 0] /lib/x86_64-linux-gnu/[0x7f57b2d4c8a0]
[Rodin:12779] [ 1] /lib/x86_64-linux-gnu/[0x7f57b1a0df47]
[Rodin:12779] [ 2] /lib/x86_64-linux-gnu/[0x7f57b1a0f8b1]
[Rodin:12779] [ 3] /usr/lib/x86_64-linux-gnu/[0x7f57b2631957]
[Rodin:12779] [ 4] /usr/lib/x86_64-linux-gnu/[0x7f57b2637ae6]
[Rodin:12779] [ 5] /usr/lib/x86_64-linux-gnu/[0x7f57b2637b21]
[Rodin:12779] [ 6] /usr/lib/x86_64-linux-gnu/[0x7f57b2637d54]
[Rodin:12779] [ 7] /home/zhu/relion-3.1/bin/relion_refine_mpi(_ZN7MpiNode16report_MPI_ERROREi+0x12a)[0x5612385ee7ca]
[Rodin:12779] *** End of error message ***
mpirun noticed that process rank 3 with PID 0 on node Rodin exited on signal 6 (Aborted).

Here is the run.err.

  3: MPI_ERR_TRUNCATE: message truncated
  3: MPI_ERR_TRUNCATE: message truncated
in: /home/zhu/relion-3.1/src/mpi.cpp, line 296
Encountered an MPI-related error, see above. Now exiting...
terminate called after throwing an instance of 'RelionError'

jianghaizhu avatar Sep 01 '20 19:09 jianghaizhu

Thanks for your bug report. Until we fix it, please run with very tiny sigma offset and sigma angles (e.g. 0.01), which are effectively zero.
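
A minimal sketch of applying this workaround to a body STAR file. It assumes each data row has five whitespace-separated fields in the order mask, rotate-relative-to, sigma angles, sigma offset, reference (the layout of the rows quoted later in this thread); the function name `soften_zero_sigmas` is hypothetical and not part of RELION.

```python
def _is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def soften_zero_sigmas(star_text, tiny=0.01):
    """Replace sigma values of exactly 0 with a tiny non-zero value.

    Assumption: data rows have 5 fields, with sigma angles and sigma
    offset in the 3rd and 4th positions. All other lines (headers,
    blank lines) are passed through unchanged.
    """
    out = []
    for line in star_text.splitlines():
        fields = line.split()
        changed = False
        if len(fields) == 5 and _is_number(fields[2]) and _is_number(fields[3]):
            for i in (2, 3):
                if float(fields[i]) == 0.0:
                    fields[i] = str(tiny)
                    changed = True
        # Only rewrite rows that were actually modified.
        out.append(" ".join(fields) if changed else line)
    return "\n".join(out)
```

Rows whose sigma values are already non-zero keep their original spacing. Work on a copy of the STAR file and check the result before starting a new job.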

biochem-fan avatar Sep 01 '20 20:09 biochem-fan

Thanks! The workaround is valid.

jianghaizhu avatar Sep 02 '20 02:09 jianghaizhu

I cannot reproduce your problem. Can you show me your body STAR file? Does this happen in the first iteration, or later?

biochem-fan avatar Nov 06 '20 08:11 biochem-fan

Here is my body STAR file.

Mask-and-Ref/mask/IC_lp15_mask.mrc 2   15    3 PostProcess/job374/postprocess.mrc
Mask-and-Ref/mask/TM_lp15_mask.mrc 1    0    0 PostProcess/job374/postprocess.mrc

I just tested another run on a different machine. It happened at iteration 8. Here is the run.err.

  3: MPI_ERR_TRUNCATE: message truncated
  3: MPI_ERR_TRUNCATE: message truncated
in: /scratch/local/nasapps/relion/src/mpi.cpp, line 296
Encountered an MPI-related error, see above. Now exiting...
=== Backtrace  ===
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKSsS1_l+0x4c) [0x44e0fc]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN7MpiNode15relion_MPI_RecvEPvlP15ompi_datatype_tiiP19ompi_communicator_tR20ompi_status_public_t+0x2d2) [0x4ca2e2]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN14MlOptimiserMpi22combineAllWeightedSumsEv+0x37c) [0x4953dc]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x1ab) [0x4899cb]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(main+0x7d) [0x43a26d]
/lib64/ [0x7f5ae0c65555]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi() [0x43a129]
Encountered an MPI-related error, see above. Now exiting...
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

jianghaizhu avatar Nov 12 '20 20:11 jianghaizhu

Does setting "Combine iterations through disc?" to Yes in the Compute tab help?

biochem-fan avatar Nov 12 '20 21:11 biochem-fan

When I set "Combine iterations through disc?" to Yes, the multi-body refinement didn't crash, but it won't stop. Right now, it is past 200 iterations. I remember this happening before: when multi-body refinement crashed, I could restart it with Continue. Sometimes I could repeat Continue a couple of times until the iteration count reached 999, and then the process crashed.

jianghaizhu avatar Nov 13 '20 14:11 jianghaizhu

MPI error

Because I cannot reproduce your issue, I cannot help further. Recompiling with a newer version of OpenMPI might help.

No convergence

Look at these lines in run.out.

Auto-refine: Resolution
Auto-refine: Changes in angles
Auto-refine: Estimated accuracy angles=
Auto-refine: Angular step=

For convergence, the resolution and the changes in angles should stop improving, and the angular step must be less than 75% of the estimated accuracy of the angles. If these keep fluctuating, you can stop the run and continue with --force_converge.
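
A rough sketch of checking the angular-step part of this criterion from run.out. It assumes the log lines look like the ones quoted earlier in this thread. The function name `angular_step_criterion` is hypothetical, and the check covers only the 75% rule, not the resolution or angle-change conditions.

```python
import re

# Patterns based on the run.out lines quoted in this thread.
ACCURACY_RE = re.compile(r"Auto-refine: Estimated accuracy angles=\s*([0-9.]+)")
STEP_RE = re.compile(r"Auto-refine: Angular step=\s*([0-9.]+)")


def angular_step_criterion(run_out_text, fraction=0.75):
    """Check the most recent angular step against the estimated accuracy.

    Returns (met, step, accuracy), or None if either line is missing.
    """
    accuracies = ACCURACY_RE.findall(run_out_text)
    steps = STEP_RE.findall(run_out_text)
    if not accuracies or not steps:
        return None
    accuracy = float(accuracies[-1])  # latest reported value
    step = float(steps[-1])
    return step < fraction * accuracy, step, accuracy
```

For the run.out excerpt in the original report (accuracy 0.604 degrees, step 0.46875 degrees), the threshold is 0.75 × 0.604 = 0.453 degrees, so the step criterion is not yet met.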

2 body?

First of all, running 2-body refinement with one body fixed is the same as refinement with signal subtraction. There is no point in using MultiBody refinement.

biochem-fan avatar Nov 13 '20 16:11 biochem-fan

I agree that it is the same as signal subtraction. But multibody refinement seems to be easier to set up.

jianghaizhu avatar Nov 13 '20 16:11 jianghaizhu

But computationally more demanding.

biochem-fan avatar Nov 13 '20 16:11 biochem-fan