
Multibody refinement exited with an MPI-related error.

jianghaizhu opened this issue 3 years ago • 9 comments

Describe your problem

Multi-body refinement with 2 bodies crashes when SigmaAngles and SigmaOffset are set to 0 for the smaller domain. If they are not set to 0, everything runs just fine.

Environment:

  • OS: Ubuntu 18.04
  • MPI runtime: OpenMPI 2.1.1
  • RELION version: RELION-3.1.0-commit-1349c5
  • Memory: 64 GB
  • GPU: 4 GeForce GTX TITAN X

Dataset:

  • Box size: 256 px
  • Pixel size: 1.33 Å/px
  • Number of particles: 50,000
  • Description: A tetrameric protein of about 500 kDa

Job options:

  • Type of job: MultiBody
  • Number of MPI processes: 5
  • Number of threads: 2
  • Full command (see note.txt in the job directory):
`which relion_refine_mpi` --continue Refine3D/job373/run_it018_optimiser.star --o MultiBody/job381/run --solvent_correct_fsc --multibody_masks 2-bodies-mask.star --oversampling 1 --healpix_order 4 --auto_local_healpix_order 4 --offset_range 3 --offset_step 1.5 --reconstruct_subtracted_bodies  --dont_combine_weights_via_disc --pool 30 --pad 2  --skip_gridding  --j 2 --gpu ""  --pipeline_control MultiBody/job381/
`which relion_flex_analyse` --PCA_orient  --model MultiBody/job381/run_model.star --data MultiBody/job381/run_data.star --bodies 2-bodies-mask.star --o MultiBody/job381/analyse --do_maps  --k 3 --pipeline_control MultiBody/job381/
  

Error message:

Here is the end of run.out.

 Auto-refine: Estimated accuracy angles= 0.604 degrees; offsets= 0.40698 Angstroms
 Body: 0 with rotational accuracy of 1.162 will be kept fixed 
 Auto-refine: Angular step= 0.46875 degrees; local searches= true
 Auto-refine: Offset search range= 1.08358 Angstroms; offset step= 0.305235 Angstroms
[Rodin:12779] *** Process received signal ***
[Rodin:12779] Signal: Aborted (6)
[Rodin:12779] Signal code:  (-6)
[Rodin:12779] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7f57b2d4c8a0]
[Rodin:12779] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f57b1a0df47]
[Rodin:12779] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f57b1a0f8b1]
[Rodin:12779] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7f57b2631957]
[Rodin:12779] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ae6)[0x7f57b2637ae6]
[Rodin:12779] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92b21)[0x7f57b2637b21]
[Rodin:12779] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d54)[0x7f57b2637d54]
[Rodin:12779] [ 7] /home/zhu/relion-3.1/bin/relion_refine_mpi(_ZN7MpiNode16report_MPI_ERROREi+0x12a)[0x5612385ee7ca]
[Rodin:12779] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 0 on node Rodin exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Here is the run.err.

  3: MPI_ERR_TRUNCATE: message truncated
  3: MPI_ERR_TRUNCATE: message truncated
in: /home/zhu/relion-3.1/src/mpi.cpp, line 296
ERROR: 
Encountered an MPI-related error, see above. Now exiting...
terminate called after throwing an instance of 'RelionError'

jianghaizhu avatar Sep 01 '20 19:09 jianghaizhu

Thanks for your bug report. Until we fix it, please run with very tiny sigma offset and sigma angles (e.g. 0.01), which are effectively zero.
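
As a sketch of what this looks like in a body STAR file (the mask and reference names here are placeholders, not your actual files), the body that should stay effectively fixed would get 0.01 instead of 0 in the sigma columns:

```
data_
loop_
_rlnBodyMaskName
_rlnBodyRotateRelativeTo
_rlnBodySigmaAngles
_rlnBodySigmaOffset
_rlnBodyReferenceName
body1_mask.mrc 2   15      3     reference.mrc
body2_mask.mrc 1    0.01   0.01  reference.mrc
```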

biochem-fan avatar Sep 01 '20 20:09 biochem-fan

Thanks! The workaround works.

jianghaizhu avatar Sep 02 '20 02:09 jianghaizhu

I cannot reproduce your problem. Can you show me your body STAR file? Does this happen in the first iteration, or later?

biochem-fan avatar Nov 06 '20 08:11 biochem-fan

Here is my body STAR file.

data_
loop_
_rlnBodyMaskName
_rlnBodyRotateRelativeTo
_rlnBodySigmaAngles
_rlnBodySigmaOffset
_rlnBodyReferenceName
Mask-and-Ref/mask/IC_lp15_mask.mrc 2   15    3 PostProcess/job374/postprocess.mrc
Mask-and-Ref/mask/TM_lp15_mask.mrc 1    0    0 PostProcess/job374/postprocess.mrc

I just tested another run on a different machine. It crashed at iteration 8. Here is the run.err.

  3: MPI_ERR_TRUNCATE: message truncated
  3: MPI_ERR_TRUNCATE: message truncated
in: /scratch/local/nasapps/relion/src/mpi.cpp, line 296
ERROR: 
Encountered an MPI-related error, see above. Now exiting...
=== Backtrace  ===
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKSsS1_l+0x4c) [0x44e0fc]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN7MpiNode15relion_MPI_RecvEPvlP15ompi_datatype_tiiP19ompi_communicator_tR20ompi_status_public_t+0x2d2) [0x4ca2e2]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN14MlOptimiserMpi22combineAllWeightedSumsEv+0x37c) [0x4953dc]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x1ab) [0x4899cb]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi(main+0x7d) [0x43a26d]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f5ae0c65555]
/mnt/nasapps/production/relion/3.1/bin/relion_refine_mpi() [0x43a129]
==================
ERROR: 
Encountered an MPI-related error, see above. Now exiting...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

jianghaizhu avatar Nov 12 '20 20:11 jianghaizhu

Does setting Combine iterations through disc?: Yes in the Compute tab help?

biochem-fan avatar Nov 12 '20 21:11 biochem-fan

When I turned on Combine iterations through disc?: Yes, the multi-body refinement didn't crash, but it won't stop; right now it is past 200 iterations. I remember from earlier runs that when a multi-body refinement crashed, I could restart the process with Continue. Sometimes I could repeat Continue a couple of times until the iteration count reached 999, and then the process crashed.

jianghaizhu avatar Nov 13 '20 14:11 jianghaizhu

MPI error

Because I cannot reproduce your issue, I cannot help further. Recompiling with a newer version of OpenMPI might help.

No convergence

Look at these lines in run.out.

Auto-refine: Resolution
Auto-refine: Changes in angles
Auto-refine: Estimated accuracy angles=
Auto-refine: Angular step=

For convergence, the resolution and the changes in angles should stop improving, and the angular step must be less than 75% of the estimated angular accuracy. If these values keep fluctuating, you can stop the run and continue it with --force_converge.
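
The angular-step criterion can be checked by hand against run.out. The following is a rough sketch (not part of RELION; the function name and the 0.75 factor default are my own) that pulls the last reported values out of the log text and applies the 75% rule:

```python
import re

def angular_step_converged(run_out_text, factor=0.75):
    """Check the angular-step criterion: the last reported angular step
    must be below `factor` times the last estimated angular accuracy.
    Returns None if either line has not appeared in the log yet."""
    acc = re.findall(r"Estimated accuracy angles=\s*([\d.]+)", run_out_text)
    step = re.findall(r"Angular step=\s*([\d.]+)", run_out_text)
    if not acc or not step:
        return None
    return float(step[-1]) < factor * float(acc[-1])

# The two lines below are taken from the run.out excerpt in this thread:
log = """\
 Auto-refine: Estimated accuracy angles= 0.604 degrees; offsets= 0.40698 Angstroms
 Auto-refine: Angular step= 0.46875 degrees; local searches= true
"""
# 0.46875 is not below 0.75 * 0.604 = 0.453, so this run has not converged yet
print(angular_step_converged(log))
```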

2 body?

First of all, running a 2-body refinement with one body fixed is the same as a refinement with signal subtraction. There is no point in using multi-body refinement.

biochem-fan avatar Nov 13 '20 16:11 biochem-fan

I agree that it is the same as signal subtraction. But multibody refinement seems to be easier to set up.

jianghaizhu avatar Nov 13 '20 16:11 jianghaizhu

But computationally more demanding.

biochem-fan avatar Nov 13 '20 16:11 biochem-fan