
semop lock error during 3D classification

Open DrJesseHansen opened this issue 6 months ago • 8 comments

I am running this command interactively on a GPU node with two 2080 Ti cards. The same error occurs when submitting to the Slurm cluster on our HPC.

Running RELION 5 beta 3, commit 6331fe.

command:

mpirun --np 5 --oversubscribe relion_refine_mpi --o Class3D/job055/run --ios Extract/job025/optimisation_set.star --gpu "" --ref InitialModel/box40_bin8_invert.mrc --firstiter_cc --trust_ref_size --ini_high 60 --dont_combine_weights_via_disc --pool 3 --pad 2 --ctf --iter 25 --tau2_fudge 1 --particle_diameter 440 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --j 1 --pipeline_control Class3D/job055/

error:



Expectation iteration 1 of 25
000/??? sec ~~(,_,">                                                          [oo]^Cjhansen@gpu148:/mnt/beegfs/schurgrp/jhansen/HTT/RELION5$ ^C
jhansen@gpu148:/mnt/beegfs/schurgrp/jhansen/HTT/RELION5$ ./07_classify1class.job 
RELION version: 5.0-beta-3-commit-6331fe 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes                 = 5
 + Leader      (0) runs on host            = gpu148
 + Follower     1  runs on host            = gpu148
 + Follower     2  runs on host            = gpu148
 + Follower     3  runs on host            = gpu148
 + Follower     4  runs on host            = gpu148
 ==========================
 uniqueHost gpu148 has 4 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 2 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 3 mapped to device 1
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 4 mapped to device 1
Device 0 on gpu148 is split between 2 followers
Device 1 on gpu148 is split between 2 followers
 Running CPU instructions in double precision. 
 WARNING:  The reference pixel size is 1 A/px, but the pixel size of the first optics group of the data is 11.056 A/px! 
WARNING: Although the requested resized pixel size is 11.056 A/px, the actual resized pixel size of the reference will be 10 A/px due to rounding of the box size to an even number. 
WARNING: Resizing input reference(s) to pixel_size= 10 and box size= 40 ...
 Estimating initial noise spectra from at most 10 particles 
   0/   0 sec ............................................................~~(,_,">
 CurrentResolution= 57.1429 Angstroms, which requires orientationSampling of at least 14.4 degrees for a particle of diameter 440 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 373248
 OrientationalSampling= 15 NrOrientations= 4608
 TranslationalSampling= 20 NrTranslations= 81
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 23887872
 OrientationalSampling= 7.5 NrOrientations= 36864
 TranslationalSampling= 10 NrTranslations= 648
=============================
 Expectation iteration 1 of 25
4.30/4.30 hrs ............................................................~~(,_,">
 Maximization...
   0/   0 sec ............................................................~~(,_,">
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR: 
semop lock error
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR: 
semop lock error
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR: 
semop lock error
=== Backtrace  ===
=== Backtrace  ===
=== Backtrace  ===
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x55bb0ec3942a]
relion_refine_mpi(+0x5e60c) [0x55bb0eb8f60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x55bb0ee2cabb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x55bb0ee48a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x55bb0ec60069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x55bb0ec7710c]
relion_refine_mpi(main+0x52) [0x55bb0ec249c2]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7f14bdc4624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f14bdc46305]
relion_refine_mpi(_start+0x21) [0x55bb0ec28251]
==================
ERROR: 
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x56136c52842a]
relion_refine_mpi(+0x5e60c) [0x56136c47e60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x56136c71babb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x56136c737a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x56136c54f069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x56136c56610c]
relion_refine_mpi(main+0x52) [0x56136c5139c2]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7f266aa4624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f266aa46305]
relion_refine_mpi(_start+0x21) [0x56136c517251]
==================
ERROR: 
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x56542492742a]
relion_refine_mpi(+0x5e60c) [0x56542487d60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x565424b1aabb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x565424b36a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x56542494e069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x56542496510c]
relion_refine_mpi(main+0x52) [0x5654249129c2]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7faedd64624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7faedd646305]
relion_refine_mpi(_start+0x21) [0x565424916251]
==================
ERROR: 
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gpu148:295268] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[gpu148:295268] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
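
A quick consistency check on the sampling figures printed in the log above, offered as a sketch rather than a diagnosis of the semop error. It assumes that NrHiddenVariableSamplingPoints is simply NrOrientations multiplied by NrTranslations; the numbers are copied from the log, and the function name `sampling_points` is made up for illustration.

```python
# Values copied from the RELION log above.
# Assumption being checked: hidden-variable sampling points
# = orientations * translations for each oversampling level.
passes = {
    0: {"orientations": 4608, "translations": 81, "reported": 373248},
    1: {"orientations": 36864, "translations": 648, "reported": 23887872},
}


def sampling_points(orientations, translations):
    """Return the number of (orientation, translation) pairs to evaluate."""
    return orientations * translations


for level, p in passes.items():
    computed = sampling_points(p["orientations"], p["translations"])
    status = "matches" if computed == p["reported"] else "differs from"
    print(f"oversampling {level}: computed {computed} {status} the log")

# Growth in search space from oversampling 0 to oversampling 1.
ratio = passes[1]["reported"] // passes[0]["reported"]
print(f"oversampling 1 evaluates {ratio}x as many points as oversampling 0")
```

If the assumption holds, both levels match the log, and the oversampled pass covers 8 times as many orientations and 8 times as many translations as the coarse pass.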




DrJesseHansen, Aug 23 '24 06:08