DrJesseHansen opened this issue 6 months ago

I am running this command interactively on a GPU node with two 2080 Ti cards. The same error occurs when submitting the job to the Slurm cluster on our HPC.

Running RELION 5.0 beta 3, commit 6331fe.


mpirun --np 5 --oversubscribe relion_refine_mpi --o Class3D/job055/run --ios Extract/job025/ --gpu "" --ref InitialModel/box40_bin8_invert.mrc --firstiter_cc --trust_ref_size --ini_high 60 --dont_combine_weights_via_disc --pool 3 --pad 2 --ctf --iter 25 --tau2_fudge 1 --particle_diameter 440 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --j 1 --pipeline_control Class3D/job055/


Expectation iteration 1 of 25
000/??? sec ~~(,_,">                                                          [oo]^Cjhansen@gpu148:/mnt/beegfs/schurgrp/jhansen/HTT/RELION5$ ^C
jhansen@gpu148:/mnt/beegfs/schurgrp/jhansen/HTT/RELION5$ ./07_classify1class.job 
RELION version: 5.0-beta-3-commit-6331fe 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes                 = 5
 + Leader      (0) runs on host            = gpu148
 + Follower     1  runs on host            = gpu148
 + Follower     2  runs on host            = gpu148
 + Follower     3  runs on host            = gpu148
 + Follower     4  runs on host            = gpu148
 uniqueHost gpu148 has 4 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 2 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 3 mapped to device 1
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 4 mapped to device 1
Device 0 on gpu148 is split between 2 followers
Device 1 on gpu148 is split between 2 followers
 Running CPU instructions in double precision. 
 WARNING:  The reference pixel size is 1 A/px, but the pixel size of the first optics group of the data is 11.056 A/px! 
WARNING: Although the requested resized pixel size is 11.056 A/px, the actual resized pixel size of the reference will be 10 A/px due to rounding of the box size to an even number. 
WARNING: Resizing input reference(s) to pixel_size= 10 and box size= 40 ...
 Estimating initial noise spectra from at most 10 particles 
   0/   0 sec ............................................................~~(,_,">
 CurrentResolution= 57.1429 Angstroms, which requires orientationSampling of at least 14.4 degrees for a particle of diameter 440 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 373248
 OrientationalSampling= 15 NrOrientations= 4608
 TranslationalSampling= 20 NrTranslations= 81
 Oversampling= 1 NrHiddenVariableSamplingPoints= 23887872
 OrientationalSampling= 7.5 NrOrientations= 36864
 TranslationalSampling= 10 NrTranslations= 648
 Expectation iteration 1 of 25
4.30/4.30 hrs ............................................................~~(,_,">
   0/   0 sec ............................................................~~(,_,">
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
semop lock error
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
semop lock error
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
semop lock error
=== Backtrace  ===
=== Backtrace  ===
=== Backtrace  ===
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x55bb0ec3942a]
relion_refine_mpi(+0x5e60c) [0x55bb0eb8f60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x55bb0ee2cabb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x55bb0ee48a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x55bb0ec60069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x55bb0ec7710c]
relion_refine_mpi(main+0x52) [0x55bb0ec249c2]
/lib/x86_64-linux-gnu/ [0x7f14bdc4624a]
/lib/x86_64-linux-gnu/ [0x7f14bdc46305]
relion_refine_mpi(_start+0x21) [0x55bb0ec28251]
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x56136c52842a]
relion_refine_mpi(+0x5e60c) [0x56136c47e60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x56136c71babb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x56136c737a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x56136c54f069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x56136c56610c]
relion_refine_mpi(main+0x52) [0x56136c5139c2]
/lib/x86_64-linux-gnu/ [0x7f266aa4624a]
/lib/x86_64-linux-gnu/ [0x7f266aa46305]
relion_refine_mpi(_start+0x21) [0x56136c517251]
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x56542492742a]
relion_refine_mpi(+0x5e60c) [0x56542487d60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x565424b1aabb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x565424b36a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x56542494e069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x56542496510c]
relion_refine_mpi(main+0x52) [0x5654249129c2]
/lib/x86_64-linux-gnu/ [0x7faedd64624a]
/lib/x86_64-linux-gnu/ [0x7faedd646305]
relion_refine_mpi(_start+0x21) [0x565424916251]
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
[gpu148:295268] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[gpu148:295268] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
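
For anyone trying to reproduce this: `semop` is the System V semaphore operation call, so it may be worth recording the node's semaphore limits and how many semaphore sets exist when the job fails. The sketch below is only a way to gather that information, not a diagnosis. It assumes a Linux node where `/proc/sys/kernel/sem` and `/proc/sysvipc/sem` are readable; the helper names `parse_sem_limits` and `count_sem_sets` are made up for this example and are not part of RELION.

```python
from pathlib import Path

# Field order of /proc/sys/kernel/sem as documented in proc(5).
SEM_FIELDS = ("SEMMSL", "SEMMNS", "SEMOPM", "SEMMNI")


def parse_sem_limits(text):
    """Parse the four integers from /proc/sys/kernel/sem into a dict."""
    values = [int(tok) for tok in text.split()]
    if len(values) != len(SEM_FIELDS):
        raise ValueError(f"expected {len(SEM_FIELDS)} fields, got {len(values)}")
    return dict(zip(SEM_FIELDS, values))


def count_sem_sets(text):
    """Count semaphore sets listed in /proc/sysvipc/sem (first line is a header)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


if __name__ == "__main__":
    limits_path = Path("/proc/sys/kernel/sem")   # assumed Linux-only location
    sets_path = Path("/proc/sysvipc/sem")        # assumed Linux-only location
    if limits_path.exists():
        print("limits:", parse_sem_limits(limits_path.read_text()))
    if sets_path.exists():
        print("semaphore sets in use:", count_sem_sets(sets_path.read_text()))
```

Running this on gpu148 before and after a failed job would show whether the number of semaphore sets approaches the `SEMMNI` limit; the same information is available from `ipcs -s` and `ipcs -ls`.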

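As a sanity check on the sampling lines in the log, the reported `NrHiddenVariableSamplingPoints` values match the product of the number of classes (`--K 1`), `NrOrientations` and `NrTranslations`. The product rule is inferred from the logged numbers only, not taken from the RELION source, and `hidden_variable_points` is a hypothetical helper name.

```python
# Values copied from the log above; K comes from the --K 1 flag in the command.
K = 1
passes = {
    "oversampling 0": {"orientations": 4608, "translations": 81, "reported": 373248},
    "oversampling 1": {"orientations": 36864, "translations": 648, "reported": 23887872},
}


def hidden_variable_points(n_classes, n_orient, n_trans):
    """Assumed rule: classes x orientations x translations."""
    return n_classes * n_orient * n_trans


for name, p in passes.items():
    computed = hidden_variable_points(K, p["orientations"], p["translations"])
    # Print the computed count and whether it agrees with the logged value.
    print(name, computed, computed == p["reported"])
```

Both passes agree with the log, so the oversampled pass evaluates 64 times as many sampling points per particle as the coarse pass (8 times the orientations and 8 times the translations).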
DrJesseHansen, Aug 23 '24 06:08