semop lock error during 3D classification
Running this command interactively on a GPU node with two RTX 2080 Ti cards. The same error occurs when submitting to the Slurm cluster on our HPC.
Running RELION 5.0-beta-3, commit 6331fe.
Command:
mpirun --np 5 --oversubscribe relion_refine_mpi --o Class3D/job055/run --ios Extract/job025/optimisation_set.star --gpu "" --ref InitialModel/box40_bin8_invert.mrc --firstiter_cc --trust_ref_size --ini_high 60 --dont_combine_weights_via_disc --pool 3 --pad 2 --ctf --iter 25 --tau2_fudge 1 --particle_diameter 440 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --j 1 --pipeline_control Class3D/job055/
Error:
Expectation iteration 1 of 25
000/??? sec ~~(,_,"> [oo] ^C
(interrupted with Ctrl-C, then reran the same job via ./07_classify1class.job)
RELION version: 5.0-beta-3-commit-6331fe
Precision: BASE=double, CUDA-ACC=single
=== RELION MPI setup ===
+ Number of MPI processes = 5
+ Leader (0) runs on host = gpu148
+ Follower 1 runs on host = gpu148
+ Follower 2 runs on host = gpu148
+ Follower 3 runs on host = gpu148
+ Follower 4 runs on host = gpu148
==========================
uniqueHost gpu148 has 4 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 2 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 3 mapped to device 1
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 4 mapped to device 1
Device 0 on gpu148 is split between 2 followers
Device 1 on gpu148 is split between 2 followers
Running CPU instructions in double precision.
WARNING: The reference pixel size is 1 A/px, but the pixel size of the first optics group of the data is 11.056 A/px!
WARNING: Although the requested resized pixel size is 11.056 A/px, the actual resized pixel size of the reference will be 10 A/px due to rounding of the box size to an even number.
WARNING: Resizing input reference(s) to pixel_size= 10 and box size= 40 ...
Estimating initial noise spectra from at most 10 particles
0/ 0 sec ............................................................~~(,_,">
CurrentResolution= 57.1429 Angstroms, which requires orientationSampling of at least 14.4 degrees for a particle of diameter 440 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 373248
OrientationalSampling= 15 NrOrientations= 4608
TranslationalSampling= 20 NrTranslations= 81
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 23887872
OrientationalSampling= 7.5 NrOrientations= 36864
TranslationalSampling= 10 NrTranslations= 648
=============================
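As a sanity check (my own arithmetic, not RELION output), the sampling counts above are internally consistent with the job options, so the failure does not look like a corrupted sampling setup:

```shell
#!/bin/sh
# Cross-check the sampling counts printed by relion_refine.
directions=$((12 * 4 * 4))      # HEALPix order 2: 12 * 4^2 = 192 directions
psi=$((360 / 15))               # 15-degree in-plane step -> 24 psi angles
orient=$((directions * psi))    # 4608, matches "NrOrientations= 4608"
trans=81                        # "NrTranslations= 81" (a 9 x 9 offset grid)

echo "coarse: $((orient * trans))"   # 373248, matches the log
echo "fine:   $((36864 * 648))"      # 23887872, matches the oversampled log line
```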
Expectation iteration 1 of 25
4.30/4.30 hrs ............................................................~~(,_,">
Maximization...
0/ 0 sec ............................................................~~(,_,">
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR:
semop lock error
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR:
semop lock error
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR:
semop lock error
=== Backtrace ===
=== Backtrace ===
=== Backtrace ===
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x55bb0ec3942a]
relion_refine_mpi(+0x5e60c) [0x55bb0eb8f60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x55bb0ee2cabb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x55bb0ee48a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x55bb0ec60069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x55bb0ec7710c]
relion_refine_mpi(main+0x52) [0x55bb0ec249c2]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7f14bdc4624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f14bdc46305]
relion_refine_mpi(_start+0x21) [0x55bb0ec28251]
==================
ERROR:
semop lock error
RELION version: 5.0-beta-3-commit-6331fe
exiting with an error ...
[The backtraces from the other two failing followers were identical apart from addresses, each followed by the same "ERROR: semop lock error" and "exiting with an error ..." lines.]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gpu148:295268] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[gpu148:295268] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Environment:
- OS: (not stated)
- MPI runtime: Open MPI (version not stated)
- RELION version: RELION-5.0-beta-3-commit-6331fe
- Memory: (not stated)
- GPU: 2x GeForce RTX 2080 Ti

Dataset:
- Box size: 40 px
- Pixel size: 11.056 Å/px
- Number of particles: (not stated)
- Description: (not stated)

Job options:
- Type of job: Class3D
- Number of MPI processes: 5
- Number of threads: 1 (--j 1)
- Full command: see above
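In case it helps with triage: my (possibly wrong) understanding is that the lock taken in projector.cpp around the Fourier-transform setup is built on System V semaphores, so a failing semop() could point at exhausted or stale kernel semaphore resources on the node rather than at the job itself. The kernel limits and any leftover semaphore sets can be inspected with standard tools (nothing RELION-specific; the ipcrm line is commented out on purpose):

```shell
#!/bin/sh
# Kernel System V semaphore limits: SEMMSL SEMMNS SEMOPM SEMMNI
# (semaphores per set, system-wide total, ops per semop() call, number of sets)
cat /proc/sys/kernel/sem

# List existing semaphore sets; sets left behind by killed jobs show up here.
command -v ipcs >/dev/null 2>&1 && ipcs -s || echo "ipcs not available"

# A stale set owned by the current user could then be removed with:
#   ipcrm -s <semid>
```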