relion
Error with parts of the particle stack on the scratch disk
My particle stack is too large for the scratch disk, so only part of it can be transferred. During the run, relion_refine_mpi then fails to find the particles correctly during the initial estimation of the noise spectra; see the error messages below.
When the --scratch_dir flag is omitted, things work as expected. Similar runs with smaller particle stacks copied to /scratch also work as expected.
Environment:
- OS: Ubuntu 16.04.7 LTS
- MPI runtime: OpenMPI 3.1.4
- RELION version: 3.1.0 (stable release)
- Memory: 128 GB
- Scratch: 170 GB of SSD
Dataset:
- Box size: 512 px
- Pixel size: 0.82 Å/px
- Number of particles: 378,793
Job options:
- Type of job: Class3D (6 classes, no align)
- Number of MPI processes: 3
- Number of threads: 14
- Full command (see note.txt in the job directory):
srun relion_refine_mpi --o Class3D/job127/run --i Subtract/job124/particles_subtracted.star --ref Refine3D/job112/run_class001.mrc --firstiter_cc --ini_high 30 --dont_combine_weights_via_disc --pool 100 --pad 1 --skip_gridding --ctf --ctf_corrected_ref --iter 100 --tau2_fudge 40 --particle_diameter 320 --K 6 --flatten_solvent --solvent_mask MaskCreate/job035/mask_0.82Apix_512px.mrc --skip_align --sym C1 --norm --scale --j 14 --pipeline_control Class3D/job127/ --scratch_dir /scratch
Error message:
run.out:
RELION version: 3.1.0-commit-GITDIR
Precision: BASE=double, CUDA-ACC=single
=== RELION MPI setup ===
+ Number of MPI processes = 3
+ Number of threads per MPI process = 14
+ Total number of threads therefore = 42
+ Master (0) runs on host = b-cn0303.hpc2n.umu.se
+ Slave 1 runs on host = b-cn0303.hpc2n.umu.se
=================
+ Slave 2 runs on host = b-cn0303.hpc2n.umu.se
Running CPU instructions in double precision.
+ On host b-cn0303.hpc2n.umu.se: free scratch space = 166.72 Gb.
Copying particles to scratch directory: /scratch/relion_volatile/
24.58/24.58 min ............................................................~~(,_,">
For optics_group 1, there are 160481 particles on the scratch disk.
Estimating initial noise spectra
000/??? sec ~~(,_,"> [oo]
run.err:
Warning: scratch space full on b-cn0303.hpc2n.umu.se. Remaining 218312 particles will be read from where they were.
in: /scratch/eb-buildpath/RELION/3.1.0/fosscuda-2019b/relion-3.1.0/src/rwMRC.h, line 192
ERROR:
readMRC: Image number 189398 exceeds stack size 160481 of image 189398@/scratch/relion_volatile/opticsgroup1_particles.mrcs
=== Backtrace ===
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x5f) [0x48452f]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN5ImageIdE7readMRCElbRK8FileName+0x74c) [0x4b888c]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN5ImageIdE5_readERK8FileNameR13fImageHandlerblbb+0x1ec) [0x4bc33c]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN11MlOptimiser41calculateSumOfPowerSpectraAndAverageImageER13MultidimArrayIdEb+0x3c9) [0x600589]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi41calculateSumOfPowerSpectraAndAverageImageER13MultidimArrayIdE+0x2c) [0x49e7ac]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x971) [0x4a5d31]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(main+0x4a) [0x4723da]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x149a14f84840]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_start+0x29) [0x474d29]
==================
ERROR:
readMRC: Image number 189398 exceeds stack size 160481 of image 189398@/scratch/relion_volatile/opticsgroup1_particles.mrcs
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 11040933.0 ON b-cn0303 CANCELLED AT 2021-01-05T09:29:37 ***
srun: error: b-cn0303: task 0: Killed
srun: error: b-cn0303: task 1: Killed
srun: error: b-cn0303: task 2: Exited with exit code 1
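For context on the failure mode: once scratch fills up, the remaining particles are supposed to be read from their original location, so the image lookup needs a per-particle branch. Here is a minimal sketch of that intended logic (hypothetical names, not RELION's actual code); the crash above is consistent with every index being resolved against the scratch stack, with the fallback branch never taken:

```python
def particle_source(idx, n_on_scratch, scratch_stack, original_source):
    """Resolve where particle number idx (1-based, MRC stack convention)
    should be read from.

    Hypothetical sketch: the first n_on_scratch particles were copied to
    the local SSD; any later particle must fall back to its original
    (network) location. Skipping this check makes every index hit the
    scratch stack, producing errors like
    'Image number 189398 exceeds stack size 160481'.
    """
    if idx <= n_on_scratch:
        return f"{idx}@{scratch_stack}"
    # Fallback: read from wherever the particle lived before copying.
    return original_source(idx)
```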
I thought I had fixed this issue at some point in 3.0.x, so I am surprised to see it happening in 3.1.0. It is also harder to debug now, because it does not happen locally.
Just to make sure, can you try the latest commit in the ver3.1 branch?
This example is running at a facility where I cannot easily recompile RELION myself, so unfortunately I cannot try the latest commit, and I don't currently have things set up for testing on my local GPU workstation. I know this has happened to me before on my local machine, but perhaps that was in version 3.0.x.
at a facility, where I cannot easily recompile myself
You don't need root permission to compile RELION.
Just to follow up, version 3.1.1 seems to solve the issue for me.
I ran into the problem again. The change from before was that I set "Use parallel disc I/O" to "No". With the option set to "Yes" it works as intended.
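For anyone mapping this back to the command line: I believe the GUI option corresponds to relion_refine's --no_parallel_disc_io flag (an assumption on my part; check `relion_refine --help` on your install), so the failing and working variants differ only in that flag:

```shell
# "Use parallel disc I/O: No" — the master reads all particle images and
# distributes them to the slaves over MPI (the combination that crashed here):
srun relion_refine_mpi ... --scratch_dir /scratch --no_parallel_disc_io

# "Use parallel disc I/O: Yes" — each MPI rank reads its own particles
# directly (the combination that worked as intended):
srun relion_refine_mpi ... --scratch_dir /scratch
```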
Is this running over multiple nodes?
No, I run on a single node. These are the specs of the node:
- CPU: Intel Xeon Gold 6132 (28 threads)
- GPU: 2 x NVidia V100
- Memory: 192 GB RAM
- Network: Infiniband, network-attached storage
- Scratch: 166.72 GB SSD (way too wimpy for cryo-EM needs...!)
I can fit 160k particles on the scratch and have to access the rest over the network.
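The particle count on scratch is consistent with simple arithmetic: a 512 px float32 image is exactly 1 MiB, and RELION reserves some scratch space (10 GiB assumed below, which I believe matches the default of its --keep_free_scratch option; I am also assuming the "166.72 Gb" in run.out means GiB). A back-of-the-envelope check:

```python
# Back-of-the-envelope check of how many particles fit on scratch.
box = 512                      # box size in pixels
bytes_per_pixel = 4            # float32, as RELION writes MRC stacks
particle_bytes = box * box * bytes_per_pixel   # exactly 1 MiB per image

free_gib = 166.72              # free scratch space reported in run.out
keep_free_gib = 10             # assumed reserve (--keep_free_scratch default?)

usable_bytes = (free_gib - keep_free_gib) * 1024**3
n_fit = int(usable_bytes // particle_bytes)
print(n_fit)                   # matches the 160481 particles reported on scratch

n_total = 378793               # total particles in the dataset
print(n_total - n_fit)         # remainder read over the network: 218312,
                               # matching the warning in run.err
```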
We do plan to refactor the scratch system as it has many problems (e.g. #494).
I am afraid I might not be able to fix your problem until that refactoring, as I cannot reproduce it locally and it affects only a narrow use case (a huge dataset with a wimpy SSD).
This is understandable. I wanted to report back my finding in case it would help others with the same problem.