
Error with parts of the particle stack on the scratch disk

daniel-s-d-larsson opened this issue 3 years ago • 9 comments

My particle stack is too large for the scratch disk, so only part of it can be transferred. During the run, relion_refine_mpi then fails to read the particles correctly during the initial estimation of the noise spectra. See the error messages below.

When the --scratch_dir flag is omitted, everything works as expected. Similar runs with smaller particle stacks that fit entirely on /scratch also work as expected.
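
To spell out what I expected to happen: as I understand the scratch mechanism, particles that did not fit on the SSD should simply be read from their original location. A minimal sketch of that dispatch (illustrative Python, not RELION's actual code; all names and numbers are made up):

```python
# Illustrative sketch only -- NOT RELION's code. It just shows where each
# particle should be read from when the scratch copy is incomplete.

def particle_source(index_1based, n_on_scratch, scratch_stack, original_entry):
    """Return (filename, slice) for one particle.

    index_1based   -- position of the particle within the optics group
                      (1-based, as in RELION's "189398@file.mrcs" notation)
    n_on_scratch   -- how many particles were actually copied to scratch
    original_entry -- (filename, slice) as recorded in the particles .star file
    """
    if index_1based <= n_on_scratch:
        return scratch_stack, index_1based   # read the local scratch copy
    return original_entry                    # fall back to the original file

# Made-up example: particle 5 of 8, with only 3 particles on scratch
print(particle_source(5, 3,
                      "/scratch/relion_volatile/opticsgroup1_particles.mrcs",
                      ("some/original_stack.mrcs", 12)))
```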

Environment:

  • OS: Ubuntu 16.04.7 LTS
  • MPI runtime: OpenMPI 3.1.4
  • RELION version: 3.1.0 (stable release)
  • Memory: 128 GB
  • Scratch: 170 GB SSD

Dataset:

  • Box size: 512 px
  • Pixel size: 0.82 Å/px
  • Number of particles: 378,793
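
For scale (my own back-of-the-envelope estimate, assuming uncompressed 32-bit MRC data), the full stack is far larger than the scratch disk:

```python
# Rough size of the particle stack (assumption: float32 MRC data, no compression)
n_particles = 378_793
box = 512                                 # px
bytes_per_particle = box * box * 4        # 4 bytes per pixel = exactly 1 MiB

total_gib = n_particles * bytes_per_particle / 2**30
print(f"~{total_gib:.0f} GiB")            # ~370 GiB, versus ~170 GB of scratch SSD
```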

Job options:

  • Type of job: Class3D (6 classes, no align)
  • Number of MPI processes: 3
  • Number of threads: 14
  • Full command (see note.txt in the job directory):

srun relion_refine_mpi --o Class3D/job127/run --i Subtract/job124/particles_subtracted.star --ref Refine3D/job112/run_class001.mrc --firstiter_cc --ini_high 30 --dont_combine_weights_via_disc --pool 100 --pad 1 --skip_gridding --ctf --ctf_corrected_ref --iter 100 --tau2_fudge 40 --particle_diameter 320 --K 6 --flatten_solvent --solvent_mask MaskCreate/job035/mask_0.82Apix_512px.mrc --skip_align --sym C1 --norm --scale --j 14 --pipeline_control Class3D/job127/ --scratch_dir /scratch

Error message:

run.out:

RELION version: 3.1.0-commit-GITDIR 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes             = 3
 + Number of threads per MPI process   = 14
 + Total number of threads therefore   = 42
 + Master  (0) runs on host            = b-cn0303.hpc2n.umu.se
 + Slave     1 runs on host            = b-cn0303.hpc2n.umu.se
 =================
 + Slave     2 runs on host            = b-cn0303.hpc2n.umu.se
 Running CPU instructions in double precision. 
 + On host b-cn0303.hpc2n.umu.se: free scratch space = 166.72 Gb.
 Copying particles to scratch directory: /scratch/relion_volatile/
24.58/24.58 min ............................................................~~(,_,">
 For optics_group 1, there are 160481 particles on the scratch disk.
 Estimating initial noise spectra 
000/??? sec ~~(,_,">                                                          [oo]

run.err:

 Warning: scratch space full on b-cn0303.hpc2n.umu.se. Remaining 218312 particles will be read from where they were.
in: /scratch/eb-buildpath/RELION/3.1.0/fosscuda-2019b/relion-3.1.0/src/rwMRC.h, line 192
ERROR: 
readMRC: Image number 189398 exceeds stack size 160481 of image 189398@/scratch/relion_volatile/opticsgroup1_particles.mrcs
=== Backtrace  ===
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x5f) [0x48452f]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN5ImageIdE7readMRCElbRK8FileName+0x74c) [0x4b888c]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN5ImageIdE5_readERK8FileNameR13fImageHandlerblbb+0x1ec) [0x4bc33c]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN11MlOptimiser41calculateSumOfPowerSpectraAndAverageImageER13MultidimArrayIdEb+0x3c9) [0x600589]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi41calculateSumOfPowerSpectraAndAverageImageER13MultidimArrayIdE+0x2c) [0x49e7ac]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x971) [0x4a5d31]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(main+0x4a) [0x4723da]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x149a14f84840]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_start+0x29) [0x474d29]
==================
ERROR: 
readMRC: Image number 189398 exceeds stack size 160481 of image 189398@/scratch/relion_volatile/opticsgroup1_particles.mrcs
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 11040933.0 ON b-cn0303 CANCELLED AT 2021-01-05T09:29:37 ***
srun: error: b-cn0303: task 0: Killed
srun: error: b-cn0303: task 1: Killed
srun: error: b-cn0303: task 2: Exited with exit code 1
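
The numbers in the logs are at least internally consistent: the scratch count plus the "remaining" count add up to the full dataset, and the failing image index lies in the range that should have been read from its original location, yet it is still requested from the scratch stack:

```python
# Consistency check of the numbers reported in run.out / run.err
on_scratch = 160_481    # "there are 160481 particles on the scratch disk"
remaining  = 218_312    # "Remaining 218312 particles will be read from where they were"
failing    = 189_398    # "Image number 189398 exceeds stack size 160481"

assert on_scratch + remaining == 378_793   # the full dataset
assert failing > on_scratch                # so it should not come from the scratch stack at all
```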

daniel-s-d-larsson avatar Jan 05 '21 09:01 daniel-s-d-larsson

I thought I had fixed this issue at some point in 3.0.x, so I am surprised to see it happening in 3.1.0. This is harder to debug now, because I cannot reproduce it locally.

Just to make sure, can you try the latest commit in the ver3.1 branch?

biochem-fan avatar Jan 05 '21 10:01 biochem-fan

This example is running at a facility where I cannot easily recompile RELION myself, so unfortunately I cannot try the latest commit right now, and I don't currently have things set up for testing on my local GPU workstation. I know this has happened to me before on my local machine, but perhaps that was in version 3.0.x.

daniel-s-d-larsson avatar Jan 05 '21 10:01 daniel-s-d-larsson

at a facility, where I cannot easily recompile myself

You don't need root permission to compile RELION.

biochem-fan avatar Jan 05 '21 10:01 biochem-fan

Just to follow up, version 3.1.1 seems to solve the issue for me.

daniel-s-d-larsson avatar Feb 04 '21 21:02 daniel-s-d-larsson

I ran into the problem again. The change from before was that I set "Use parallel disc I/O" to "No". With the option set to "Yes" it works as intended.
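
If I understand the option correctly, "Yes" means every MPI process reads its own particle images from disc, while "No" means only the leader process reads them and sends them to the others over MPI, so the failure may be specific to the leader-reads-everything code path. A toy illustration of the difference (plain Python, no MPI, not RELION code):

```python
# Toy illustration (NOT RELION code) of who opens the image files in each mode
particles = [f"particle_{i:03d}" for i in range(9)]
n_ranks = 3

# "Use parallel disc I/O" = Yes: every rank reads its own share from disc
parallel = {rank: particles[rank::n_ranks] for rank in range(n_ranks)}

# "Use parallel disc I/O" = No: rank 0 reads everything and ships it over MPI
leader_only = {0: list(particles), **{rank: [] for rank in range(1, n_ranks)}}

print(parallel)
print(leader_only)
```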

daniel-s-d-larsson avatar Feb 11 '21 18:02 daniel-s-d-larsson

Is this running over multiple nodes?

biochem-fan avatar Feb 11 '21 18:02 biochem-fan

No, I run on a single node. These are the specs of the node:

  • Intel Xeon Gold 6132 (28 threads)
  • 2 x NVidia V100
  • 192 GB RAM
  • Infiniband
  • Network-attached storage
  • 166.72 GB SSD scratch (way too wimpy for cryo-EM needs...!)

I can fit 160k particles on the scratch disk and have to access the rest over the network.
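
Incidentally, the 160k figure matches the reported free space almost exactly if I assume RELION keeps its default reserve of 10 Gb free on scratch (the --keep_free_scratch option) and counts the reported "Gb" as GiB; each 512 px float32 particle is exactly 1 MiB:

```python
# Why 160,481 particles end up on scratch (assumptions: default --keep_free_scratch
# reserve of 10, and RELION's reported "Gb" being GiB)
free_gib     = 166.72                      # "free scratch space = 166.72 Gb"
reserve_gib  = 10                          # assumed --keep_free_scratch default
mib_per_part = 512 * 512 * 4 / 2**20       # = 1.0 MiB per particle

usable_mib = (free_gib - reserve_gib) * 1024
print(int(usable_mib / mib_per_part))      # -> 160481
```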

daniel-s-d-larsson avatar Feb 11 '21 19:02 daniel-s-d-larsson

We do plan to refactor the scratch system as it has many problems (e.g. #494).

I am afraid I might not be able to fix your problem until that refactoring, as I cannot reproduce this locally and it affects only a narrow use case (a huge dataset with a wimpy SSD).

biochem-fan avatar Feb 11 '21 21:02 biochem-fan

This is understandable. I just wanted to report back my findings in case they help others with the same problem.

daniel-s-d-larsson avatar Feb 11 '21 21:02 daniel-s-d-larsson