
Error with parts of the particle stack on the scratch disk

daniel-s-d-larsson opened this issue 3 years ago • 9 comments

My particle stack is too large for the scratch disk, so only part of it can be transferred. During the run, relion_refine_mpi then fails to read the particles correctly during the initial estimation of the noise spectra. See the error messages below.

When the --scratch_dir flag is omitted, everything works as expected. Similar runs with smaller particle stacks that fit entirely on /scratch also work as expected.
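
To spell out what I expected to happen: as I understand the scratch mechanism, particles that did not fit on the SSD should simply be read from their original location. A minimal sketch of that dispatch (illustrative Python, not RELION's actual code; all names and numbers are made up):

```python
# Illustrative sketch only -- NOT RELION's code. It just shows where each
# particle should be read from when the scratch copy is incomplete.

def particle_source(index_1based, n_on_scratch, scratch_stack, original_entry):
    """Return (filename, slice) for one particle.

    index_1based   -- position of the particle within the optics group
                      (1-based, as in RELION's "189398@file.mrcs" notation)
    n_on_scratch   -- how many particles were actually copied to scratch
    original_entry -- (filename, slice) as recorded in the particles .star file
    """
    if index_1based <= n_on_scratch:
        return scratch_stack, index_1based   # read the local scratch copy
    return original_entry                    # fall back to the original file

# Made-up example: particle 5 of 8, with only 3 particles on scratch
print(particle_source(5, 3,
                      "/scratch/relion_volatile/opticsgroup1_particles.mrcs",
                      ("some/original_stack.mrcs", 12)))
```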

Environment:

  • OS: Ubuntu 16.04.7 LTS
  • MPI runtime: OpenMPI 3.1.4
  • RELION version: 3.1.0 (stable release)
  • Memory: 128 GB
  • Scratch: 170 GB SSD

Dataset:

  • Box size: 512 px
  • Pixel size: 0.82 Å/px
  • Number of particles: 378,793
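
For scale (my own back-of-the-envelope estimate, assuming uncompressed 32-bit MRC data), the full stack is far larger than the scratch disk:

```python
# Rough size of the particle stack (assumption: float32 MRC data, no compression)
n_particles = 378_793
box = 512                                 # px
bytes_per_particle = box * box * 4        # 4 bytes per pixel = exactly 1 MiB

total_gib = n_particles * bytes_per_particle / 2**30
print(f"~{total_gib:.0f} GiB")            # ~370 GiB, versus ~170 GB of scratch SSD
```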

Job options:

  • Type of job: Class3D (6 classes, no align)
  • Number of MPI processes: 3
  • Number of threads: 14
  • Full command (see note.txt in the job directory):

srun relion_refine_mpi --o Class3D/job127/run --i Subtract/job124/particles_subtracted.star --ref Refine3D/job112/run_class001.mrc --firstiter_cc --ini_high 30 --dont_combine_weights_via_disc --pool 100 --pad 1 --skip_gridding --ctf --ctf_corrected_ref --iter 100 --tau2_fudge 40 --particle_diameter 320 --K 6 --flatten_solvent --solvent_mask MaskCreate/job035/mask_0.82Apix_512px.mrc --skip_align --sym C1 --norm --scale --j 14 --pipeline_control Class3D/job127/ --scratch_dir /scratch

Error message:

run.out:

RELION version: 3.1.0-commit-GITDIR 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes             = 3
 + Number of threads per MPI process   = 14
 + Total number of threads therefore   = 42
 + Master  (0) runs on host            = b-cn0303.hpc2n.umu.se
 + Slave     1 runs on host            = b-cn0303.hpc2n.umu.se
 =================
 + Slave     2 runs on host            = b-cn0303.hpc2n.umu.se
 Running CPU instructions in double precision. 
 + On host b-cn0303.hpc2n.umu.se: free scratch space = 166.72 Gb.
 Copying particles to scratch directory: /scratch/relion_volatile/
24.58/24.58 min ............................................................~~(,_,">
 For optics_group 1, there are 160481 particles on the scratch disk.
 Estimating initial noise spectra 
000/??? sec ~~(,_,">                                                          [oo]

run.err:

 Warning: scratch space full on b-cn0303.hpc2n.umu.se. Remaining 218312 particles will be read from where they were.
in: /scratch/eb-buildpath/RELION/3.1.0/fosscuda-2019b/relion-3.1.0/src/rwMRC.h, line 192
ERROR: 
readMRC: Image number 189398 exceeds stack size 160481 of image 189398@/scratch/relion_volatile/opticsgroup1_particles.mrcs
=== Backtrace  ===
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x5f) [0x48452f]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN5ImageIdE7readMRCElbRK8FileName+0x74c) [0x4b888c]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN5ImageIdE5_readERK8FileNameR13fImageHandlerblbb+0x1ec) [0x4bc33c]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN11MlOptimiser41calculateSumOfPowerSpectraAndAverageImageER13MultidimArrayIdEb+0x3c9) [0x600589]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi41calculateSumOfPowerSpectraAndAverageImageER13MultidimArrayIdE+0x2c) [0x49e7ac]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x971) [0x4a5d31]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(main+0x4a) [0x4723da]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x149a14f84840]
/hpc2n/eb/software/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/RELION/3.1.0/bin/relion_refine_mpi(_start+0x29) [0x474d29]
==================
ERROR: 
readMRC: Image number 189398 exceeds stack size 160481 of image 189398@/scratch/relion_volatile/opticsgroup1_particles.mrcs
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 11040933.0 ON b-cn0303 CANCELLED AT 2021-01-05T09:29:37 ***
srun: error: b-cn0303: task 0: Killed
srun: error: b-cn0303: task 1: Killed
srun: error: b-cn0303: task 2: Exited with exit code 1
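
The numbers in the logs are at least internally consistent: the scratch count plus the "remaining" count add up to the full dataset, and the failing image index lies in the range that should have been read from its original location, yet it is still requested from the scratch stack:

```python
# Consistency check of the numbers reported in run.out / run.err
on_scratch = 160_481    # "there are 160481 particles on the scratch disk"
remaining  = 218_312    # "Remaining 218312 particles will be read from where they were"
failing    = 189_398    # "Image number 189398 exceeds stack size 160481"

assert on_scratch + remaining == 378_793   # the full dataset
assert failing > on_scratch                # so it should not come from the scratch stack at all
```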

daniel-s-d-larsson avatar Jan 05 '21 09:01 daniel-s-d-larsson

I thought I had fixed this issue at some point in 3.0.x, so I am surprised to see it happening in 3.1.0. This is harder to debug now, because I cannot reproduce it locally.

Just to make sure, can you try the latest commit in the ver3.1 branch?

biochem-fan avatar Jan 05 '21 10:01 biochem-fan

This example is running at a facility where I cannot easily recompile RELION myself, so unfortunately I cannot try the latest commit right now, and I don't currently have things set up for testing on my local GPU workstation. I know this has happened to me before on my local machine, but perhaps that was in version 3.0.x.

daniel-s-d-larsson avatar Jan 05 '21 10:01 daniel-s-d-larsson

at a facility, where I cannot easily recompile myself

You don't need root permission to compile RELION.

biochem-fan avatar Jan 05 '21 10:01 biochem-fan

Just to follow up, version 3.1.1 seems to solve the issue for me.

daniel-s-d-larsson avatar Feb 04 '21 21:02 daniel-s-d-larsson

I ran into the problem again. The change from before was that I set "Use parallel disc I/O" to "No". With the option set to "Yes" it works as intended.
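
If I understand the option correctly, "Yes" means every MPI process reads its own particle images from disc, while "No" means only the leader process reads them and sends them to the others over MPI, so the failure may be specific to the leader-reads-everything code path. A toy illustration of the difference (plain Python, no MPI, not RELION code):

```python
# Toy illustration (NOT RELION code) of who opens the image files in each mode
particles = [f"particle_{i:03d}" for i in range(9)]
n_ranks = 3

# "Use parallel disc I/O" = Yes: every rank reads its own share from disc
parallel = {rank: particles[rank::n_ranks] for rank in range(n_ranks)}

# "Use parallel disc I/O" = No: rank 0 reads everything and ships it over MPI
leader_only = {0: list(particles), **{rank: [] for rank in range(1, n_ranks)}}

print(parallel)
print(leader_only)
```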

daniel-s-d-larsson avatar Feb 11 '21 18:02 daniel-s-d-larsson

Is this running over multiple nodes?

biochem-fan avatar Feb 11 '21 18:02 biochem-fan

No, I run on a single node. These are the specs of the node:

  • Intel Xeon Gold 6132 (28 threads)
  • 2 x NVidia V100
  • 192 GB RAM
  • Infiniband
  • Network-attached storage
  • 166.72 GB SSD scratch (way too wimpy for cryo-EM needs...!)

I can fit 160k particles on the scratch disk and have to access the rest over the network.
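
Incidentally, the 160k figure matches the reported free space almost exactly if I assume RELION keeps its default reserve of 10 Gb free on scratch (the --keep_free_scratch option) and counts the reported "Gb" as GiB; each 512 px float32 particle is exactly 1 MiB:

```python
# Why 160,481 particles end up on scratch (assumptions: default --keep_free_scratch
# reserve of 10, and RELION's reported "Gb" being GiB)
free_gib     = 166.72                      # "free scratch space = 166.72 Gb"
reserve_gib  = 10                          # assumed --keep_free_scratch default
mib_per_part = 512 * 512 * 4 / 2**20       # = 1.0 MiB per particle

usable_mib = (free_gib - reserve_gib) * 1024
print(int(usable_mib / mib_per_part))      # -> 160481
```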

daniel-s-d-larsson avatar Feb 11 '21 19:02 daniel-s-d-larsson

We do plan to refactor the scratch system as it has many problems (e.g. #494).

I am afraid I might not be able to fix your problem until that refactoring, as I cannot reproduce this locally and it affects only a narrow use case (a huge dataset with a wimpy SSD).

biochem-fan avatar Feb 11 '21 21:02 biochem-fan

This is understandable. I just wanted to report back my findings in case they help others with the same problem.

daniel-s-d-larsson avatar Feb 11 '21 21:02 daniel-s-d-larsson