
Segmentation fault: address not mapped to object at address 0x80

DrJesseHansen opened this issue 5 months ago · 0 comments

Hi,

I am running 3D auto-refine on 2D particles from tomograms (tomo pipeline with 2D particle extraction). When I stay entirely within the RELION pipeline, everything works well with no issues. However, I am also running the same dataset through the new Linux WARP pipeline in parallel. When I extract the 2D particles in WARP, any job I then run in RELION fails with the segmentation fault below. I've tried 3D classification with 1 class and 3D auto-refine. I've also tried reducing the memory requirements as much as possible: padding set to 1, a translational search of only 2 pixels, and only 2 MPI processes. See my command below. I have 60k particles and the box size is 40x40. I am running RELION 5 beta 3.

This is running in a cluster compute environment on two Nvidia H100 GPUs (SXM5, 80 GB each), so I think GPU memory should not be an issue. I have allocated 200 GB of CPU memory and am monitoring CPU memory usage during the job: it never goes above roughly 90 GB. I am perplexed as to why this is happening. I checked the image stats for the output particles and both stacks have the same map mode (float16), although the min/max values are of course very different because of the WARP vs. RELION extraction. Could this be the issue? Any idea what might be causing this?
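
For reference, this is roughly the kind of check I ran to compare the two particle stacks. The paths and job names below are placeholders for my actual extraction outputs, and the flags should be double-checked against relion_image_handler --help:

# Intensity statistics (min/max/avg/stddev) for each stack; paths are placeholders.
relion_image_handler --i Extract/jobXXX/particles.mrcs --stats
relion_image_handler --i warp_particleseries/particles.mrcs --stats
# The MRC data mode (12 = float16) can be checked with e.g. IMOD's header command:
# header Extract/jobXXX/particles.mrcs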

My command is below:

#!/bin/bash
#SBATCH --ntasks=3
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --time=239:00:00
#SBATCH --mem=200G
#SBATCH --partition=gpu100
#SBATCH --gres=gpu:2
#SBATCH --export=NONE

cd $SLURM_SUBMIT_DIR

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module purge
module load relion/5-beta6
unset SLURM_EXPORT_ENV

# Create necessary directories
mkdir -p Refine3D/job001_local_3_redo

# Run Relion refine process with MPI
mpirun -n 3 `which relion_refine_mpi` \
--o Refine3D/job001_local_3_redo/run \
--auto_refine \
--split_random_halves \
--firstiter_cc \
--ios reextracted_bin8_3D_optimisation_set.star \
--ref InitialModel/recon.mrc \
--trust_ref_size \
--ini_high 40 \
--dont_combine_weights_via_disc \
--pool 10 \
--pad 1  \
--ctf \
--particle_diameter 400 \
--flatten_solvent \
--zero_mask \
--oversampling 1 \
--healpix_order 3 \
--auto_local_healpix_order 3 \
--offset_range 2 \
--offset_step 2 \
--sym C1 \
--low_resol_join_halves 40 \
--norm \
--scale  \
--j 1 \
--gpu ""   

The error I am receiving:

Auto-refine: Iteration= 1
 Auto-refine: Resolution= 40.2036 (no gain for 0 iter) 
 Auto-refine: Changes in angles= 999 degrees; and in offsets= 999 Angstroms (no gain for 0 iter) 
 Estimating accuracies in the orientational assignment ... 
   3/   3 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 1.484 degrees; offsets= 3.89171 Angstroms
 CurrentResolution= 40.2036 Angstroms, which requires orientationSampling of at least 11.25 degrees for a particle of diameter 400 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 945
 OrientationalSampling= 7.5 NrOrientations= 135
 TranslationalSampling= 22.112 NrTranslations= 7
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 60480
 OrientationalSampling= 3.75 NrOrientations= 1080
 TranslationalSampling= 11.056 NrTranslations= 56
=============================
 Expectation iteration 1
7.45/40.35 min ...........~~(,_,">[gpu271:3904135:0:3904135] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x80)
==== backtrace (tid:3904135) ====
 0 0x000000000003c050 __sigaction()  ???:0
 1 0x00000000003d58ff getAllSquaredDifferencesCoarse<MlOptimiserCuda>()  tmpxft_003a2465_00000000-6_cuda_ml_optimiser.cudafe1.cpp:0
 2 0x00000000003d9fc4 accDoExpectationOneParticle<MlOptimiserCuda>()  tmpxft_003a2465_00000000-6_cuda_ml_optimiser.cudafe1.cpp:0
 3 0x00000000003db852 MlOptimiserCuda::doThreadExpectationSomeParticles()  ???:0
 4 0x000000000036b96f globalThreadExpectationSomeParticles()  ???:0
 5 0x000000000036b9e5 MlOptimiser::expectationSomeParticles()  ml_optimiser.cpp:0
 6 0x00000000000140b6 GOMP_parallel()  ???:0
 7 0x0000000000358a6e MlOptimiser::expectationSomeParticles()  ???:0
 8 0x0000000000130bad MlOptimiserMpi::expectation()  ???:0
 9 0x000000000014610c MlOptimiserMpi::iterate()  ???:0
10 0x00000000000f39c2 main()  ???:0
11 0x000000000002724a __libc_init_first()  ???:0
12 0x0000000000027305 __libc_start_main()  ???:0
13 0x00000000000f7251 _start()  ???:0
=================================
[gpu271:3904135] *** Process received signal ***
[gpu271:3904135] Signal: Segmentation fault (11)
[gpu271:3904135] Signal code:  (-6)
[gpu271:3904135] Failing at address: 0xf57ae003b9287
[gpu271:3904135] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x14e786e13050]
[gpu271:3904135] [ 1] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(+0x3d58ff)[0x5581b98b58ff]
[gpu271:3904135] [ 2] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(+0x3d9fc4)[0x5581b98b9fc4]
[gpu271:3904135] [ 3] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2)[0x5581b98bb852]
[gpu271:3904135] [ 4] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x2f)[0x5581b984b96f]
[gpu271:3904135] [ 5] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(+0x36b9e5)[0x5581b984b9e5]
[gpu271:3904135] [ 6] /lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x46)[0x14e786fcc0b6]
[gpu271:3904135] [ 7] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN11MlOptimiser24expectationSomeParticlesEll+0xd5e)[0x5581b9838a6e]
[gpu271:3904135] [ 8] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x1f2d)[0x5581b9610bad]
[gpu271:3904135] [ 9] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc)[0x5581b962610c]
[gpu271:3904135] [10] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(main+0x52)[0x5581b95d39c2]
[gpu271:3904135] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x14e786dfe24a]
[gpu271:3904135] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x14e786dfe305]
[gpu271:3904135] [13] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_start+0x21)[0x5581b95d7251]
[gpu271:3904135] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 3904135 on node gpu271 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Thanks!

DrJesseHansen · Aug 30 '24 14:08