Segmentation fault: address not mapped to object at address 0x80
Hi,
I am running 3D auto-refine on 2D particles from tomograms (the tomo pipeline, extracting 2D particles). When I stay within the RELION pipeline everything works well, with no issues. However, I am also running the same dataset through the new Linux WARP pipeline in parallel. I extract the 2D particles in WARP, and when I run any job in RELION I get the segmentation fault below. I've tried 3D classification with a single class and 3D auto-refine. I've also tried reducing memory requirements as much as possible: padding set to 1, a translational search of only 2 pixels, and the MPI run reduced to only 2 processes. See my command below. I have 60k particles and the box size is 40x40. I am running RELION 5 (beta 3).
This is running in a cluster compute environment on two NVIDIA H100 GPUs (SXM5, 80 GB), so I don't think GPU memory should be an issue. I have allocated 200 GB of CPU memory and am monitoring CPU memory usage during the job: it never goes above roughly 90 GB. I am perplexed as to why this is happening. I checked the image stats of the extracted particles: both stacks have the same map mode (float16), but the min/max values are of course quite different between the WARP and RELION extractions. Could this be the issue? Any idea what might be causing this?
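For reference, here is a minimal sketch of the kind of check I mean, using the mrcfile Python package; the stack paths below are just placeholders for one WARP-extracted and one RELION-extracted stack:

import mrcfile
import numpy as np

# Placeholder paths: one WARP-extracted and one RELION-extracted particle stack
stacks = {
    "WARP": "warp_particles/TS_01_bin8.mrcs",
    "RELION": "Extract/job012/TS_01_particles_bin8.mrcs",
}

for label, path in stacks.items():
    # Reading float16 (mode 12) needs a reasonably recent mrcfile version
    with mrcfile.open(path, permissive=True) as mrc:
        data = np.asarray(mrc.data, dtype=np.float32)
        print(f"{label}: MRC mode {int(mrc.header.mode)} "
              f"(12 = float16, 2 = float32), "
              f"min {data.min():.4g}, max {data.max():.4g}, "
              f"mean {data.mean():.4g}, std {data.std():.4g}")

# To rule out float16 storage as the culprit, one could re-save a WARP stack
# as float32 and repoint the particle STAR file at the new file:
with mrcfile.open(stacks["WARP"], permissive=True) as mrc:
    data32 = np.asarray(mrc.data, dtype=np.float32)
with mrcfile.new(stacks["WARP"].replace(".mrcs", "_f32.mrcs"), overwrite=True) as out:
    out.set_data(data32)
    out.set_image_stack()  # mark the 3D array as a stack of 2D images

If refinement were to run cleanly on such a float32 copy, that would point at the float16 handling rather than the particle values themselves.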
My command is below:
#!/bin/bash
#SBATCH --ntasks=3
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --time=239:00:00
#SBATCH --mem=200G
#SBATCH --partition=gpu100
#SBATCH --gres=gpu:2
#SBATCH --export=NONE
cd $SLURM_SUBMIT_DIR
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module purge
module load relion/5-beta6
unset SLURM_EXPORT_ENV
# Create necessary directories
mkdir -p Refine3D/job001_local_3_redo
# Run Relion refine process with MPI
mpirun -n 3 `which relion_refine_mpi` \
--o Refine3D/job001_local_3_redo/run \
--auto_refine \
--split_random_halves \
--firstiter_cc \
--ios reextracted_bin8_3D_optimisation_set.star \
--ref InitialModel/recon.mrc \
--trust_ref_size \
--ini_high 40 \
--dont_combine_weights_via_disc \
--pool 10 \
--pad 1 \
--ctf \
--particle_diameter 400 \
--flatten_solvent \
--zero_mask \
--oversampling 1 \
--healpix_order 3 \
--auto_local_healpix_order 3 \
--offset_range 2 \
--offset_step 2 \
--sym C1 \
--low_resol_join_halves 40 \
--norm \
--scale \
--j 1 \
--gpu ""
The error I am receiving:
Auto-refine: Iteration= 1
Auto-refine: Resolution= 40.2036 (no gain for 0 iter)
Auto-refine: Changes in angles= 999 degrees; and in offsets= 999 Angstroms (no gain for 0 iter)
Estimating accuracies in the orientational assignment ...
3/ 3 sec ............................................................~~(,_,">
Auto-refine: Estimated accuracy angles= 1.484 degrees; offsets= 3.89171 Angstroms
CurrentResolution= 40.2036 Angstroms, which requires orientationSampling of at least 11.25 degrees for a particle of diameter 400 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 945
OrientationalSampling= 7.5 NrOrientations= 135
TranslationalSampling= 22.112 NrTranslations= 7
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 60480
OrientationalSampling= 3.75 NrOrientations= 1080
TranslationalSampling= 11.056 NrTranslations= 56
=============================
Expectation iteration 1
7.45/40.35 min ...........~~(,_,">[gpu271:3904135:0:3904135] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x80)
==== backtrace (tid:3904135) ====
0 0x000000000003c050 __sigaction() ???:0
1 0x00000000003d58ff getAllSquaredDifferencesCoarse<MlOptimiserCuda>() tmpxft_003a2465_00000000-6_cuda_ml_optimiser.cudafe1.cpp:0
2 0x00000000003d9fc4 accDoExpectationOneParticle<MlOptimiserCuda>() tmpxft_003a2465_00000000-6_cuda_ml_optimiser.cudafe1.cpp:0
3 0x00000000003db852 MlOptimiserCuda::doThreadExpectationSomeParticles() ???:0
4 0x000000000036b96f globalThreadExpectationSomeParticles() ???:0
5 0x000000000036b9e5 MlOptimiser::expectationSomeParticles() ml_optimiser.cpp:0
6 0x00000000000140b6 GOMP_parallel() ???:0
7 0x0000000000358a6e MlOptimiser::expectationSomeParticles() ???:0
8 0x0000000000130bad MlOptimiserMpi::expectation() ???:0
9 0x000000000014610c MlOptimiserMpi::iterate() ???:0
10 0x00000000000f39c2 main() ???:0
11 0x000000000002724a __libc_init_first() ???:0
12 0x0000000000027305 __libc_start_main() ???:0
13 0x00000000000f7251 _start() ???:0
=================================
[gpu271:3904135] *** Process received signal ***
[gpu271:3904135] Signal: Segmentation fault (11)
[gpu271:3904135] Signal code: (-6)
[gpu271:3904135] Failing at address: 0xf57ae003b9287
[gpu271:3904135] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x14e786e13050]
[gpu271:3904135] [ 1] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(+0x3d58ff)[0x5581b98b58ff]
[gpu271:3904135] [ 2] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(+0x3d9fc4)[0x5581b98b9fc4]
[gpu271:3904135] [ 3] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2)[0x5581b98bb852]
[gpu271:3904135] [ 4] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x2f)[0x5581b984b96f]
[gpu271:3904135] [ 5] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(+0x36b9e5)[0x5581b984b9e5]
[gpu271:3904135] [ 6] /lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x46)[0x14e786fcc0b6]
[gpu271:3904135] [ 7] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN11MlOptimiser24expectationSomeParticlesEll+0xd5e)[0x5581b9838a6e]
[gpu271:3904135] [ 8] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x1f2d)[0x5581b9610bad]
[gpu271:3904135] [ 9] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc)[0x5581b962610c]
[gpu271:3904135] [10] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(main+0x52)[0x5581b95d39c2]
[gpu271:3904135] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x14e786dfe24a]
[gpu271:3904135] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x14e786dfe305]
[gpu271:3904135] [13] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_start+0x21)[0x5581b95d7251]
[gpu271:3904135] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 3904135 on node gpu271 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Thanks!