
Issues with trying to get a cryoSPARC dataset running in RELION

Open mbelouso opened this issue 7 months ago • 11 comments

So I hate to write this one, but many of my students are refusing to use RELION, and I'm not convinced that the results they get from other software are always the optimal solution. So I was trying to get a dataset running in RELION after it had been fully processed in cryoSPARC.

Anyway, the issue is that the same STAR file seems to run fine through Class2D (C2D) but fails in Class3D (C3D). I suspect there is something wrong with how I cooked up the STAR file; for example, do I need the _rlnCtfDataAreCtfPremultiplied flag? I did try adding that in, but it didn't help. The failure is also consistent across different machines.
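(For reference, what I tried was adding the flag as an extra boolean column at the end of the data_optics table, something like this, with 0 meaning the data are not premultiplied:)

data_optics

loop_
_rlnVoltage #1
_rlnImagePixelSize #2
_rlnSphericalAberration #3
_rlnAmplitudeContrast #4
_rlnOpticsGroup #5
_rlnImageSize #6
_rlnImageDimensionality #7
_rlnOpticsGroupName #8
_rlnCtfDataAreCtfPremultiplied #9
  300.000000     0.733000     2.700000     0.100000            1          360            2 ptcls_tilt_group0000            0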

Essentially it makes it through the first iteration and then fails at either the maximization step or the expectation step of the second iteration.

Any help would be appreciated, as this issue is only going to come up more and more: I think RELION still does the best job, but many people now just process everything start to finish in cryoSPARC, so I would like to get to the bottom of it.

Environment:

  • OS: Mint 22.1
  • MPI runtime: (Open MPI) 4.1.6
  • RELION version: 5.0.0-commit-1fdfb9 (Precision: BASE=double, CUDA-ACC=single)
  • Memory: 512 GB
  • GPU: RTX 4000 Ada

Dataset:

  • Box size: 360 px
  • Pixel size: 0.733 Å/px
  • Number of particles: ~500,000
  • Description: GPCR

Job options:

  • Type of job: Class3D (C3D)
  • Number of MPI processes: 5
  • Number of threads: 1
relion_refine_mpi --o Class3D/job001/run --i J55_newoptics.star --ref J55_010_volume_map.mrc --firstiter_cc --trust_ref_size --ini_high 16 --dont_combine_weights_via_disc --pool 30 --pad 1  --ctf --iter 25 --tau2_fudge 4 --particle_diameter 170 --fast_subsets  --K 3 --flatten_solvent --zero_mask --strict_highres_exp 4 --blush  --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale  --j 1 --gpu ""  --pipeline_control Class3D/job001/

Error message:

munmap_chunk(): invalid pointer
[piastri:1781041] *** Process received signal ***
corrupted double-linked list
[piastri:1781043] *** Process received signal ***
[piastri:1781043] Signal: Aborted (6)
[piastri:1781043] Signal code:  (-6)
[piastri:1781041] Signal: Aborted (6)
[piastri:1781041] Signal code:  (-6)
[piastri:1781041] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7936c9c45330]
[piastri:1781041] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7936c9c9eb2c]
[piastri:1781041] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7936c9c4527e]
[piastri:1781041] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7936c9c288ff]
[piastri:1781041] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x297b6)[0x7936c9c297b6]
[piastri:1781041] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0xa8ff5)[0x7936c9ca8ff5]
[piastri:1781041] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0xa947c)[0x7936c9ca947c]
[piastri:1781041] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_free+0xca)[0x7936c9caddfa]
[piastri:1781041] [ 8] /apps/relion/build/bin/relion_refine_mpi(_ZN13MultidimArrayIdE14coreDeallocateEv+0x65)[0x6513f6f9dd65]
[piastri:1781041] [ 9] /apps/relion/build/bin/relion_refine_mpi(_ZN11MlWsumModel4packER13MultidimArrayIdERiS3_b+0x53a)[0x6513f718ef9a]
[piastri:1781041] [10] /apps/relion/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi22combineAllWeightedSumsEv+0x624)[0x6513f6fc7e34]
[piastri:1781041] [11] /apps/relion/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x665)[0x6513f6fe2c75]
[piastri:1781041] [12] /apps/relion/build/bin/relion_refine_mpi(main+0x81)[0x6513f6f8f021]
[piastri:1781041] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7936c9c2a1ca]
[piastri:1781041] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7936c9c2a28b]
[piastri:1781041] [15] /apps/relion/build/bin/relion_refine_mpi(_start+0x25)[0x6513f6f92385]
[piastri:1781041] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 1781041 on node piastri exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Example of input STAR file:

data_optics

loop_ 
_rlnVoltage #1 
_rlnImagePixelSize #2 
_rlnSphericalAberration #3 
_rlnAmplitudeContrast #4 
_rlnOpticsGroup #5 
_rlnImageSize #6 
_rlnImageDimensionality #7 
_rlnOpticsGroupName #8 
  300.000000     0.733000     2.700000     0.100000            1          360            2 ptcls_tilt_group0000 
  300.000000     0.733000     2.700000     0.100000            2          360            2 ptcls_tilt_group0001 
  300.000000     0.733000     2.700000     0.100000            3          360            2 ptcls_tilt_group0002 
  300.000000     0.733000     2.700000     0.100000            4          360            2 ptcls_tilt_group0003 

data_particles

loop_ 
_rlnImageName #1 
_rlnAngleRot #2 
_rlnAngleTilt #3 
_rlnAnglePsi #4 
_rlnOriginXAngst #5 
_rlnOriginYAngst #6 
_rlnDefocusU #7 
_rlnDefocusV #8 
_rlnDefocusAngle #9 
_rlnPhaseShift #10 
_rlnCtfBfactor #11 
_rlnOpticsGroup #12 
_rlnRandomSubset #13 
_rlnClassNumber #14 
000001@J54/extract/009103141063873076924_FoilHole_29471620_Data_29469929_0_20250429_122313_EER_patch_aligned_doseweighted_particles.mrcs   110.093651   102.101463   111.729355     -0.37108     0.364595 22905.933594 22860.035156   268.941101     0.000000     0.000000            1            2            1 
000002@J54/extract/009103141063873076924_FoilHole_29471620_Data_29469929_0_20250429_122313_EER_patch_aligned_doseweighted_particles.mrcs    -30.59649    98.797394    -96.29035     0.618469     -0.70093 23362.289062 23316.390625   268.941101     0.000000     0.000000            1            1            1 

mbelouso avatar Jun 12 '25 04:06 mbelouso

My best guess is that the particles are normalized and/or grouped differently in the two programs. Re-extracting in RELION might help.

Because we don't use CS ourselves and CS is a huge black box, this is very hard to investigate, if not impossible. Perhaps asking on the CCPEM mailing list might be better, because many people there seem to go back and forth between the two programs.

biochem-fan avatar Jun 12 '25 04:06 biochem-fan

Yeah, I normally go back and forward between the two programs too, but I usually start in RELION and only import particle stacks into cSPARC if I have to..... So the:

corrupted double-linked list

is not informative?

mbelouso avatar Jun 12 '25 05:06 mbelouso

So for some more troubleshooting:

If I run the command:

`which relion_refine_mpi` --o Class3D/job001/run --i J55_newoptics.star --ref J55_010_volume_map.mrc --firstiter_cc --trust_ref_size --ini_high 16 --dont_combine_weights_via_disc --pool 30 --pad 1  --ctf --iter 25 --tau2_fudge 4 --particle_diameter 170 --fast_subsets  --K 1 --flatten_solvent --zero_mask --strict_highres_exp 4 --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale  --j 1 --gpu ""  --pipeline_control Class3D/job001/

So basically the same command as before, but with Blush turned off; it fails at the maximisation step with this error:

corrupted size vs. prev_size
[piastri:1834812] *** Process received signal ***
[piastri:1834812] Signal: Aborted (6)
[piastri:1834812] Signal code:  (-6)

mbelouso avatar Jun 12 '25 05:06 mbelouso

It suggests some metadata are broken, but I don't know which. For example, rlnMicrographName and rlnGroupNumber are missing; these are used to make groups for scaling and sigma estimation. I don't know if they are essential, though.

Are you using cs2star or something like it? Asking the author or other users of the script might help.
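
If you want to try patching the missing columns in yourself, a rough sketch like this might work (using the starfile Python package; paths are hypothetical and this is untested):

import starfile

# read both blocks of the STAR file ({'optics': ..., 'particles': ...})
star = starfile.read('J55_newoptics.star')
particles = star['particles']

# derive a per-micrograph name from the extracted stack each particle came from,
# so RELION can form groups for scaling and sigma estimation
particles['rlnMicrographName'] = particles['rlnImageName'].str.split('@').str[1]

starfile.write(star, 'J55_with_groups.star')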

biochem-fan avatar Jun 12 '25 05:06 biochem-fan

It suggests some metadata are broken, but I don't know which. For example, rlnMicrographName and rlnGroupNumber are missing; these are used to make groups for scaling and sigma estimation. I don't know if they are essential, though.

Are you using cs2star or something like it? Asking the author or other users of the script might help.

Yeah, and to get to the bottom of it, you are right: it must have been something to do with particle normalization, as when I ran an Extract job in RELION on the curated cryoSPARC .star file, the 3D classification ran with no problems....

cheers

matt B

mbelouso avatar Jun 12 '25 06:06 mbelouso

What happens if you take a STAR file re-extracted by RELION (which contains rlnMicrographName and rlnGroupNumber) and swap particles (MRCS files) with those from CS? Then the full metadata are present but normalization is different. You can also try the other way, i.e., a cs2star output STAR file (which lacks some columns) but with RELION's particles.
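
To see exactly which metadata differ between the two files, a quick check along these lines might help (hypothetical paths, using the starfile Python package):

import starfile

# RELION re-extracted STAR vs the cs2star-derived one (hypothetical paths)
rel = starfile.read('Extract/job002/particles.star')['particles']
cs = starfile.read('J55_newoptics.star')['particles']

# columns present after RELION extraction but missing from the CS-derived file
print(sorted(set(rel.columns) - set(cs.columns)))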

biochem-fan avatar Jun 12 '25 06:06 biochem-fan

I'll give it a try and get back to you.

mbelouso avatar Jun 13 '25 04:06 mbelouso

What happens if you take a STAR file re-extracted by RELION (which contains rlnMicrographName and rlnGroupNumber) and swap particles

I'm not sure Extract writes rlnGroupNumber if it doesn't already exist? When trying to find the micrograph with bad particles behind the "ERROR!!! zero sum of weights...." message, I've had to check the run_itNNN_data.star file from the job that crashed, as rlnGroupNumber gets assigned when the particles are read in.

huwjenkins avatar Jun 13 '25 17:06 huwjenkins

I just tested with the pre-calculated results from the RELION-5 tutorial: if the input to Class3D is a particle STAR file with no rlnMicrographName column, all particles are assigned to rlnGroupNumber 1....
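
You can confirm what a given run ended up with along these lines (hypothetical job path, using the starfile Python package):

import starfile

# data STAR written after the first iteration of the crashed job (hypothetical path)
data = starfile.read('Class3D/job001/run_it001_data.star')['particles']

# with proper grouping you expect roughly one group per micrograph,
# not a single group holding every particle
print(data['rlnGroupNumber'].value_counts())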

huwjenkins avatar Jun 13 '25 18:06 huwjenkins

particle STAR file with no rlnMicrographName column, all particles are assigned to rlnGroupNumber 1....

This is likely the cause. Scaling is essentially disabled.

@mbelouso Additional things to check:

  • make sure you set "the reference is NOT in the right gray scale" in the GUI.
  • where did your reference come from? If it is not from RELION, making one with relion_reconstruct (if the angles and shifts from CS are compatible and good enough) is also a good idea; see the sketch below.
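
Something along these lines should work (flags from memory; check relion_reconstruct --help for the exact options):

relion_reconstruct --i J55_newoptics.star --o ref_from_cs.mrc --ctf --sym C1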

biochem-fan avatar Jun 13 '25 23:06 biochem-fan

particle STAR file with no rlnMicrographName column, all particles are assigned to rlnGroupNumber 1....

This is likely the cause. Scaling is essentially disabled.

@mbelouso Additional things to check:

  • make sure you set "the reference is NOT in the right gray scale" in the GUI.
  • where did your reference come from? If it is not from RELION, making one with relion_reconstruct (if the angles and shifts from CS are compatible and good enough) is also a good idea.

So.... here is the interesting thing. I did as you suggested: I ran relion_reconstruct first (referencing the cSPARC-extracted particles) and used that as the initial model for C3D. With the original cSPARC-extracted files it then runs fine. However, in my first attempt (the one that reproduced the errors), I did set "reference is NOT in the right gray scale"....

So it certainly seems to be something to do with particle normalization and map scaling.

mbelouso avatar Jun 14 '25 01:06 mbelouso