Autopicking on GPU produces crazy FOMs when --highpass is used with CUDA 11.X
I have reproduced #700 but have narrowed this down to a specific option and can reproduce with the tutorial dataset.
RELION version: 4.0-beta-2-commit-9b23e5 OS: Ubuntu 20.04.4 LTS GCC: 10.2.0 CUDA: 11.1.1 GPU: 2x NVIDIA A40
If I take the RELION-3.1 pre-calculated results (relion31_tutorial_precalculated_results.tar.gz) and run the following:
mpiexec -n 20 `which relion_autopick_mpi` --i CtfFind/job003/micrographs_ctf.star --odir AutoPick/gpu/ \
    --pickname autopick --ref Select/job009/class_averages.star --invert --ctf --ang 5 --shrink 0 --lowpass 20 \
    --highpass 400 --angpix_ref 3.54 --threshold 0 --min_distance 100 --max_stddev_noise -1 --gpu ""
The particles picked have crazy high FOMs. Here are the first 10 particles from 20170629_00021_frameImage_autopick.star:
loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
_rlnAutopickFigureOfMerit #4
_rlnAnglePsi #5
2585.005898 1839.982364 3 2.660857e+06 135.000000
1794.829422 3510.641197 3 21714.009766 125.000000
1941.576482 959.500006 4 15739.113281 330.000000
711.158828 1941.576482 1 4782.305176 65.000000
2709.176486 2686.600016 2 3601.117432 305.000000
2810.770605 2189.917660 1 3312.163086 50.000000
925.635300 1919.000011 1 2640.237305 70.000000
338.647061 1998.017659 3 1726.740601 205.000000
2268.935307 1952.864717 4 1640.216553 160.000000
778.888240 3510.641197 4 1427.013428 155.000000
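For a quick look at how bad the values get, the FOM column can be pulled straight out of the coordinate file. A minimal shell sketch, assuming the file sits somewhere under the --odir given above (coordinate files mirror the micrograph directory structure, hence the find):

# data rows have exactly 5 whitespace-separated fields, which distinguishes
# them from the loop_ header lines; field 4 is _rlnAutopickFigureOfMerit
find AutoPick/gpu -name '20170629_00021_frameImage_autopick.star' \
    -exec awk 'NF==5 {print $4}' {} \; | sort -g | tail -5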
If I remove --highpass 400, the FOMs for particles picked from the same micrograph look fine:
loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
_rlnAutopickFigureOfMerit #4
_rlnAnglePsi #5
846.617652 180.611766 4 2.595012 40.000000
304.782355 440.241179 4 2.517174 290.000000
654.717651 146.747060 2 1.608543 75.000000
428.952944 270.917649 2 1.390473 225.000000
474.105885 417.664708 2 0.755993 170.000000
395.088238 158.035295 2 0.562077 185.000000
677.294122 327.358825 1 0.503259 210.000000
553.123533 304.782355 2 0.263983 145.000000
180.611766 417.664708 4 0.061370 115.000000
124.170589 203.188236 3 0.984458 135.000000
Also, if I keep --highpass 400 but remove --gpu to pick on the CPU, the results look fine:
loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
_rlnAutopickFigureOfMerit #4
_rlnAnglePsi #5
846.617652 180.611766 4 2.630881 40.000000
304.782355 440.241179 4 2.454238 290.000000
654.717651 146.747060 2 1.726552 75.000000
767.600005 756.311769 2 1.665849 295.000000
699.870592 553.123533 2 1.631472 215.000000
936.923535 417.664708 2 1.576335 25.000000
496.682356 745.023534 2 1.515072 75.000000
428.952944 270.917649 2 1.371704 225.000000
654.717651 722.447063 2 1.117923 300.000000
824.041181 530.547062 4 0.831689 65.000000
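A per-run sanity check therefore reduces to comparing the maximum FOM in each output directory. A sketch, where AutoPick/cpu is a hypothetical --odir for the CPU run:

# print the largest FOM per run; the GPU + --highpass blow-up is obvious at a glance
for run in gpu cpu; do
    printf '%s max FOM: ' "$run"
    find AutoPick/$run -name '20170629_00021_frameImage_autopick.star' \
        -exec awk 'NF==5 {print $4}' {} \; | sort -g | tail -1
done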
I have yet to try building against an older version of CUDA (as suggested in #700), but this build used -DCUDA_ARCH='86', so using older CUDA versions will also require changing this (the A40's sm_86 target is only supported from CUDA 11.1 onwards). Any ideas why certain CUDA versions are problematic?
OK, I can reproduce with CUDA 11.3.1 on a GPU node on our cluster (V100 cards), so I can roll back CUDA versions more easily now. I just have to wait in the queue for a while...
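For reference, a rollback build on the V100 node might look like the following sketch; the module name is site-specific and hypothetical, and CUDA_ARCH drops to 70 because V100 is sm_70 (CUDA 10.x cannot target sm_86 anyway):

# hypothetical rebuild of RELION against CUDA 10.1 for V100 cards
module load cuda/10.1                       # site-specific module name
mkdir build-cuda10 && cd build-cuda10
cmake -DCUDA_ARCH=70 -DCMAKE_BUILD_TYPE=Release ..
make -j 16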
OS: CentOS Linux release 7.9.2009 GCC: 8.3.0 CUDA: 10.1 RELION: 4.0-beta-2-commit-9b23e5
With the CUDA 10.1 build, the FOMs for the same micrograph look sensible and closely match the CPU run above:
loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
_rlnAutopickFigureOfMerit #4
_rlnAnglePsi #5
846.617652 180.611766 4 2.630840 40.000000
304.782355 440.241179 4 2.454258 290.000000
654.717651 146.747060 2 1.726552 75.000000
767.600005 756.311769 2 1.665827 295.000000
699.870592 553.123533 2 1.631469 215.000000
936.923535 417.664708 2 1.576337 25.000000
496.682356 745.023534 2 1.515058 75.000000
428.952944 270.917649 2 1.371666 225.000000
654.717651 722.447063 2 1.117947 300.000000
824.041181 530.547062 4 0.831686 65.000000
This definitely appears to be a CUDA 11.X issue.
OS: CentOS Linux release 7.9.2009 GCC: 8.3.0 CUDA: 11.7.0 RELION: 4.0-beta-2-commit-9b23e5
With the CUDA 11.7.0 build, the same command again produces absurd FOMs:
loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
_rlnAutopickFigureOfMerit #4
_rlnAnglePsi #5
2585.005898 1839.982364 3 2.669658e+06 135.000000
1794.829422 3510.641197 3 21775.601562 125.000000
1941.576482 959.500006 4 15858.632812 330.000000
711.158828 1941.576482 1 4782.303711 65.000000
2709.176486 2686.600016 2 3596.824219 305.000000
2810.770605 2189.917660 1 3312.802246 50.000000
925.635300 1919.000011 1 2640.374756 70.000000
338.647061 1998.017659 3 1726.810059 205.000000
2268.935307 1952.864717 4 1640.391113 160.000000
778.888240 3510.641197 4 1426.978394 155.000000
The only difference between the sbatch scripts was the module load line: RELION built with CUDA 11.7 vs. 10.1.
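So the A/B test reduces to a one-line change in the job script. A sketch with hypothetical module names (the autopick command is the same one given at the top):

#!/bin/bash
#SBATCH --ntasks=20
#SBATCH --gres=gpu:2

# the only line that differs between the two runs
module load relion/4.0-beta-2_cuda-10.1     # sane FOMs
#module load relion/4.0-beta-2_cuda-11.7    # huge FOMs when --highpass is used

mpiexec -n 20 `which relion_autopick_mpi` --i CtfFind/job003/micrographs_ctf.star \
    --odir AutoPick/gpu/ --pickname autopick --ref Select/job009/class_averages.star \
    --invert --ctf --ang 5 --shrink 0 --lowpass 20 --highpass 400 --angpix_ref 3.54 \
    --threshold 0 --min_distance 100 --max_stddev_noise -1 --gpu ""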