Autopicking with GPU produces crazy FOMs when --highpass is used with CUDA 11.X

Open huwjenkins opened this issue 2 years ago • 3 comments

I have reproduced #700, narrowed it down to a specific option, and can reproduce it with the tutorial dataset.

RELION version: 4.0-beta-2-commit-9b23e5
OS: Ubuntu 20.04.4 LTS
GCC: 10.2.0
CUDA: 11.1.1
GPU: 2x NVIDIA A40

If I take the RELION-3.1 pre-calculated results (relion31_tutorial_precalculated_results.tar.gz) and run the following:

mpiexec -n 20 `which relion_autopick_mpi` --i CtfFind/job003/micrographs_ctf.star --odir AutoPick/gpu/ \
  --pickname autopick --ref Select/job009/class_averages.star --invert --ctf --ang 5 --shrink 0 --lowpass 20 \
  --highpass 400 --angpix_ref 3.54 --threshold 0 --min_distance 100 --max_stddev_noise -1 --gpu ""

The picked particles have crazy high FOMs. Here are the first 10 particles from 20170629_00021_frameImage_autopick.star:

loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
_rlnAutopickFigureOfMerit #4
_rlnAnglePsi #5
 2585.005898  1839.982364            3 2.660857e+06   135.000000
 1794.829422  3510.641197            3 21714.009766   125.000000
 1941.576482   959.500006            4 15739.113281   330.000000
  711.158828  1941.576482            1  4782.305176    65.000000
 2709.176486  2686.600016            2  3601.117432   305.000000
 2810.770605  2189.917660            1  3312.163086    50.000000
  925.635300  1919.000011            1  2640.237305    70.000000
  338.647061  1998.017659            3  1726.740601   205.000000
 2268.935307  1952.864717            4  1640.216553   160.000000
  778.888240  3510.641197            4  1427.013428   155.000000

If I remove --highpass 400, the FOMs for particles picked from the same micrograph look fine:

loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
_rlnAutopickFigureOfMerit #4
_rlnAnglePsi #5
  846.617652   180.611766            4     2.595012    40.000000
  304.782355   440.241179            4     2.517174   290.000000
  654.717651   146.747060            2     1.608543    75.000000
  428.952944   270.917649            2     1.390473   225.000000
  474.105885   417.664708            2     0.755993   170.000000
  395.088238   158.035295            2     0.562077   185.000000
  677.294122   327.358825            1     0.503259   210.000000
  553.123533   304.782355            2     0.263983   145.000000
  180.611766   417.664708            4     0.061370   115.000000
  124.170589   203.188236            3     0.984458   135.000000

Also, if I keep --highpass 400 but remove --gpu to pick on the CPU, the FOMs are OK:

loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
_rlnAutopickFigureOfMerit #4
_rlnAnglePsi #5
  846.617652   180.611766            4     2.630881    40.000000
  304.782355   440.241179            4     2.454238   290.000000
  654.717651   146.747060            2     1.726552    75.000000
  767.600005   756.311769            2     1.665849   295.000000
  699.870592   553.123533            2     1.631472   215.000000
  936.923535   417.664708            2     1.576335    25.000000
  496.682356   745.023534            2     1.515072    75.000000
  428.952944   270.917649            2     1.371704   225.000000
  654.717651   722.447063            2     1.117923   300.000000
  824.041181   530.547062            4     0.831689    65.000000
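A comparison like the one above can be scripted rather than done by eye. The following is a minimal sketch, not part of RELION: `read_fom`, `suspicious` and the limit of 100 are hypothetical names and values chosen for illustration, and the parser only handles the simple single-loop layout shown in this thread. The sample rows are copied from the tables above (two from the GPU run with --highpass 400, one from the run without it).

```python
def read_fom(star_text):
    """Return the _rlnAutopickFigureOfMerit values from simple STAR loop text."""
    labels = []
    foms = []
    for line in star_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and block/loop headers.
        if not fields or fields[0].startswith(("#", "data_")) or fields[0] == "loop_":
            continue
        # Column labels look like "_rlnCoordinateX #1".
        if fields[0].startswith("_rln"):
            labels.append(fields[0])
            continue
        # Data row: pick the FOM column by its label position.
        if "_rlnAutopickFigureOfMerit" in labels and len(fields) == len(labels):
            foms.append(float(fields[labels.index("_rlnAutopickFigureOfMerit")]))
    return foms


def suspicious(foms, limit=100.0):
    """Return FOMs above a sanity limit (the limit is an arbitrary assumption)."""
    return [f for f in foms if f > limit]


example = """loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
_rlnAutopickFigureOfMerit #4
_rlnAnglePsi #5
 2585.005898  1839.982364            3 2.660857e+06   135.000000
 1794.829422  3510.641197            3 21714.009766   125.000000
  846.617652   180.611766            4     2.595012    40.000000
"""

print(suspicious(read_fom(example)))
```

The healthy runs in this thread all report FOMs below 3, so any limit well above that separates the two cases. To check a real file, pass `open(path).read()` to `read_fom`.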

I have yet to try building against an older version of CUDA (as suggested in #700), but this was built with -DCUDA_ARCH='86', so using an older CUDA version will require changing that too. Any ideas why certain CUDA versions are problematic?
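The reason the architecture flag has to change is that each compute capability is only known to toolkits from a certain release onwards. The sketch below records the minimum toolkit versions as I understand them from NVIDIA's release notes (sm_70 from CUDA 9.0, sm_75 from 10.0, sm_80 from 11.0, sm_86 from 11.1). Treat the table as an assumption to verify, and `toolkit_supports` as a hypothetical helper rather than anything RELION or CUDA provides.

```python
# Minimum CUDA toolkit (major, minor) that can target each architecture.
# Values are assumptions taken from NVIDIA release notes; verify before relying on them.
MIN_TOOLKIT = {
    "70": (9, 0),   # Volta, e.g. V100
    "75": (10, 0),  # Turing
    "80": (11, 0),  # Ampere, e.g. A100
    "86": (11, 1),  # Ampere, e.g. A40
}


def toolkit_supports(arch, cuda_version):
    """Return True if the given toolkit version can compile for CUDA_ARCH=arch."""
    major_minor = tuple(int(part) for part in cuda_version.split(".")[:2])
    return major_minor >= MIN_TOOLKIT[arch]


for arch, version in [("86", "11.1.1"), ("86", "10.1"), ("70", "10.1")]:
    print(arch, version, toolkit_supports(arch, version))
```

Under these assumptions, a CUDA 10.1 build cannot use -DCUDA_ARCH='86', whereas a V100 (sm_70) can be targeted by both CUDA 10.1 and CUDA 11.X.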

huwjenkins, Jul 01 '22 12:07

OK, I can reproduce this with CUDA 11.3.1 on a GPU node on our cluster (V100 cards), so I can now roll back CUDA versions more easily. I just have to wait in the queue for a while...

huwjenkins, Jul 01 '22 16:07

OS: CentOS Linux release 7.9.2009
GCC: 8.3.0
CUDA: 10.1
RELION: 4.0-beta-2-commit-9b23e5

loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
_rlnAutopickFigureOfMerit #4
_rlnAnglePsi #5
  846.617652   180.611766            4     2.630840    40.000000
  304.782355   440.241179            4     2.454258   290.000000
  654.717651   146.747060            2     1.726552    75.000000
  767.600005   756.311769            2     1.665827   295.000000
  699.870592   553.123533            2     1.631469   215.000000
  936.923535   417.664708            2     1.576337    25.000000
  496.682356   745.023534            2     1.515058    75.000000
  428.952944   270.917649            2     1.371666   225.000000
  654.717651   722.447063            2     1.117947   300.000000
  824.041181   530.547062            4     0.831686    65.000000

This definitely appears to be a CUDA 11.X issue.

huwjenkins, Jul 01 '22 22:07

OS: CentOS Linux release 7.9.2009
GCC: 8.3.0
CUDA: 11.7.0
RELION: 4.0-beta-2-commit-9b23e5

loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
_rlnAutopickFigureOfMerit #4
_rlnAnglePsi #5
 2585.005898  1839.982364            3 2.669658e+06   135.000000
 1794.829422  3510.641197            3 21775.601562   125.000000
 1941.576482   959.500006            4 15858.632812   330.000000
  711.158828  1941.576482            1  4782.303711    65.000000
 2709.176486  2686.600016            2  3596.824219   305.000000
 2810.770605  2189.917660            1  3312.802246    50.000000
  925.635300  1919.000011            1  2640.374756    70.000000
  338.647061  1998.017659            3  1726.810059   205.000000
 2268.935307  1952.864717            4  1640.391113   160.000000
  778.888240  3510.641197            4  1426.978394   155.000000

The only difference in the sbatch script was the module load line: RELION built with CUDA 11.7 instead of CUDA 10.1.

huwjenkins, Jul 01 '22 22:07