Error with relion5 using 2D classification on aws g6 instances

Open • Cookiemaster33 opened this issue 1 year ago • 1 comment

Hi there, I am using RELION 5, run via SGE/qsub on AWS clusters.

So far everything ran fine on g5 instances, which use NVIDIA A10G Tensor Core GPUs. We have now switched to g6 instances, which use NVIDIA L4 Tensor Core GPUs. During 2D classification we get the error: "failed to create cufft plan".

Any idea what could be wrong?

Thanks and best

Toby

Environment:

  • OS: Ubuntu 18.04.5 LTS
  • MPI runtime: [e.g. OpenMPI 2.0.1]
  • RELION version: Relion 5.0
  • Memory: 192 GB
  • GPU: NVIDIA L4 Tensor Core GPU

Dataset:

  • Box size: 180 pix
  • Pixel size: 0.71 Å/px
  • Number of particles: 50,000

Job options:

  • Type of job: Class2D
  • Number of MPI processes: 1
  • Number of threads: 12
  • Full command: which relion_refine --o Class2D/job010/run --grad --class_inactivity_threshold 0.1 --grad_write_iter 10 --iter 100 --i Extract/job006/particles.star --dont_combine_weights_via_disc --pool 30 --pad 2 --ctf --tau2_fudge 2 --particle_diameter 198.0 --K 25 --flatten_solvent --zero_mask --center_classes --oversampling 1 --psi_step 12 --offset_range 5 --offset_step 2 --norm --scale --j 12 --gpu "0,1,2,3" --pipeline_control Class2D/job010/

Error message:

in: /relion/src/projector.cpp, line 362
ERROR: failed to create cufft plan
=== Backtrace ===
/opt/relion/bin/relion_refine(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x77) [0x56106c48bbd7]
/opt/relion/bin/relion_refine(_ZN9Projector26computeFourierTransformMapER13MultidimArrayIdES2_iibbiPKS1_b+0x36a3) [0x56106c52a8c3]
/opt/relion/bin/relion_refine(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x901) [0x56106c69d271]
/opt/relion/bin/relion_refine(_ZN11MlOptimiser16expectationSetupEv+0x5a) [0x56106c4b16ea]
/opt/relion/bin/relion_refine(_ZN11MlOptimiser11expectationEv+0x34) [0x56106c4e1824]
/opt/relion/bin/relion_refine(_ZN11MlOptimiser7iterateEv+0x37a) [0x56106c4fd63a]
/opt/relion/bin/relion_refine(main+0x51) [0x56106c476c91]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x14c1b6623bf7]
/opt/relion/bin/relion_refine(_start+0x2a) [0x56106c47a5ea]

ERROR: failed to create cufft plan

Cookiemaster33, Jun 20 '24 14:06

Which version of CUDA did you use to compile RELION? Is it compatible with Ubuntu 18.04.5 LTS? That is a very old OS and you shouldn't use it.

Did you specify CUDA_ARCH? (You shouldn't, if you want to share the binary with different GPUs).

biochem-fan, Jun 20 '24 21:06
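
To illustrate why the CUDA version and CUDA_ARCH questions above matter when moving from g5 to g6, here is a minimal Python sketch of the general CUDA compatibility rules. It is not a confirmed diagnosis of this issue. The compute capabilities (8.6 for the A10G, 8.9 for the L4) and the CUDA 11.8 cutoff for native sm_89 support are assumptions taken from NVIDIA's public documentation, not from this thread, and the function names are hypothetical.

```python
from typing import Tuple

Capability = Tuple[int, int]

# Assumed compute capabilities (from NVIDIA's documentation, not from the thread).
DEVICE_CAPABILITY = {
    "A10G": (8, 6),  # g5 instances, Ampere
    "L4": (8, 9),    # g6 instances, Ada Lovelace
}

# Assumed first CUDA toolkit release that can generate native sm_89 code.
MIN_TOOLKIT_FOR_SM89 = (11, 8)


def cubin_runs_on(compiled_for: Capability, device: Capability) -> bool:
    """Binary-compatibility rule for native GPU code (cubin).

    Native code runs on a device with the same major compute capability
    and an equal or higher minor revision. PTX JIT compilation is ignored.
    """
    return compiled_for[0] == device[0] and compiled_for[1] <= device[1]


def toolkit_can_target_l4(toolkit: Tuple[int, int]) -> bool:
    """True if the toolkit version is new enough to build native sm_89 code."""
    return toolkit >= MIN_TOOLKIT_FOR_SM89


if __name__ == "__main__":
    a10g = DEVICE_CAPABILITY["A10G"]
    l4 = DEVICE_CAPABILITY["L4"]
    print("A10G build loads on L4:", cubin_runs_on(a10g, l4))
    print("L4 build loads on A10G:", cubin_runs_on(l4, a10g))
    print("CUDA 11.4 can target L4 natively:", toolkit_can_target_l4((11, 4)))
```

Under these assumptions, a binary built only for the A10G would still be loadable on the L4, so the toolkit version (and the cuFFT library that ships with it) is worth checking alongside CUDA_ARCH.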