
relion 5 refinement failed with memory issue

dhiraj82 opened this issue 2 years ago · 18 comments

Hi, I am trying to do 3D auto-refine with RELION 5 and am getting the following error. I saw a similar issue with RELION 3, but I am not sure whether the same fix applies to RELION 5.

Thank you Dhiraj

ERROR: out of memory in /home/lvantol/relion5/relion/src/acc/cuda/custom_allocator.cuh at line 436 (error-code 2)
in: /home/lvantol/relion5/relion/src/acc/cuda/cuda_settings.h, line 65

ERROR:
A GPU-function failed to execute.
If this occured at the start of a run, you might have GPUs which
are incompatible with either the data or your installation of relion.
If you
      -> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
       and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5), or
       a similar mismatch on AMD or Intel GPUs, this may happen.
      -> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs at least NVIDIA
       GPUs with compute 5.0 or AMD MI GPUs with archtiecture gfx906.
       You may be trying to use a GPU architectures older than these.
       If you have multiple generations, try specifying --gpu <X>
       with X=0. Then try X=1 in a new run, and so on. The numbering of
       GPUs may not be obvious from the driver or intuition. For a list
       of NVIDIA and AMD GPU compute generations and architectures, see
       en.wikipedia.org/wiki/CUDA#Version_features_and_specifications and
       docs.amd.com/bundle/Hardware_and_Software_Reference_Guide/page/Hardware_and_Software_Support.html
      -> ARE USING DOUBLE-PRECISION GPU CODE: relion has been written so
       as to not require this, and may thus have unforeseen requirements
       when run in this mode. If you think it is nonetheless necessary,
       please consult the developers with this error.
If this occurred at the middle or end of a run, it might be that
      -> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is
       subject to many restrictions, and relion is written to work within
       common restraints. If you have exotic data or settings, unexpected
       configurations may occur. See also above point regarding
       double precision.
If none of the above applies, please report the error to the relion
developers at github.com/3dem/relion/issues.

=== Backtrace ===
/opt/relion5beta/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x67) [0x493557]
/opt/relion5beta/bin/relion_refine_mpi() [0x6e89d5]
/opt/relion5beta/bin/relion_refine_mpi(_ZN14MlDeviceBundle24setupTunableSizedObjectsEm+0x51c) [0x6ec8cc]
/opt/relion5beta/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x174a) [0x4b54da]
/opt/relion5beta/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xb6) [0x4c1306]
/opt/relion5beta/bin/relion_refine_mpi(main+0x52) [0x47f222]
/lib64/libc.so.6(__libc_start_main+0xe5) [0x7fc0b3d87d85]
/opt/relion5beta/bin/relion_refine_mpi(_start+0x2e) [0x482c5e]
==================

dhiraj82 · Dec 20 '23 19:12

Please respect our issue template.

Without details of your dataset and your hardware, we cannot diagnose your problem.

biochem-fan · Dec 21 '23 00:12

Sorry, I missed it previously. Please find the system and dataset information below. I know the dataset is not the problem, as I can process the same data on a different computer managed by the university (a cluster). Please let me know if more information is needed. Thank you, Dhiraj

Environment:

OS: Rocky Linux 8.8
MPI runtime: mpich-3.2.1
RELION version: RELION 5.0-beta-0-commit-70875e
Memory: 500 GB
GPU: NVIDIA RTX A6000 (4 GPUs, 48 GB each)

Dataset:

Box size: 320
Pixel size: 1.61
Number of particles: ~100,0000

Job options:

Type of job: Refine3D (masked)
Number of MPI processes: 3
Number of threads: 12

Command: which relion_refine_mpi --continue Refine3D/job006/run_it000_optimiser.star --o Refine3D/job008/run --dont_combine_weights_via_disc --no_parallel_disc_io --preread_images --pool 3 --pad 1 --particle_diameter 160 --j 12 --gpu "0,1,2" --pipeline_control Refine3D/job008/

dhiraj82 · Dec 21 '23 01:12

Does the problem happen for other datasets on this machine?

With a box size of 320 and 48 GB of VRAM, the GPU memory should be more than enough (unless you are running other programs simultaneously; check nvidia-smi and reboot the computer in case of doubt).
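For reference, one way to check this from the command line (standard NVIDIA driver tooling, nothing RELION-specific):

    # show how much memory is already in use on each card
    nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
    # list any processes currently holding GPU memory
    nvidia-smi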

--gpu "0,1,2": Why did you put this? Do you really want to use only 3 out of 4 GPUs? If you want to use all four GPUs, you can just say "Use GPU: yes" without specifying any numbers. For best performance, you should have an odd number of MPI processes (3, 5, 9, ... etc) because the first MPI rank only coordinates jobs and does not perform real computation. You should place 1, 2 or 3 processes per GPU. More processes mean more VRAM is required. We don't recommend using multiple GPUs per MPI process. In your current form, one MPI uses 3 GPUs (comma separates threads, colon separates processes).

biochem-fan · Dec 21 '23 04:12

There is no other job running, and it happens with all datasets. I wanted to try the Blush algorithm, so we recently installed RELION 5, and ever since then we cannot get beyond 2D classification, as the later jobs require MPI. We used CUDA_ARCH=8.6 during compilation; could this be the problem? RELION 4 was installed by the vendor we got our computer from (Exxact), so we don't know how they installed it.

Thanks Dhiraj

dhiraj82 · Dec 21 '23 05:12

CUDA_ARCH=8.6 during compilation

You must use 86 not 8.6. If 86 still fails, please try without specifying CUDA_ARCH.

When you try this, you have to completely remove your build directory. Rerunning cmake and make is not necessarily enough.
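A clean rebuild could look roughly like the following (the source path is a placeholder, and the install prefix is taken from the backtrace above; adjust both to your setup):

    # start from a completely fresh build directory so no stale CUDA_ARCH setting survives
    cd /path/to/relion                      # wherever the RELION 5 sources were cloned
    rm -rf build && mkdir build && cd build
    cmake -DCUDA_ARCH=86 -DCMAKE_INSTALL_PREFIX=/opt/relion5beta ..
    make -j 8 && make install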

biochem-fan · Dec 21 '23 05:12

I tried both without specifying CUDA_ARCH and with 86, and it's still not working. I am getting the same error.

dhiraj82 · Dec 22 '23 14:12

For me, it may have been due to insufficient VRAM. After modifying the number of processes, I succeeded.

HNUWangpx · Dec 25 '23 10:12

VRAM should not be an issue for me, as I have four GPUs with 48 GB each. I can do the same calculation on a different computer with only four 11 GB GPUs. Also, I am using only 3 MPI processes; I cannot go any lower than that.

dhiraj82 · Dec 25 '23 18:12

Update: the problem is solved. The problem was in the GPU use option. At some point I had mistakenly started using values like 0,1,2,... and so on; that worked on our cluster with its different architecture, but not on our own computer. On our computer I have to use 0:0:0,1:1:1, and now the calculation runs. I still have an issue, though: if I write 0:0:0,1:1:1,2:2:2, RELION uses almost the entire GPU memory of GPU 0 only, and not of the other GPUs. Is there any way to distribute the GPU memory usage across the other GPUs as well? The RELION 5 manual says to write 0:1:2:3 for the GPU use option, but that fails on my computer (I tried 0:1:2 with 3 MPI processes, and even 0:1 or 0:0,1:1, but those failed again). [screenshot of nvidia-smi output]

Thank you Dhiraj

dhiraj82 · Dec 27 '23 03:12

Please don't use screenshots to show text information.

Which command line does your screenshot correspond to? It appears that you used 3 MPI processes (i.e. 2 working processes) with 0,1,2,3:0,1,2,3, because I see two process IDs (1480046, 1480047), each using all four GPUs (which is not ideal).

Did you read my earlier comment https://github.com/3dem/relion/issues/1056#issuecomment-1865457090? Do you want to use 3 GPUs or 4 GPUs? If you want and can use 4 GPUs, you should specify 5 MPI processes and leave the --gpu argument blank to let RELION use all GPUs by default. This is the same as saying 0:1:2:3.
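As a sketch, reusing the options from your earlier job006/job008 command (so treat the non-GPU parameters as placeholders to be taken from your own job), that would look something like:

    mpirun -n 5 `which relion_refine_mpi` --continue Refine3D/job006/run_it000_optimiser.star \
        --o Refine3D/job008/run --dont_combine_weights_via_disc --no_parallel_disc_io --preread_images \
        --pool 3 --pad 1 --particle_diameter 160 --j 12 --gpu "" --pipeline_control Refine3D/job008/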

biochem-fan · Dec 27 '23 06:12

Hi, sorry for the screenshot. What you suggested (0,1,2,3:0,1,2,3) gives me the memory error. I tried 0:0 and 0:0:0,1:1:1 and other combinations, and I get the memory error. The only thing that works is 0:0:0 or 0:0:0,1:1:1,2:2:2. Even then it uses all four GPUs, and one GPU's power consumption is high with its volatile GPU usage at almost 100%. This use of all four GPUs' VRAM, even when asking for only one GPU, is not new on this machine; even when we use cryoSPARC it does the same thing. I am concerned about so much power going through only one GPU, and I am worried it will damage the GPUs. Thanks, Dhiraj

dhiraj82 · Dec 27 '23 14:12

Even then it uses all four GPUs, and one GPU's power consumption is high with its volatile GPU usage at almost 100%. This use of all four GPUs' VRAM, even when asking for only one GPU, is not new on this machine; even when we use cryoSPARC it does the same thing.

Something is very strange. Because the problem occurs in other programs as well, this is probably not a RELION issue. You should check driver installation etc.
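A couple of quick, non-RELION checks that could help narrow this down (standard nvidia-smi queries):

    # driver version reported for each GPU
    nvidia-smi --query-gpu=index,name,driver_version --format=csv
    # which processes actually hold memory, and on which device
    nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv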

I am concerned about too much power usage by only one gpu. I am worried if its going to damage the GPUs.

As long as it is cooled properly, you don't have to worry about damage. However, the imbalance means your calculation will be inefficient and slow.

biochem-fan · Dec 27 '23 14:12

I am having the same issue again, now with 3D refinement. 3D classification works, but 3D refinement gives me the same memory error. I tried all the GPU-usage combinations suggested above, as well as whatever was working for 3D classification, but all of them give the memory error.

dhiraj82 · Dec 27 '23 15:12

At the beginning of run.out, RELION prints out the association of GPUs with MPI processes and threads. Please paste it as text, together with your FULL command line in note.txt.

biochem-fan · Dec 27 '23 22:12

run.out output:

RELION version: 5.0-beta-0-commit-90d239
Precision: BASE=double, CUDA-ACC=single

=== RELION MPI setup ===

  • Number of MPI processes = 3
  • Number of threads per MPI process = 12
  • Total number of threads therefore = 36
  • Leader (0) runs on host = r124087.iowa.uiowa.edu
  • Follower 1 runs on host = r124087.iowa.uiowa.edu
  • Follower 2 runs on host = r124087.iowa.uiowa.edu
=================

Command: which relion_refine_mpi --continue Refine3D/job028/run_it000_optimiser.star --o Refine3D/job028/run --dont_combine_weights_via_disc --preread_images --pool 3 --pad 1 --particle_diameter 120 --j 12 --gpu "2:2:2,1:1:1,0:0:0" --pipeline_control Refine3D/job028/

I tried --gpu "0,0,0:1,1,1:2,2,2" with the same outcome.

Thanks Dhiraj

dhiraj82 · Dec 27 '23 23:12

Below this, you should have something like the following. I need this information.

uniqueHost rb-calc02 has 4 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 1 mapped to device 0
 Thread 1 on follower 1 mapped to device 0
 Thread 2 on follower 1 mapped to device 0
 Thread 3 on follower 1 mapped to device 0
 Thread 4 on follower 1 mapped to device 0
 Thread 5 on follower 1 mapped to device 0
 Thread 6 on follower 1 mapped to device 0
 Thread 7 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 2 mapped to device 1
 Thread 1 on follower 2 mapped to device 1
 Thread 2 on follower 2 mapped to device 1
 Thread 3 on follower 2 mapped to device 1

biochem-fan · Dec 28 '23 00:12

Here it is:

uniqueHost r124087.iowa.uiowa.edu has 2 ranks.
Follower 1 will distribute threads over devices 2
 Thread 0 on follower 1 mapped to device 2
 Thread 1 on follower 1 mapped to device 2
 Thread 2 on follower 1 mapped to device 2
 Thread 3 on follower 1 mapped to device 2
 Thread 4 on follower 1 mapped to device 2
 Thread 5 on follower 1 mapped to device 2
 Thread 6 on follower 1 mapped to device 2
 Thread 7 on follower 1 mapped to device 2
 Thread 8 on follower 1 mapped to device 2
 Thread 9 on follower 1 mapped to device 2
 Thread 10 on follower 1 mapped to device 2
 Thread 11 on follower 1 mapped to device 2
Follower 2 will distribute threads over devices 2
 Thread 0 on follower 2 mapped to device 2
 Thread 1 on follower 2 mapped to device 2
 Thread 2 on follower 2 mapped to device 2
 Thread 3 on follower 2 mapped to device 2
 Thread 4 on follower 2 mapped to device 2
 Thread 5 on follower 2 mapped to device 2
 Thread 6 on follower 2 mapped to device 2
 Thread 7 on follower 2 mapped to device 2
 Thread 8 on follower 2 mapped to device 2
 Thread 9 on follower 2 mapped to device 2
 Thread 10 on follower 2 mapped to device 2
 Thread 11 on follower 2 mapped to device 2
Device 2 on r124087.iowa.uiowa.edu is split between 2 followers
Running CPU instructions in double precision.

dhiraj82 · Dec 28 '23 00:12

You have only 3 MPI processes, which means only two working processes. --gpu "2:2:2,1:1:1,0:0:0" means the first working process gets device 2 and the second working process also gets device 2. Your log above confirms this was interpreted as expected. There is no point specifying the latter half (2,1:1:1,0:0:0), because that part is for the third to the seventh processes, which were absent in your run.

With 3 MPI processes, please report the results of --gpu, --gpu "0:0", --gpu "1:1", --gpu "2:2" and --gpu "3:3". According to your previous statements, I guess --gpu will fail while --gpu "2:2" will succeed but let's confirm all of them.
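Concretely, keeping everything except --gpu identical to your job028 command, the five runs to compare would look something like this (commands abbreviated; fill in the full set of options from your job):

    mpirun -n 3 `which relion_refine_mpi` [...same options as job028...] --gpu ""
    mpirun -n 3 `which relion_refine_mpi` [...same options as job028...] --gpu "0:0"
    mpirun -n 3 `which relion_refine_mpi` [...same options as job028...] --gpu "1:1"
    mpirun -n 3 `which relion_refine_mpi` [...same options as job028...] --gpu "2:2"
    mpirun -n 3 `which relion_refine_mpi` [...same options as job028...] --gpu "3:3"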

biochem-fan · Dec 28 '23 00:12