
How can I use multiple GPUs' video memory?

Open xlminfei opened this issue 2 years ago • 9 comments

I tried to run the multimer model on a 3200aa homotetramer (four peptide chains) with 2 * A100 40G. It consumes all of the video memory of one A100 plus about 72GB of system memory, while the other A100 doesn't seem to be working and its video memory usage is close to zero. It took about 10 hours to complete the first model. I run into the same problem with the same 3200aa homotetramer on 4 * V100 32G and on 4 * P40.

xlminfei avatar May 31 '22 17:05 xlminfei

I've observed the same: adding GPUs to a job whose memory requirements exceed the memory available on one GPU doesn't reduce the computing time. Unified Memory uses the RAM of one GPU plus the system's RAM, but not the memory of the other GPUs. This page: https://elearning.bits.vib.be/courses/alphafold/lessons/alphafold-on-the-hpc/topic/computational-limits/ states: "AlphaFold supports multi-GPU usage, which means that when using two GPUs, the GPU memory can be almost doubled. Like this, longer sequences can be processed on a cluster with limited individual GPU memory. Note that you should increase the number of allocated CPU cores when allocating multiple GPUs." I haven't been able to reproduce this.

abiadak avatar Jun 03 '22 08:06 abiadak

I think you may be seeing this: I noticed that the bottleneck seems to be relaxation and not the unrelaxed prediction:

When two GPUs are used, each with an identical model and memory footprint and with GPU relaxation enabled, and each running its own folding job pinned to its respective GPU id, the unrelaxed predictions run smoothly without interference. But when one GPU starts a very long relaxation (e.g. a large protein), the other GPU always seems to wait for it once it reaches its own relaxation phase; only when the long relaxation finishes are both released to continue with unrelaxed prediction, so the long relaxation becomes the bottleneck. Note that the "stuck" state only seems to happen once the shorter task's GPU enters relaxation: while it is still doing unrelaxed prediction, it finishes that without problem. So the jobs appear to have independent unrelaxed prediction but converge on some shared bottleneck in the relaxation part, more pronounced when one of them is folding a larger protein; with small proteins it is barely noticeable.
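For reference, here is a minimal sketch of how two independent predictions might be pinned to their own GPUs from Python (the FASTA names are placeholders, and run_alphafold.py needs more flags than shown here, e.g. the database paths):

import os
import subprocess

# One independent AlphaFold run per GPU; the targets are hypothetical.
jobs = [
    ("0", "large_target.fasta"),   # long protein on GPU 0
    ("1", "small_target.fasta"),   # short protein on GPU 1
]

procs = []
for gpu_id, fasta in jobs:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = gpu_id   # this process only sees its own GPU
    procs.append(subprocess.Popen(
        ["python", "run_alphafold.py",
         f"--fasta_paths={fasta}",
         "--output_dir=out"],              # database/preset flags omitted
        env=env,
    ))

for p in procs:
    p.wait()

With the GPUs isolated like this, any interference between the two jobs would have to come from a shared host-side resource rather than from the GPUs themselves.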

Best, David

davidyanglee avatar Jun 03 '22 13:06 davidyanglee

Although this does not address the potential bottleneck issue mentioned above, personally I set these environment variables to allow a single AlphaFold prediction to pool the memory across multiple available GPUs.

TF_FORCE_UNIFIED_MEMORY=1 XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

Normally, though, those variables are assigned in run_docker.py here, so they shouldn't have to be set manually when using Docker. You can potentially increase the value of XLA_PYTHON_CLIENT_MEM_FRACTION, since that may be limiting the memory pool: #149
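For a non-Docker run, here is a minimal sketch of setting them the same way run_docker.py does, from Python before JAX/TensorFlow initialise the GPU backend:

import os

# Must be set before JAX / TensorFlow create their GPU clients.
os.environ["TF_FORCE_UNIFIED_MEMORY"] = "1"           # enable CUDA unified memory
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "4.0"  # let the XLA pool grow past one GPU's memory

# ...then import and run the prediction, e.g. run_alphafold's main().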

(I'm not support staff though; I just had a similar issue once, and this may be a different problem.)

epenning avatar Jun 03 '22 15:06 epenning

Hi @epenning, the bottleneck problem is very interesting. Setting TF_FORCE_UNIFIED_MEMORY=0, or even ENABLE_GPU_RELAX=False, still gives the same bottleneck, so it does not seem to be strictly GPU or GPU-memory related. It seems that when the system enters a stage of single-threaded processing, some resource is queued there. Are you seeing the same issue yourself? Best, David.

davidyanglee avatar Jun 03 '22 16:06 davidyanglee

@davidyanglee I actually haven't observed the issue you described yet myself, but only because lately I haven't been running any large proteins in parallel with relaxation enabled, so I haven't encountered the circumstances which might produce that bottleneck.

epenning avatar Jun 03 '22 20:06 epenning

Hello @epenning, I have set TF_FORCE_UNIFIED_MEMORY=1 and XLA_PYTHON_CLIENT_MEM_FRACTION=4.0, but the same thing happens.

xlminfei avatar Jun 04 '22 09:06 xlminfei

It turns out that I actually have the same problem too, even with TF_FORCE_UNIFIED_MEMORY=1 and an increased XLA_PYTHON_CLIENT_MEM_FRACTION value. The higher XLA_PYTHON_CLIENT_MEM_FRACTION lets AlphaFold use more of a single GPU's memory, which made me think I had solved the problem, but it is still not using the memory of the other GPUs. If the sequence is too long it crashes with an out-of-memory error.

Wed Jun  8 15:36:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:02:00.0 Off |                  Off |
|  0%   48C    P2    75W / 230W |  16122MiB / 16125MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     On   | 00000000:03:00.0 Off |                  Off |
|  0%   32C    P8    10W / 230W |    120MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 5000     On   | 00000000:82:00.0 Off |                  Off |
|  0%   32C    P8     2W / 230W |    120MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 5000     On   | 00000000:83:00.0 Off |                  Off |
|  0%   31C    P8    10W / 230W |    120MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
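As a rough sanity check against the output above (assuming the usual JAX convention that XLA_PYTHON_CLIENT_MEM_FRACTION is interpreted as a fraction of a single GPU's memory):

gpu_mem_gib = 16      # one Quadro RTX 5000, per the nvidia-smi output above
mem_fraction = 4.0    # XLA_PYTHON_CLIENT_MEM_FRACTION
pool_gib = gpu_mem_gib * mem_fraction
# With unified memory, anything beyond the first 16 GiB spills into host RAM,
# not into the three idle GPUs.
print(f"XLA memory pool limit: ~{pool_gib:.0f} GiB")  # ~64 GiB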

epenning avatar Jun 08 '22 20:06 epenning

@epenning, thanks for verifying. On my machine I also see the same problem no matter whether I set TF_FORCE_UNIFIED_MEMORY=0 or 1, create separate venvs, or use two different containers. Intriguingly, the hanging problem for me only occurs during the long relaxation step, and only when I use GPU relax. In fact, this is not only a problem during the model prediction step: during the long relaxation step, if I start any new task (e.g. jackhmmer) with a different GPU specified, it hangs at the (normal) "no TPU found" step until the long relaxation exits. The problem mostly goes away (no interference with the shorter task) when relaxation for the long task is set to use CPU only (it is fine to leave the shorter task with GPU relax). However, that is not really a solution, as CPU relaxation takes longer, and even longer for lengthy proteins. So there seems to be a bottleneck / limiting resource during a long GPU relaxation step.

davidyanglee avatar Jun 09 '22 03:06 davidyanglee

I tried to understand what's happening with the multi-GPU memory issue, since I have a large complex I'm trying to predict which is taking much longer than expected. I'm not very knowledgeable in this area, though, and did not find a solution. In case it's useful to the AlphaFold team or anyone else dealing with this issue, here's what I did find:

  • CUDA Unified Memory manages a common memory space including all of the GPU and system memory.
  • Because AlphaFold can only utilize one GPU, once that GPU's memory is filled, the process expands to use system memory. So effectively, using a high value of XLA_PYTHON_CLIENT_MEM_FRACTION just increases the amount of system memory available for use.
  • When the GPU uses data stored in system memory, access is not as fast as if the data were in GPU memory. It appears that data in system memory is migrated back to the GPU on demand when accessed, which could cause a significant slowdown in AlphaFold prediction time.
  • If migrating data from one GPU to another is faster than migrating from CPU to GPU, then I would think prediction time could be improved if the memory in the other GPUs was used instead of system memory.
  • Several sources indicate that it is not necessarily trivial to get Unified Memory to preferentially use the other GPUs instead of the system memory: Nvidia Forums, Tensorflow
  • Tensorflow is probably responsible for managing this behavior, so it's unclear whether AlphaFold can even fix this issue without significant effort.

So my takeaway so far is that this might be expected behavior? I'd love to be corrected though, so I'll wait and see if anyone from AlphaFold has a better response!
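A quick way to see the single-device behaviour described above, from a Python session inside the AlphaFold environment (exact device names depend on the JAX version):

import jax

print(jax.devices())             # e.g. [GpuDevice(id=0), ..., GpuDevice(id=3)]
print(jax.local_device_count())  # 4 GPUs are visible to JAX...
# ...but without pmap/explicit sharding, a computation is placed on the first
# device only, so the model runs on GPU 0 and overflows into unified (host)
# memory instead of using the other GPUs.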

epenning avatar Jun 10 '22 20:06 epenning

Hi, I also ran into this problem. Have you solved it? Can you tell me how to deal with it?

Violet969 avatar Dec 31 '22 08:12 Violet969

Hi, I am not an expert here, but for me at least the hangup generally occurs during the relaxation part, so I just run the unrelaxed prediction and the relaxation separately. Done this way, I run into few problems with hangups and delays. David.
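For reference, a sketch of what running the relaxation separately can look like, using the AmberRelaxation class that ships with AlphaFold (the file names are placeholders and the parameter values mirror the defaults in run_alphafold.py; check them against your version):

from alphafold.common import protein
from alphafold.relax import relax

# An unrelaxed PDB written by an earlier prediction-only run (hypothetical name).
with open("unrelaxed_model_1.pdb") as f:
    unrelaxed_protein = protein.from_pdb_string(f.read())

amber_relaxer = relax.AmberRelaxation(
    max_iterations=0,        # 0 = no iteration limit
    tolerance=2.39,
    stiffness=10.0,
    exclude_residues=[],
    max_outer_iterations=3,
    use_gpu=True,            # set to False to relax on CPU, as discussed above
)

relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)

with open("relaxed_model_1.pdb", "w") as f:
    f.write(relaxed_pdb_str)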

davidyanglee avatar Jan 04 '23 17:01 davidyanglee

Thanks for your reply. I solved this problem by changing the XLA_PYTHON_CLIENT_MEM_FRACTION value to 2.

Violet969 avatar Jan 05 '23 06:01 Violet969

Hi @xlminfei, the provided implementation of AlphaFold cannot be split across multiple GPUs (model parallelism). Note that most of the time taken to get the first prediction will be the MSA search; see a similar issue: https://github.com/deepmind/alphafold/issues/673. Note also that the relax stage can take a long time, and will take even longer when not running on a GPU. @abiadak, please note that https://elearning.bits.vib.be/courses/alphafold/lessons/alphafold-on-the-hpc/topic/computational-limits/ is not official documentation and is therefore not necessarily correct.