DeepSpeed
Retrieve CUDA available memory via `torch.cuda.mem_get_info()`
This PR refactors the `available_memory()` method of the CUDA accelerator to use `free, total = torch.cuda.mem_get_info()`. It also removes the hard dependency on `pynvml`.
Related PR:
- #4508
The `torch.cuda.mem_get_info()` function was added two years ago (May 26th, 2021). We already rely on `torch.cuda.is_bf16_supported()` without a torch version check in the next method below, and that function was added later, on August 26th, 2021. So we can assume `torch.cuda.mem_get_info()` is always available in the torch versions we support.
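For reference, a minimal sketch of the refactored accessor (the body below is illustrative, not the exact PR diff; the `device_index` parameter name is assumed from the accelerator interface):

```python
import torch

def available_memory(self, device_index=None):
    # torch.cuda.mem_get_info() wraps cudaMemGetInfo and returns
    # (free_bytes, total_bytes) for the given CUDA device ordinal,
    # honoring CUDA_VISIBLE_DEVICES (including UUID entries).
    free, _total = torch.cuda.mem_get_info(device_index)
    return free
```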
Rationale

- The official NVML Python binding package on PyPI is `nvidia-ml-py` rather than `pynvml`. See the documentation at https://pypi.org/project/pynvml:

  > This is a wrapper around the NVML library. For information about the NVML library, see the NVML developer page http://developer.nvidia.com/nvidia-management-library-nvml
  >
  > As of version 11.0.0, the NVML-wrappers used in pynvml are identical to those published through nvidia-ml-py.

- Depending on `pynvml` adds an extra dependency. It can also break a user's Python environment if they have `nvidia-ml-py` installed, because both `pynvml` and `nvidia-ml-py` provide the `pynvml` module. Relying on `torch.cuda.mem_get_info()` adds no extra dependency.

- Handling the `CUDA_VISIBLE_DEVICES` environment variable is complex: the variable can be a comma-separated list of integers or UUID strings, and currently we only support integers. `torch.cuda.mem_get_info()` calls the CUDA API directly, which needs no index conversion between CUDA and NVML (see the sketch after the linked code below).
https://github.com/microsoft/DeepSpeed/blob/6d7b44a838548d2e1878439613e1fbc17ddcfaf0/accelerator/cuda_accelerator.py#L156-L169
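The linked lines implement the NVML path being replaced. A simplified sketch of that kind of lookup, using the standard `pynvml` binding API, shows where a naive integer parse of `CUDA_VISIBLE_DEVICES` breaks down (illustrative only, not the exact DeepSpeed code):

```python
import os
import pynvml

def available_memory_via_nvml(cuda_index: int) -> int:
    # NVML enumerates *physical* GPUs, so the CUDA ordinal must first
    # be mapped back through CUDA_VISIBLE_DEVICES.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if visible:
        # This int() call raises ValueError for UUID entries such as
        # "GPU-3cd9eb06", which are valid values for the variable.
        nvml_index = int(visible.split(",")[cuda_index])
    else:
        nvml_index = cuda_index
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(nvml_index)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    pynvml.nvmlShutdown()
    return info.free
```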
```
$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-3cd9eb06-03f4-3b39-2f7b-48ee826b0a26)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-611f484b-7a5a-f1ae-5aac-64d2ddad1ab6)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-ba171e16-8df7-e1c4-5468-2ee35e18d1f0)
GPU 3: NVIDIA GeForce RTX 3090 (UUID: GPU-66bd9aec-436e-24eb-91e8-d31d6370d8f0)
GPU 4: NVIDIA GeForce RTX 3090 (UUID: GPU-9cc6b251-34a2-db9d-4ca0-7532f951aad2)
GPU 5: NVIDIA GeForce RTX 3090 (UUID: GPU-a6c609c1-078d-e47e-b418-8008e61a8cf6)
GPU 6: NVIDIA GeForce RTX 3090 (UUID: GPU-be37798a-62fb-ebee-90d2-01b018d81c6d)
GPU 7: NVIDIA GeForce RTX 3090 (UUID: GPU-8b2e78db-cff8-bb89-d9fd-64f1633df658)

$ export CUDA_VISIBLE_DEVICES="GPU-ba171e16,GPU-611f484b,GPU-3cd9eb06"
$ ipython

In [1]: import torch

In [2]: torch.cuda.memory_allocated(0)
Out[2]: 0

In [3]: torch.cuda.get_device_properties(0).total_memory
Out[3]: 25447170048

In [4]: torch.cuda.mem_get_info(0)
Out[4]: (510328832, 25447170048)

In [5]: from nvitop import CudaDevice

In [6]: cuda0 = CudaDevice(0)
   ...: cuda0
Out[6]: CudaDevice(cuda_index=0, nvml_index=2, name="NVIDIA GeForce RTX 3090", total_memory=24.00GiB)

In [7]: cuda0.memory_free()
Out[7]: 510328832

In [8]: cuda0.memory_used()
Out[8]: 24936841216

In [9]: cuda0.memory_total()
Out[9]: 25769803776
```
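Note how, with UUID entries in `CUDA_VISIBLE_DEVICES`, CUDA device 0 maps to physical GPU 2 (`nvml_index=2` in the `nvitop` output), and `torch.cuda.mem_get_info(0)` reports the same free memory (510328832 bytes) as NVML without any CUDA-to-NVML index conversion.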
Hi @XuehaiPan - thank you for the contribution. If I recall correctly, we had to use `pynvml` because we were getting inaccurate memory information from `torch` in some scenarios. @jeffra may be able to comment more on this.

Either way, I will try out this branch and see if that is still the case. In particular, this code is necessary for FastGen and DeepSpeed-MII.
If we aren't able to switch over, would it at least make sense to move to the `nvidia-ml-py` package, as it is more regularly updated and at least matches the CUDA version?