
[BUG] LocalCUDACluster doesn't work with NVIDIA MIG

Status: Open · drobison00 opened this issue 4 years ago · 34 comments

(py)nvml does not appear to be compatible with MIG, which prevents various Dask services from working correctly, for example 'LocalCUDACluster'.

While this isn't explicitly Dask-CUDA's fault, the end result is the same. Adding this issue for others to reference, and for discussion of potential workarounds.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(device_memory_limit=1.0, rmm_managed_memory=True)
client = Client(cluster)
---------------------------------------------------------------------------
NVMLError_NoPermission                    Traceback (most recent call last)
<ipython-input-1-48e0ebf5a2e9> in <module>
     33 
     34 
---> 35 cluster = LocalCUDACluster(device_memory_limit=1.0,
     36                            rmm_managed_memory=True)
     37 client = Client(cluster)

/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cuda/local_cuda_cluster.py in __init__(self, n_workers, threads_per_worker, processes, memory_limit, device_memory_limit, CUDA_VISIBLE_DEVICES, data, local_directory, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, ucx_net_devices, rmm_pool_size, rmm_managed_memory, jit_unspill, **kwargs)
    166             memory_limit, threads_per_worker, n_workers
    167         )
--> 168         self.device_memory_limit = parse_device_memory_limit(
    169             device_memory_limit, device_index=0
    170         )

/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cuda/utils.py in parse_device_memory_limit(device_memory_limit, device_index)
    478         device_memory_limit = float(device_memory_limit)
    479         if isinstance(device_memory_limit, float) and device_memory_limit <= 1:
--> 480             return int(get_device_total_memory(device_index) * device_memory_limit)
    481 
    482     if isinstance(device_memory_limit, str):

/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cuda/utils.py in get_device_total_memory(index)
    158     """
    159     pynvml.nvmlInit()
--> 160     return pynvml.nvmlDeviceGetMemoryInfo(
    161         pynvml.nvmlDeviceGetHandleByIndex(index)
    162     ).total

/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py in nvmlDeviceGetMemoryInfo(handle)
   1286     fn = get_func_pointer("nvmlDeviceGetMemoryInfo")
   1287     ret = fn(handle, byref(c_memory))
-> 1288     check_return(ret)
   1289     return c_memory
   1290 

/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py in check_return(ret)
    364 def check_return(ret):
    365     if (ret != NVML_SUCCESS):
--> 366         raise NVMLError(ret)
    367     return ret
    368 

NVMLError_NoPermission: Insufficient Permissions

drobison00 avatar Apr 19 '21 20:04 drobison00

Thanks @drobison00 for filing this; indeed we'll have to find a way to work around NVML. I don't know whether we can do that without creating a CUDA context ahead of time; it may be that the only option is to require the user to specify parameters such as total device memory explicitly.
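
A possible stopgap along those lines (untested with MIG, and only a partial workaround, since other NVML queries made at startup may still fail) would be to pass an explicit byte-count limit instead of a fraction, which skips the get_device_total_memory call shown in the traceback above:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# "4GB" is an example value: an absolute limit avoids the fractional-limit
# code path in parse_device_memory_limit that queries NVML for the device's
# total memory, although other NVML calls during cluster startup may still
# raise errors on MIG systems.
cluster = LocalCUDACluster(device_memory_limit="4GB", rmm_managed_memory=True)
client = Client(cluster)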

pentschev avatar Apr 19 '21 21:04 pentschev

https://github.com/gpuopenanalytics/pynvml/issues/30

drobison00 avatar Apr 20 '21 22:04 drobison00

I do not have access to an A100, but the latest (unreleased) version of pynvml should include MIG-supported NVML bindings. I believe we will need to modify get_device_total_memory to optionally pass a MIG device handle when necessary. As a first-order functionality test, someone could try adding a try/except for the current NVMLError and retrying with a MIG handle, e.g.:

def get_device_total_memory(index=0):
    """
    Return total memory of CUDA device with index
    """
    pynvml.nvmlInit()
    try:
        return pynvml.nvmlDeviceGetMemoryInfo(
            pynvml.nvmlDeviceGetHandleByIndex(index)
        ).total
    except pynvml.NVMLError:
        return pynvml.nvmlDeviceGetMemoryInfo(
            pynvml.nvmlDeviceGetMigDeviceHandleByIndex(index)
        ).total

rjzamora avatar May 10 '21 23:05 rjzamora

Sounds like a good idea @rjzamora. A100s have been very scarce lately; I think we may be able to test that out in a week or two, when Selene is open again for general usage.

pentschev avatar May 11 '21 11:05 pentschev

I made the suggested changes on a GCP A100 MIG system and hit the error below; the error spams continuously until the process is killed.

I tried using the latest pynvml from conda, as well as a manual install from source.

>>> client = Client(cluster)
tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 16 memory: 315 MB fds: 45>>
Traceback (most recent call last):
  File "/root/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/root/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/distributed/system_monitor.py", line 96, in update
    gpu_extra = nvml.one_time()
  File "/root/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 47, in one_time
    "memory-total": pynvml.nvmlDeviceGetMemoryInfo(h).total,
  File "/root/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/pynvml/nvml.py", line 1984, in nvmlDeviceGetMemoryInfo
    _nvmlCheckReturn(ret)
  File "/root/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NoPermission: Insufficient Permissions

drobison00 avatar Jun 30 '21 19:06 drobison00

We've run into permission issues in the past, though MIG might require something else. Possible solutions are documented here: https://github.com/gpuopenanalytics/pynvml#nvml-permissions

quasiben avatar Jun 30 '21 19:06 quasiben

I was able to at least circumvent the Insufficient Permissions error with the following code:

def get_device_total_memory(index=0, migindex=0):
    """
    Return total memory of CUDA device with index
    """
    import pynvml

    pynvml.nvmlInit()
    try:
        return pynvml.nvmlDeviceGetMemoryInfo(
            pynvml.nvmlDeviceGetHandleByIndex(index)
        ).total
    except pynvml.NVMLError:
        # Fall back to a MIG device handle: the parent GPU handle plus the
        # index of the MIG instance on that GPU.
        return pynvml.nvmlDeviceGetMemoryInfo(
            pynvml.nvmlDeviceGetMigDeviceHandleByIndex(
                device=pynvml.nvmlDeviceGetHandleByIndex(index), index=migindex
            )
        ).total

However, as I understand from the API, the nvmlDeviceGetMigDeviceHandleByIndex function expects both the handle of the parent GPU and an index for the specific MIG device (here I am passing 0 for the MIG device index). I believe some modifications to dask-cuda will be needed to pass in the MIG device index as well if we start a LocalCUDACluster with multiple MIG instances.

akaanirban avatar Jul 12 '21 14:07 akaanirban

Yes, the code above is possible now with the latest pyNVML. However, I feel this is still a bit more complicated to handle: Dask-CUDA now needs to know which devices are MIG devices and which are not. Presently it relies only upon CUDA_VISIBLE_DEVICES, which is automatically defined as list(range(pynvml.nvmlDeviceGetCount())). To be honest, I haven't had the chance to work with MIG yet and don't know exactly what happens when MIG devices are available: are they treated just as one more device? For example, imagine a system with 2 GPUs configured as follows:

  • GPU 0
    • MIG 0
    • MIG 1
  • GPU 1

In the case above, will there be 3 indices that can be passed to CUDA_VISIBLE_DEVICES, where 0->GPU 0 MIG 0, 1->GPU 0 MIG 1, 2->GPU 1? And if so, are we able to reliably identify which ones are MIG devices and which are just regular devices?
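
For reference, a rough pynvml sketch (assuming a pynvml build that ships the MIG bindings) of how one might enumerate MIG children per parent GPU and tell them apart from regular devices:

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    # nvmlDeviceGetMigMode returns (current_mode, pending_mode); it raises
    # NVMLError_NotSupported on GPUs that cannot do MIG at all.
    try:
        mig_enabled = (
            pynvml.nvmlDeviceGetMigMode(handle)[0] == pynvml.NVML_DEVICE_MIG_ENABLE
        )
    except pynvml.NVMLError_NotSupported:
        mig_enabled = False

    if not mig_enabled:
        print(f"GPU {i}: regular device")
        continue

    # Walk the MIG children of this parent GPU; unused slots raise NotFound.
    for mig_index in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)):
        try:
            mig_handle = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, mig_index)
        except pynvml.NVMLError_NotFound:
            continue
        total = pynvml.nvmlDeviceGetMemoryInfo(mig_handle).total
        print(f"GPU {i} / MIG {mig_index}: {total} bytes")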

pentschev avatar Jul 12 '21 15:07 pentschev

I tested a few things. I used a VM on AWS with 8 A100 GPUs, enabled MIG on GPU 0, and divided it into seven 5GB instances.

MIG instance configuration:

ubuntu@ip-172-31-48-89:~$ sudo nvidia-smi -mig 1 -i 0
Enabled MIG Mode for GPU 00000000:10:1C.0
All done.
ubuntu@ip-172-31-48-89:~$ sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -i 0
Successfully created GPU instance ID  9 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created GPU instance ID  7 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created GPU instance ID  8 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created GPU instance ID 11 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created GPU instance ID 12 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created GPU instance ID 13 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created GPU instance ID 14 on GPU  0 using profile MIG 1g.5gb (ID 19)
ubuntu@ip-172-31-48-89:~$ sudo nvidia-smi mig -i 0 -cci -gi 7,8,9,11,12,13,14
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  7 using profile MIG 1g.5gb (ID  0)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  8 using profile MIG 1g.5gb (ID  0)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  9 using profile MIG 1g.5gb (ID  0)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 11 using profile MIG 1g.5gb (ID  0)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 12 using profile MIG 1g.5gb (ID  0)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 13 using profile MIG 1g.5gb (ID  0)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 14 using profile MIG 1g.5gb (ID  0)
ubuntu@ip-172-31-48-89:~$
ubuntu@ip-172-31-48-89:~$ nvidia-smi
Mon Jul 12 22:17:40 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:10:1C.0 Off |                   On |
| N/A   40C    P0    47W / 400W |    102MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:10:1D.0 Off |                    0 |
| N/A   40C    P0    56W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:20:1C.0 Off |                    0 |
| N/A   41C    P0    57W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:20:1D.0 Off |                    0 |
| N/A   37C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:90:1C.0 Off |                    0 |
| N/A   40C    P0    55W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:1D.0 Off |                    0 |
| N/A   37C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:A0:1C.0 Off |                    0 |
| N/A   42C    P0    55W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:A0:1D.0 Off |                    0 |
| N/A   40C    P0    59W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |     80MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      4MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   14   0   6  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    7    0       5239      C   ...s/rapids-21.06/bin/python       73MiB |
+-----------------------------------------------------------------------------+

Some of the tests were done in a notebook running on the bare VM. For other tests, I used the RAPIDS 21.06 Docker container, restricting which GPUs the container can see with the --gpus flag. I will describe the setup for each case as needed.

Observations:

  1. Currently, LocalCUDACluster requires entries in the CUDA_VISIBLE_DEVICES argument to have a MIG-GPU- prefix if we want to specify MIG instances: https://github.com/rapidsai/dask-cuda/blob/branch-21.08/dask_cuda/utils.py#L467 . Non-MIG GPUs can be specified via integer indices or with a GPU- prefix.

  2. LocalCUDACluster fails when I try to use MIG instances by specifying the MIG-enabled GPU by its index, i.e. CUDA_VISIBLE_DEVICES="0". This is directly on the VM.

    Error details:
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster
    
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0")
    cluster
    ---------------------------------------------------------------------------
    tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 15 memory: 306 MB fds: 53>>
    Traceback (most recent call last):
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
        return self.callback()
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/distributed/system_monitor.py", line 99, in update
        gpu_metrics = nvml.real_time()
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 38, in real_time
        "utilization": pynvml.nvmlDeviceGetUtilizationRates(h).gpu,
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/pynvml/nvml.py", line 2058, in nvmlDeviceGetUtilizationRates
        _nvmlCheckReturn(ret)
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
        raise NVMLError(ret)
    pynvml.nvml.NVMLError_NotSupported: Not Supported
    


    Note: If we test the same thing by attaching GPU 0 by index to a Docker container via docker run --gpus '"device=0"' --rm -it rapidsai/rapidsai:21.06-cuda11.0-runtime-ubuntu18.04-py3.8, we get the same error as in the next bullet point.

  3. LocalCUDACluster fails when I try to use MIG instances from inside a Docker container (a case similar to running under GKE or EKS). I start the container with docker run --gpus '"device=0:0,0:1,0:2"' --rm -it rapidsai/rapidsai:21.06-cuda11.0-runtime-ubuntu18.04-py3.8 to allow it to see only the 1st, 2nd, and 3rd MIG instances of GPU 0.

    Error details:
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster
    
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="MIG-GPU-0:0,MIG-GPU-0:1,MIG-GPU-0:2")
    cluster
    ---------------------------------------------------------------------------
    NVMLError_NoPermission                    Traceback (most recent call last)
    <ipython-input-2-7a3566f39e2f> in <module>
    ----> 1 cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="MIG-GPU-0:0,MIG-GPU-0:1,MIG-GPU-0:2")
        2 cluster
    
    /opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cuda/local_cuda_cluster.py in __init__(self, CUDA_VISIBLE_DEVICES, n_workers, threads_per_worker, memory_limit, device_memory_limit, data, local_directory, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, ucx_net_devices, rmm_pool_size, rmm_managed_memory, rmm_async, rmm_log_directory, jit_unspill, log_spilling, **kwargs)
        214             memory_limit, threads_per_worker, n_workers
        215         )
    --> 216         self.device_memory_limit = parse_device_memory_limit(
        217             device_memory_limit, device_index=0
        218         )
    
    /opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cuda/utils.py in parse_device_memory_limit(device_memory_limit, device_index)
        525         device_memory_limit = float(device_memory_limit)
        526         if isinstance(device_memory_limit, float) and device_memory_limit <= 1:
    --> 527             return int(get_device_total_memory(device_index) * device_memory_limit)
        528 
        529     if isinstance(device_memory_limit, str):
    
    /opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cuda/utils.py in get_device_total_memory(index)
        185     """
        186     pynvml.nvmlInit()
    --> 187     return pynvml.nvmlDeviceGetMemoryInfo(
        188         pynvml.nvmlDeviceGetHandleByIndex(index)
        189     ).total
    
    /opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py in nvmlDeviceGetMemoryInfo(handle)
    1982     fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
    1983     ret = fn(handle, byref(c_memory))
    -> 1984     _nvmlCheckReturn(ret)
    1985     return c_memory
    1986 
    
    /opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
        741 def _nvmlCheckReturn(ret):
        742     if (ret != NVML_SUCCESS):
    --> 743         raise NVMLError(ret)
        744     return ret
        745 
    
    NVMLError_NoPermission: Insufficient Permissions
    


    This error goes away if I make the changes mentioned in https://github.com/rapidsai/dask-cuda/issues/583#issuecomment-878349249 to get_device_total_memory. However, nvmlDeviceGetMigDeviceHandleByIndex needs both the handle of the parent GPU and the MIG instance index; these are not passed in correctly at the moment, even though we no longer get the permissions error. Hence we will need to handle these changes in the dask-cuda code.


  4. LocalCUDACluster fails when I try to use MIG instances directly (without Docker), but with a different error when I use CUDA_VISIBLE_DEVICES to denote the MIG instances. This needs further investigation.

    Error details:
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster
    
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="MIG-GPU-0:0,MIG-GPU-0:1,MIG-GPU-0:2")
    cluster
    ---------------------------------------------------------------------------
        Unable to start CUDA Context
    Traceback (most recent call last):
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 237, in initialize
        self.cuInit(0)
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 300, in safe_cuda_api_call
        self._check_error(fname, retcode)
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 335, in _check_error
        raise CudaAPIError(retcode, msg)
    numba.cuda.cudadrv.driver.CudaAPIError: [100] Call to cuInit results in CUDA_ERROR_NO_DEVICE
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/dask_cuda/initialize.py", line 142, in dask_setup
        numba.cuda.current_context()
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 212, in get_context
        return _runtime.get_or_create_context(devnum)
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 138, in get_or_create_context
        return self._get_or_create_context_uncached(devnum)
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 151, in _get_or_create_context_uncached
        with driver.get_active_context() as ac:
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 393, in __enter__
        driver.cuCtxGetCurrent(byref(hctx))
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 280, in __getattr__
        self.initialize()
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 240, in initialize
        raise CudaSupportError("Error at driver init: \n%s:" % e)
    numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: 
    [100] Call to cuInit results in CUDA_ERROR_NO_DEVICE:
    Unable to start CUDA Context
    Traceback (most recent call last):
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 237, in initialize
        self.cuInit(0)
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 300, in safe_cuda_api_call
        self._check_error(fname, retcode)
    File "/home/ubuntu/miniconda3/envs/rapids-21.06/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 335, in _check_error
        raise CudaAPIError(retcode, msg)
    numba.cuda.cudadrv.driver.CudaAPIError: [100] Call to cuInit results in CUDA_ERROR_NO_DEVICE
    


  5. LocalCUDACluster succeeds when I use non-MIG devices directly, with or without Docker.


Based on these PoCs, there appear to be some discrepancies. We think we first need to properly identify what type of device each entry in CUDA_VISIBLE_DEVICES refers to. Once that is done, we need to query the GPUs with the right NVML call via the right pynvml API in several places, such as get_cpu_affinity, get_device_total_memory, etc.


Action plan after discussion with @pentschev:

  1. First, map the MIG counterparts of the pynvml APIs we use in dask_cuda/utils.py. We should be able to write an is_mig_device utility function that parses a device entry and returns whether it is a MIG device or not (see the sketch after this list); this can subsequently be used in get_cpu_affinity and get_device_total_memory to pick the correct pynvml APIs.

  2. Second, add a more user-friendly error when trying to start a CUDA worker on a MIG-enabled device. See error 2 above.

  3. Third, add handling for the default Dask-CUDA setup when using a hybrid deployment of MIG-enabled and MIG-disabled GPUs. Suppose a user wants the following configuration:

    • GPU 0 (MIG enabled)
      • MIG 0
      • MIG 1
    • GPU 1 (MIG not enabled)

    Three possible solution approaches are applicable in such a scenario:
      a. Rely on the default behavior: create workers only on the non-MIG devices, and use MIG devices only when they are explicitly specified via CUDA_VISIBLE_DEVICES.
      b. Add a new argument --mig that creates workers on all MIG devices (ignoring the non-MIG ones), where the default behavior (when --mig is NOT specified) would be to create workers on all non-MIG devices.
      c. Create 3 workers with 3 completely different memory sizes and characteristics. Generally a bad idea.

    This perhaps needs much more discussion before we do anything.
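
A rough sketch of what the is_mig_device helper from step 1 could look like (hypothetical and untested; the eventual implementation in dask_cuda/utils.py may differ), assuming entries in CUDA_VISIBLE_DEVICES are either integer indices or UUID strings:

import pynvml

def is_mig_device(device):
    """Return True if ``device`` (an entry from CUDA_VISIBLE_DEVICES, either
    an integer index or a UUID string such as "GPU-<uuid>"/"MIG-<uuid>")
    refers to a MIG instance.
    """
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(int(device))
    except ValueError:
        # Not an integer, so assume it is a UUID string.
        handle = pynvml.nvmlDeviceGetHandleByUUID(str(device).encode())
    try:
        return bool(pynvml.nvmlDeviceIsMigDeviceHandle(handle))
    except pynvml.NVMLError:
        # Older drivers/pynvml builds without MIG support.
        return False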

akaanirban avatar Jul 13 '21 00:07 akaanirban

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Nov 23 '21 20:11 github-actions[bot]

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Nov 23 '21 20:11 github-actions[bot]

Based on https://github.com/rapidsai/dask-cuda/pull/674 , it sounds like this may have been resolved. Is this still an issue, or can it be closed?

beckernick avatar Jan 06 '22 15:01 beckernick

I believe this can be closed. @pentschev, can you confirm?


akaanirban avatar Jan 06 '22 15:01 akaanirban

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Feb 05 '22 16:02 github-actions[bot]

@pentschev are we ok to close this? Sounds like yes from the conversation above, but please let us know if something is still missing here

jakirkham avatar Apr 13 '22 05:04 jakirkham

I think we want to keep this open, because not all parts of the action plan in https://github.com/rapidsai/dask-cuda/issues/583#issuecomment-878675364 were completed. It would be good to have them addressed at some point, if it becomes a priority and someone has the bandwidth to work on them.

pentschev avatar Apr 13 '22 12:04 pentschev

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar May 13 '22 13:05 github-actions[bot]

We're still seeing this issue when running the latest Merlin image (nvcr.io/nvidia/merlin/merlin-pytorch:22.06), which includes CUDA 11.7, dask-cuda==22.04, and pynvml==11.4.1. Happens on both driver 515.48.07 and 510.47.03 if that makes any difference.

In [1]: from dask_cuda import LocalCUDACluster

In [2]: cluster = LocalCUDACluster("MIG-e65035fb-733c-5aeb-9a88-e20f5f0cb0b5")
---------------------------------------------------------------------------
NVMLError_NoPermission                    Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 cluster = LocalCUDACluster("MIG-e65035fb-733c-5aeb-9a88-e20f5f0cb0b5")

File /usr/local/lib/python3.8/dist-packages/dask_cuda/local_cuda_cluster.py:337, in LocalCUDACluster.__init__(self, CUDA_VISIBLE_DEVICES, n_workers, threads_per_worker, memory_limit, device_memory_limit, data, local_directory, shared_filesystem, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, rmm_pool_size, rmm_maximum_pool_size, rmm_managed_memory, rmm_async, rmm_log_directory, rmm_track_allocations, jit_unspill, log_spilling, worker_class, pre_import, **kwargs)
    330     worker_class = partial(
    331         LoggedNanny if log_spilling is True else Nanny,
    332         worker_class=worker_class,
    333     )
    335 self.pre_import = pre_import
--> 337 super().__init__(
    338     n_workers=0,
    339     threads_per_worker=threads_per_worker,
    340     memory_limit=self.memory_limit,
    341     processes=True,
    342     data=data,
    343     local_directory=local_directory,
    344     protocol=protocol,
    345     worker_class=worker_class,
    346     config={
    347         "distributed.comm.ucx": get_ucx_config(
    348             enable_tcp_over_ucx=enable_tcp_over_ucx,
    349             enable_nvlink=enable_nvlink,
    350             enable_infiniband=enable_infiniband,
    351             enable_rdmacm=enable_rdmacm,
    352         )
    353     },
    354     **kwargs,
    355 )
    357 self.new_spec["options"]["preload"] = self.new_spec["options"].get(
    358     "preload", []
    359 ) + ["dask_cuda.initialize"]
    360 self.new_spec["options"]["preload_argv"] = self.new_spec["options"].get(
    361     "preload_argv", []
    362 ) + ["--create-cuda-context"]

File /usr/local/lib/python3.8/dist-packages/distributed/deploy/local.py:236, in LocalCluster.__init__(self, name, n_workers, threads_per_worker, processes, loop, start, host, ip, scheduler_port, silence_logs, dashboard_address, worker_dashboard_address, diagnostics_port, services, worker_services, service_kwargs, asynchronous, security, protocol, blocked_handlers, interface, worker_class, scheduler_kwargs, scheduler_sync_interval, **worker_kwargs)
    233 worker = {"cls": worker_class, "options": worker_kwargs}
    234 workers = {i: worker for i in range(n_workers)}
--> 236 super().__init__(
    237     name=name,
    238     scheduler=scheduler,
    239     workers=workers,
    240     worker=worker,
    241     loop=loop,
    242     asynchronous=asynchronous,
    243     silence_logs=silence_logs,
    244     security=security,
    245     scheduler_sync_interval=scheduler_sync_interval,
    246 )

File /usr/local/lib/python3.8/dist-packages/distributed/deploy/spec.py:260, in SpecCluster.__init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name, shutdown_on_close, scheduler_sync_interval)
    258 if not self.asynchronous:
    259     self._loop_runner.start()
--> 260     self.sync(self._start)
    261     try:
    262         self.sync(self._correct_state)

File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    307     return future
    308 else:
--> 309     return sync(
    310         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    311     )

File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376, in sync(loop, func, callback_timeout, *args, **kwargs)
    374 if error:
    375     typ, exc, tb = error
--> 376     raise exc.with_traceback(tb)
    377 else:
    378     return result

File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349, in sync.<locals>.f()
    347         future = asyncio.wait_for(future, callback_timeout)
    348     future = asyncio.ensure_future(future)
--> 349     result = yield future
    350 except Exception:
    351     error = sys.exc_info()

File /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762, in Runner.run(self)
    759 exc_info = None
    761 try:
--> 762     value = future.result()
    763 except Exception:
    764     exc_info = sys.exc_info()

File /usr/local/lib/python3.8/dist-packages/distributed/deploy/spec.py:292, in SpecCluster._start(self)
    290     if isinstance(cls, str):
    291         cls = import_term(cls)
--> 292     self.scheduler = cls(**self.scheduler_spec.get("options", {}))
    293     self.scheduler = await self.scheduler
    294 self.scheduler_comm = rpc(
    295     getattr(self.scheduler, "external_address", None) or self.scheduler.address,
    296     connection_args=self.security.get_connection_args("client"),
    297 )

File /usr/local/lib/python3.8/dist-packages/distributed/scheduler.py:3983, in Scheduler.__init__(self, loop, delete_interval, synchronize_worker_interval, services, service_kwargs, allowed_failures, extensions, validate, scheduler_file, security, worker_ttl, idle_timeout, interface, host, port, protocol, dashboard_address, dashboard, http_prefix, preload, preload_argv, plugins, **kwargs)
   3924 self.handlers = {
   3925     "register-client": self.add_client,
   3926     "scatter": self.scatter,
   (...)
   3978     "dump_cluster_state_to_url": self.dump_cluster_state_to_url,
   3979 }
   3981 connection_limit = get_fileno_limit() / 2
-> 3983 super().__init__(
   3984     # Arguments to SchedulerState
   3985     aliases=aliases,
   3986     clients=clients,
   3987     workers=workers,
   3988     host_info=host_info,
   3989     resources=resources,
   3990     tasks=tasks,
   3991     unrunnable=unrunnable,
   3992     validate=validate,
   3993     plugins=plugins,
   3994     # Arguments to ServerNode
   3995     handlers=self.handlers,
   3996     stream_handlers=merge(worker_handlers, client_handlers),
   3997     io_loop=self.loop,
   3998     connection_limit=connection_limit,
   3999     deserialize=False,
   4000     connection_args=self.connection_args,
   4001     **kwargs,
   4002 )
   4004 if self.worker_ttl:
   4005     pc = PeriodicCallback(self.check_worker_ttl, self.worker_ttl * 1000)

File /usr/local/lib/python3.8/dist-packages/distributed/scheduler.py:2105, in SchedulerState.__init__(self, aliases, clients, workers, host_info, resources, tasks, unrunnable, validate, plugins, **kwargs)
   2102 self._transition_counter = 0
   2104 # Call Server.__init__()
-> 2105 super().__init__(**kwargs)

File /usr/local/lib/python3.8/dist-packages/distributed/core.py:191, in Server.__init__(self, handlers, blocked_handlers, stream_handlers, connection_limit, deserialize, serializers, deserializers, connection_args, timeout, io_loop)
    189 self._comms = {}
    190 self.deserialize = deserialize
--> 191 self.monitor = SystemMonitor()
    192 self.counters = None
    193 self.digests = None

File /usr/local/lib/python3.8/dist-packages/distributed/system_monitor.py:59, in SystemMonitor.__init__(self, n)
     56     self.quantities["num_fds"] = self.num_fds
     58 if nvml.device_get_count() > 0:
---> 59     gpu_extra = nvml.one_time()
     60     self.gpu_name = gpu_extra["name"]
     61     self.gpu_memory_total = gpu_extra["memory-total"]

File /usr/local/lib/python3.8/dist-packages/distributed/diagnostics/nvml.py:139, in one_time()
    136 def one_time():
    137     h = _pynvml_handles()
    138     return {
--> 139         "memory-total": _get_memory_total(h),
    140         "name": _get_name(h),
    141     }

File /usr/local/lib/python3.8/dist-packages/distributed/diagnostics/nvml.py:116, in _get_memory_total(h)
    114 def _get_memory_total(h):
    115     try:
--> 116         return pynvml.nvmlDeviceGetMemoryInfo(h).total
    117     except pynvml.NVMLError_NotSupported:
    118         return None

File /usr/local/lib/python3.8/dist-packages/pynvml/nvml.py:2063, in nvmlDeviceGetMemoryInfo(handle)
   2061 fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
   2062 ret = fn(handle, byref(c_memory))
-> 2063 _nvmlCheckReturn(ret)
   2064 return c_memory

File /usr/local/lib/python3.8/dist-packages/pynvml/nvml.py:765, in _nvmlCheckReturn(ret)
    763 def _nvmlCheckReturn(ret):
    764     if (ret != NVML_SUCCESS):
--> 765         raise NVMLError(ret)
    766     return ret

NVMLError_NoPermission: Insufficient Permissions

neggert avatar Jul 05 '22 19:07 neggert

@neggert this error seems to be coming from the Dask dashboard. I think that was never really tested with the most recent additions, and indeed the NVML diagnostics in the Dask dashboard will not currently work with MIG. However, you may be able to disable it with:

with dask.config.set({"distributed.diagnostics.nvml": False}):
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="MIG-<uuid>")

pentschev avatar Jul 05 '22 19:07 pentschev

That didn't work, but setting the environment variable export DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False did. Thanks for pointing me in the right direction.

neggert avatar Jul 05 '22 20:07 neggert

That didn't work, but setting the environment variable export DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False did. Thanks for pointing me in the right direction.

This usually indicates a bug in the way dask config options are handled at import time. In this case it is because importing distributed runs distributed.diagnostics.nvml.device_get_count(), which initialises nvml before the config option suggested by @pentschev can be set.
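
In other words, a minimal sketch of the workaround (assuming nothing has imported dask or distributed earlier in the same process) is to put the setting in the environment before the first import, which is effectively what the exported environment variable does:

import os

# The NVML diagnostics option appears to be read very early, so it has to be
# in place before distributed (and therefore dask_cuda) is first imported.
os.environ["DASK_DISTRIBUTED__DIAGNOSTICS__NVML"] = "False"

from dask_cuda import LocalCUDACluster  # imports distributed under the hood

# "MIG-<uuid>" is a placeholder for an actual MIG device UUID.
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="MIG-<uuid>")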

In fact, even if this is fixed, there's another pitfall: the following

with dask.config.set({"distributed.diagnostics.nvml": False}):
    cluster = LocalCUDACluster("MIG-...")
do_stuff_with_cluster(cluster)

may well later try to initialise nvml again, since the "once-only" initialisation is not actually once-only if the first initialisation took place with nvml diagnostics switched off. Given the name, one might expect that init_once only runs initialisation and makes a decision the first time it is called, but:

export DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False
In [1]: import dask

In [2]: from distributed.diagnostics import nvml

In [3]: nvml.nvmlInitialized
Out[3]: False

In [4]: nvml.init_once()

In [5]: nvml.nvmlInitialized
Out[5]: False

In [6]: with dask.config.set({"distributed.diagnostics.nvml": True}):
   ...:     nvml.init_once()
   ...: 

In [7]: nvml.nvmlInitialized
Out[7]: True # Huh?

I'll try and find some time to handle this properly in distributed.

wence- avatar Jul 06 '22 09:07 wence-

Good catch @wence-, it seems this is a problem in https://github.com/dask/distributed/blob/75f4635b05034eba890c90d2f829c3672f59e017/distributed/diagnostics/nvml.py#L32-L34 , where one could indeed set the option to True later on and initialize, even if it was False the first time.

For a bit of context, the problem there is that there are so many different ways NVML may fail for different setups (e.g., pynvml is installed but not the NVIDIA driver, or both are installed but there are no GPUs on the system, etc.) that it is hard to make sure it works everywhere, and now MIG is a whole new issue by itself (which is why it isn't currently supported). Over the past year or so we changed that code path probably a dozen times, mostly to cover WSL2 cases, but it is impossible to realistically test all combinations, and thus issues like this may be introduced from time to time.

pentschev avatar Jul 06 '22 10:07 pentschev

I'll try and find some time to handle this properly in distributed.

dask/distributed#6678

wence- avatar Jul 06 '22 12:07 wence-

Please pardon my ignorance, but am I seeing the same (or similar) thing here:

Unable to start CUDA Context
Traceback (most recent call last):
  File "/home/perth/w47686/.conda/envs/rapids/lib/python3.9/site-packages/dask_cuda/initialize.py", line 31, in _create_cuda_context
    distributed.comm.ucx.init_once()
  File "/home/perth/w47686/.conda/envs/rapids/lib/python3.9/site-packages/distributed/comm/ucx.py", line 104, in init_once
    cuda_visible_device = int(
ValueError: invalid literal for int() with base 10: 'MIG-41518e05-dfc8-5485-a5f7-8948b6c213a4'

This is with dask_cuda=22.06.00, pynvml=11.4.1, and distributed=2022.05.2.

hendeb avatar Jul 12 '22 08:07 hendeb

@hendeb the error you're seeing is different; it's coming from Dask-CUDA rather than the Dask scheduler. Could you also post how you're starting up the cluster and a minimal reproducer of the client code?

pentschev avatar Jul 12 '22 08:07 pentschev

OK, thanks for taking the time to reply, @pentschev. I am starting up the cluster as per:

[screenshot: cluster startup code]

This is using MIG devices created on a pair of A100s:

[screenshot: MIG devices on the A100s]

hendeb avatar Jul 12 '22 08:07 hendeb

@hendeb I think the issues should be fixed by https://github.com/dask/distributed/pull/6720 and https://github.com/rapidsai/dask-cuda/pull/950 . If you have the chance, could you try both PRs and report back? Please note that you'll need to switch to RAPIDS 22.08 nightly builds, and then install those two PRs from source.

pentschev avatar Jul 12 '22 21:07 pentschev

Apologies for the slow reply, @pentschev, and thank you for your help. Using the RAPIDS 22.08 nightly build with PRs dask/distributed#6720 and rapidsai/dask-cuda#950 seems to be OK:

[screenshot: cluster starting successfully]

Something to maybe note is that the A100 driver used here has been updated since my last post (from 495.29.05 to 515.48.07).

hendeb avatar Jul 14 '22 02:07 hendeb

The newer driver version should be ok. Thanks @hendeb for confirming that it worked.

pentschev avatar Jul 14 '22 08:07 pentschev

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Aug 13 '22 09:08 github-actions[bot]