Relax the pin on pynvml again

Open wence- opened this issue 2 years ago • 14 comments

Handling the str vs. bytes discrepancy should have been covered by the changes in #1118.
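For context, the str vs. bytes discrepancy is that nvidia-ml-py returns Python str where older pynvml releases returned bytes, so consuming code has to normalize the value. A minimal sketch of that kind of normalization (illustrative only, not the actual change from #1118):

```python
import pynvml

def _as_text(value):
    # Older pynvml returns bytes for string queries; nvidia-ml-py returns str.
    return value.decode("utf-8") if isinstance(value, bytes) else value

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print(_as_text(pynvml.nvmlDeviceGetName(handle)))
pynvml.nvmlShutdown()
```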

wence- avatar Feb 24 '23 13:02 wence-

There were two pins in the PR below, but only one unpin in this PR.

  • https://github.com/rapidsai/dask-cuda/pull/1128/files

Should pyproject.toml also be unpinned?
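For reference, the conda requirements and pyproject.toml each carry their own pin, so relaxing it means touching both places. A hypothetical pyproject.toml fragment (the bounds shown here are illustrative and may not match the PR exactly):

```toml
[project]
dependencies = [
    # Relaxing the pin means dropping (or raising) the "<11.5" upper bound:
    "pynvml>=11.0.0,<11.5",
]
```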

ajschmidt8 avatar Feb 24 '23 15:02 ajschmidt8

Oh thanks, I suspect so (pushed that change). Thanks for the sharp eyes!

wence- avatar Feb 24 '23 17:02 wence-

One CI job is failing with this error:

Unable to start CUDA Context
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/opt/conda/envs/test/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/opt/conda/envs/test/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/dask_cuda/initialize.py", line 31, in _create_cuda_context
    distributed.comm.ucx.init_once()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/comm/ucx.py", line 136, in init_once
    pre_existing_cuda_context = has_cuda_context()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 219, in has_cuda_context
    if _running_process_matches(handle):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 179, in _running_process_matches
    running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v3(handle);
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

jakirkham avatar Feb 24 '23 21:02 jakirkham

Rerunning CI to see if the Dask 2023.2.1 release helped

jakirkham avatar Feb 25 '23 04:02 jakirkham

> Rerunning CI to see if the Dask 2023.2.1 release helped

I imagine the problem is that pynvml has been updated to require a v3 version of a function in nvml, but that doesn't exist in cuda 11.2?
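One way to confirm that is to check which versioned entry points the driver's libnvidia-ml.so actually exports, independent of pynvml. A small diagnostic sketch (Linux-only; assumes libnvidia-ml.so.1 is on the loader path):

```python
import ctypes

# Load NVML directly, the same library pynvml dlopens on Linux.
nvml = ctypes.CDLL("libnvidia-ml.so.1")

for symbol in (
    "nvmlDeviceGetComputeRunningProcesses",
    "nvmlDeviceGetComputeRunningProcesses_v2",
    "nvmlDeviceGetComputeRunningProcesses_v3",
):
    try:
        getattr(nvml, symbol)
        print(f"{symbol}: exported")
    except AttributeError:
        # Drivers from the CUDA 11.2 era predate the _v3 symbol, which matches
        # the NVMLError_FunctionNotFound in the traceback above.
        print(f"{symbol}: missing")
```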

wence- avatar Feb 25 '23 09:02 wence-

This is WIP until a solution for backwards compatibility is decided on in nvidia-ml-py (and/or pynvml); until then we should just keep pynvml at 11.4.1.

wence- avatar Feb 28 '23 16:02 wence-

Going to double check this, but my understanding is we want PyNVML 11.5 for CUDA 12 support

jakirkham avatar Feb 28 '23 23:02 jakirkham

Agreed, but it seems we need that fix to land in nvidia-ml-py first as we can't work around that in a reasonable manner.

pentschev avatar Mar 01 '23 08:03 pentschev

> Going to double check this, but my understanding is we want PyNVML 11.5 for CUDA 12 support

I don't think that is necessary, unless we need features in nvml that were only introduced in cuda 12.

Specifically, I have CTK 12 on my system, I install pynvml < 11.5, and all the queries work. The C API preserves backwards compatibility, so old versions of pynvml work fine with new versions of libnvidia-ml.so. The problem is the other way round: new versions of pynvml don't work with old versions of libnvidia-ml.so.
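Put differently, supporting both directions means the Python wrapper has to fall back to an older entry point when a newer one is missing from the driver. A rough sketch of that fallback pattern (hypothetical helper, not the actual nvidia-ml-py/pynvml API; assumes pynvml.nvmlInit() has already been called):

```python
import pynvml

def get_compute_running_processes(handle):
    # Prefer the newest versioned call, falling back for drivers whose
    # libnvidia-ml.so only exports the _v2 or unversioned symbols.
    for fn_name in (
        "nvmlDeviceGetComputeRunningProcesses_v3",
        "nvmlDeviceGetComputeRunningProcesses_v2",
        "nvmlDeviceGetComputeRunningProcesses",
    ):
        fn = getattr(pynvml, fn_name, None)
        if fn is None:
            continue  # this pynvml release doesn't wrap this variant
        try:
            return fn(handle)
        except pynvml.NVMLError_FunctionNotFound:
            continue  # this driver's libnvidia-ml.so lacks the symbol
    raise RuntimeError("No usable nvmlDeviceGetComputeRunningProcesses variant found")
```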

wence- avatar Mar 01 '23 10:03 wence-

This is pending resolution of NVBug 4008080.

pentschev avatar Jul 28 '23 19:07 pentschev

Curious where things landed here. AFAICT this is still pinned.

jakirkham avatar Jun 25 '24 06:06 jakirkham

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Jul 10 '24 15:07 copy-pr-bot[bot]

/ok to test

pentschev avatar Jul 10 '24 15:07 pentschev
