dask-cuda
Relax the pin on pynvml again
Handling the str vs. bytes discrepancy should have been covered by the changes in #1118.
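For context, the discrepancy is that some nvidia-ml-py/pynvml releases return bytes from string queries while newer ones return str. A minimal sketch of the kind of normalization involved (illustrative only, not necessarily how #1118 handles it):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)

# Depending on the installed pynvml/nvidia-ml-py version, this query returns
# bytes or str; normalize to str so downstream string handling is unaffected.
if isinstance(name, bytes):
    name = name.decode()
print(name)

pynvml.nvmlShutdown()
```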
There were two pins in the PR below, but only one unpin in this PR.
- https://github.com/rapidsai/dask-cuda/pull/1128/files
Should pyproject.toml also be unpinned?
Oh thanks, I suspect so (pushed that change). Thanks for the sharp eyes!
One CI job is failing with this error:
```
Unable to start CUDA Context
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/opt/conda/envs/test/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/opt/conda/envs/test/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/dask_cuda/initialize.py", line 31, in _create_cuda_context
    distributed.comm.ucx.init_once()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/comm/ucx.py", line 136, in init_once
    pre_existing_cuda_context = has_cuda_context()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 219, in has_cuda_context
    if _running_process_matches(handle):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 179, in _running_process_matches
    running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v3(handle);
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
```
Rerunning CI to see if the Dask 2023.2.1 release helped.
I imagine the problem is that pynvml has been updated to require a v3 version of a function in NVML, but that doesn't exist in CUDA 11.2?
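One quick way to check that hypothesis is to probe the driver's NVML library for the `_v3` entry point directly. A sketch, assuming the usual Linux soname seen in the traceback above:

```python
import ctypes

# Load the driver's NVML library (the same library named in the AttributeError).
nvml = ctypes.CDLL("libnvidia-ml.so.1")

try:
    # Newer pynvml releases dispatch nvmlDeviceGetComputeRunningProcesses to
    # this _v3 entry point, which older drivers do not export.
    nvml.nvmlDeviceGetComputeRunningProcesses_v3
    print("Driver exposes the _v3 symbol; a newer pynvml should work.")
except AttributeError:
    print("No _v3 symbol; this driver needs an older pynvml or a fallback path.")
```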
This is WIP until a solution for backwards compatibility is decided on in nvidia-ml-py (and/or pynvml). Until then we should just keep pynvml pinned at 11.4.1.
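For illustration, one possible shape for such a backwards-compat workaround would be to catch the missing-function error at the call site. This is only a sketch of the idea under discussion, not what nvidia-ml-py, pynvml, or dask-cuda actually adopted:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
except pynvml.NVMLError_FunctionNotFound:
    # Older drivers lack the _v3 entry point newer pynvml calls; treat this
    # as "no process information" instead of failing CUDA context setup.
    procs = []

print(len(procs), "compute processes reported")
pynvml.nvmlShutdown()
```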
Going to double-check this, but my understanding is that we want PyNVML 11.5 for CUDA 12 support.
Agreed, but it seems we need that fix to land in nvidia-ml-py first as we can't work around that in a reasonable manner.
> Going to double-check this, but my understanding is that we want PyNVML 11.5 for CUDA 12 support.
I don't think that is necessary, unless we need features in NVML that were only introduced in CUDA 12.
Specifically, I have CTK 12 on my system, I install pynvml < 11.5, and all the queries work. The C API preserves backwards compatibility, so old versions of pynvml work fine with new versions of libnvidia-ml.so. The problem is the other way round: new versions of pynvml don't work with old versions of libnvidia-ml.so.
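To illustrate the direction of compatibility, a sketch assuming an older pynvml (e.g. 11.4.1) installed alongside a newer driver such as the one shipped with CTK 12:

```python
import pynvml  # assuming pynvml/nvidia-ml-py < 11.5 is installed

pynvml.nvmlInit()

# These queries use long-standing NVML entry points, so an older pynvml keeps
# working against a newer libnvidia-ml.so. The reverse breaks when a newer
# pynvml requires entry points (such as ..._v3) that an older driver lacks.
print("driver:", pynvml.nvmlSystemGetDriverVersion())
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("total memory:", pynvml.nvmlDeviceGetMemoryInfo(handle).total)

pynvml.nvmlShutdown()
```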
This is pending resolution of NVBug 4008080.
Curious where things landed here. AFAICT this is still pinned.
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Pull request vetters can view their responsibilities here.
Contributors can view more details about this message here.
/ok to test
/ok to test