[Enhancement] Backward compatible NVML Python bindings
Runtime Environment
- Operating system and version: Ubuntu 20.04 LTS
- Terminal emulator and version: GNOME Terminal 3.36.2
- Python version: 3.9.13
- NVML version (driver version): 470.129.06
- nvitop version or commit: v0.7.1
- python-ml-py version: 11.450.51
- Locale: en_US.UTF-8
Context
The official NVML Python bindings (PyPI package `nvidia-ml-py`) do not guarantee backward compatibility across NVIDIA driver versions. For example, NVML added `nvmlDeviceGetComputeRunningProcesses_v2` and `nvmlDeviceGetGraphicsRunningProcesses_v2` in CUDA 11.x drivers (R450+), but the package `nvidia-ml-py` unconditionally calls the latest versioned function from the unversioned one:
```python
def nvmlDeviceGetComputeRunningProcesses_v2(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
    ret = fn(handle, byref(c_count), None)
    ...

def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v2(handle)
```
This causes an `NVMLError_FunctionNotFound` error on CUDA 10.x drivers (e.g. R430).
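A minimal reproduction sketch, assuming a machine with a CUDA 10.x driver (e.g. R430) and `nvidia-ml-py>=11.450.51` installed; the failure mode shown in the comments is the expectation described above:

```python
# On a CUDA 10.x driver (e.g. R430), the unversioned helper delegates to the
# _v2 symbol, which that driver's libnvidia-ml.so does not export.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
except pynvml.NVMLError_FunctionNotFound:
    # Expected on R430: nvmlDeviceGetComputeRunningProcesses_v2 is missing
    # from the driver library.
    print('v2 entry point not found in libnvidia-ml.so')
finally:
    pynvml.nvmlShutdown()
```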
Now, v3 versions of the `nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses` functions come with the R510+ drivers. E.g., in `nvidia-ml-py==11.515.48`:
```python
def nvmlDeviceGetComputeRunningProcesses_v3(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
    ret = fn(handle, byref(c_count), None)
    ...

def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v3(handle)
```
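One way to check which versioned entry point the installed driver actually exports is to probe the symbols directly. The snippet below is only an illustrative sketch: it relies on the private helper `pynvml._nvmlGetFunctionPointer` (which raises `NVMLError_FunctionNotFound` for symbols the loaded `libnvidia-ml.so` does not provide), and the function name `newest_running_processes_symbol` is made up here:

```python
# Probe the versioned symbols, newest first; NVML must be initialized so
# that libnvidia-ml.so is loaded before looking up function pointers.
import pynvml

def newest_running_processes_symbol():
    for name in ('nvmlDeviceGetComputeRunningProcesses_v3',
                 'nvmlDeviceGetComputeRunningProcesses_v2',
                 'nvmlDeviceGetComputeRunningProcesses'):
        try:
            pynvml._nvmlGetFunctionPointer(name)
        except pynvml.NVMLError_FunctionNotFound:
            continue
        return name  # newest symbol the installed driver supports
    return None

pynvml.nvmlInit()
print(newest_running_processes_symbol())  # e.g. '..._v2' on an R450 driver
pynvml.nvmlShutdown()
```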
The v2 version of the memory query API (`c_nvmlMemory_v2_t` / `nvmlDeviceGetMemoryInfo_v2`) is appearing on the horizon (not found in the R510 driver yet). This causes issue #13.
```python
class c_nvmlMemory_t(_PrintableStructure):
    _fields_ = [
        ('total', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

class c_nvmlMemory_v2_t(_PrintableStructure):
    _fields_ = [
        ('version', c_uint),
        ('total', c_ulonglong),
        ('reserved', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

nvmlMemory_v2 = 0x02000028

def nvmlDeviceGetMemoryInfo(handle, version=None):
    if not version:
        c_memory = c_nvmlMemory_t()
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
    else:
        c_memory = c_nvmlMemory_v2_t()
        c_memory.version = version
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
    ret = fn(handle, byref(c_memory))
    _nvmlCheckReturn(ret)
    return c_memory
```
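On the caller side, a hedged fallback between the two memory queries could look like the sketch below. It assumes the `nvmlDeviceGetMemoryInfo(handle, version=...)` signature quoted above; `get_memory_info` is a hypothetical helper, not part of the bindings:

```python
# Prefer the v2 memory query (which adds the 'reserved' field) and fall back
# to the v1 query on older bindings or on drivers whose libnvidia-ml.so does
# not export nvmlDeviceGetMemoryInfo_v2.
import pynvml

def get_memory_info(handle):
    if hasattr(pynvml, 'nvmlMemory_v2'):
        try:
            # Newer bindings: try the v2 query first.
            return pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)
        except pynvml.NVMLError_FunctionNotFound:
            pass  # The installed driver does not provide the v2 entry point.
    # Older bindings (no 'version' argument) or older drivers: use the v1 query.
    return pynvml.nvmlDeviceGetMemoryInfo(handle)
```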
Possible Solutions
- Determine the best dependency version of `nvidia-ml-py` during installation.

  This requires the user to install the NVIDIA driver first, which may not be fulfilled on a freshly installed system. Besides, it's hard to express this driver dependency in the package metadata.

- Wait for the PyPI package `nvidia-ml-py` to become backward compatible.

  The package `NVIDIA/go-nvml` offers backward compatible APIs:

  > The API is designed to be backwards compatible, so the latest bindings should work with any version of libnvidia-ml.so installed on your system.

  I posted this on the NVIDIA developer forums ("[PyPI/nvidia-ml-py] Issue Reports for `nvidia-ml-py`") but have not received any official response yet.

- Vendor `nvidia-ml-py` in `nvitop`. (Note: `nvidia-ml-py` is released under the BSD License.)

  This requires bumping the vendored version and making a minor release of `nvitop` each time a new version of `nvidia-ml-py` comes out.

- Automatically patch the `pynvml` module when the first call to a versioned API fails. This can be achieved by manipulating the `__dict__` attribute or the `module.__class__` attribute; a rough sketch of this approach follows the list.

  The goal of this solution is not to make fully backward-compatible Python bindings. That may be out of the scope of `nvitop`, e.g. `ExcludedDeviceInfo` -> `BlacklistDeviceInfo`. Also, note that this solution may cause performance issues due to a much deeper call stack.
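A minimal sketch of the patching idea from the last item, assuming the fallback is installed lazily on the first `NVMLError_FunctionNotFound`. The helper name and the fallback list are hypothetical and depend on which versioned wrappers the installed bindings still provide; this is not `nvitop`'s actual implementation:

```python
# Wrap a versioned pynvml API so that, when the newest symbol is missing from
# the driver's libnvidia-ml.so, the wrapper rebinds the module attribute
# (i.e. mutates the module __dict__) to a working fallback, so later calls
# skip the failing path entirely.
import pynvml

def _patch_on_function_not_found(name, fallbacks):
    original = getattr(pynvml, name)

    def patched(*args, **kwargs):
        try:
            return original(*args, **kwargs)
        except pynvml.NVMLError_FunctionNotFound:
            for fallback_name in fallbacks:
                fallback = getattr(pynvml, fallback_name, None)
                if fallback is None:
                    continue  # this wrapper is absent from the installed bindings
                try:
                    result = fallback(*args, **kwargs)
                except pynvml.NVMLError_FunctionNotFound:
                    continue  # this symbol is also missing from the driver
                # Rebind the module attribute so subsequent calls go straight
                # to the working fallback (no deeper call stack).
                setattr(pynvml, name, fallback)
                return result
            raise

    setattr(pynvml, name, patched)

# Hypothetical usage: let the unversioned wrapper fall back to an older
# versioned wrapper, if the installed bindings still provide one.
_patch_on_function_not_found(
    'nvmlDeviceGetComputeRunningProcesses',
    fallbacks=('nvmlDeviceGetComputeRunningProcesses_v2',),
)
```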