[Enhancement] Backward compatible NVML Python bindings

Open XuehaiPan opened this issue 3 years ago • 0 comments

Runtime Environment

Operating system and version: Ubuntu 20.04 LTS
Terminal emulator and version: GNOME Terminal 3.36.2
Python version: 3.9.13
NVML version (driver version): 470.129.06
nvitop version or commit: v0.7.1
nvidia-ml-py version: 11.450.51
Locale: en_US.UTF-8

Context

The official NVML Python bindings (PyPI package nvidia-ml-py) do not guarantee backward compatibility across different NVIDIA driver versions. For example, NVML added nvmlDeviceGetComputeRunningProcesses_v2 and nvmlDeviceGetGraphicsRunningProcesses_v2 in CUDA 11.x drivers (R450+). But the package nvidia-ml-py unconditionally calls the latest version of the function from the unversioned function:

def nvmlDeviceGetComputeRunningProcesses_v2(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
    ret = fn(handle, byref(c_count), None)

    ...

def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v2(handle);

This raises an NVMLError_FunctionNotFound error on CUDA 10.x drivers (e.g. R430), because those drivers do not export the _v2 symbol.

Now v3 versions of the nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses functions come with the R510+ drivers. E.g., in nvidia-ml-py==11.515.48:

def nvmlDeviceGetComputeRunningProcesses_v3(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
    ret = fn(handle, byref(c_count), None)

    ...

def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v3(handle)
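
A backward-compatible unversioned wrapper could instead probe the versioned symbols from newest to oldest and use the first one the driver exports. The following is only a minimal sketch of that idea, not the nvidia-ml-py implementation: FunctionNotFound, AVAILABLE_SYMBOLS, get_function_pointer, and get_first_available are hypothetical stand-ins (for NVMLError_FunctionNotFound, the driver's exported symbols, and _nvmlGetFunctionPointer) so that the example runs without a GPU.

```python
class FunctionNotFound(Exception):
    """Hypothetical stand-in for pynvml.NVMLError_FunctionNotFound."""


# Hypothetical stand-in for the symbols exported by an old driver's
# libnvidia-ml.so: only the unversioned function exists.
AVAILABLE_SYMBOLS = {
    "nvmlDeviceGetComputeRunningProcesses": lambda handle: ["v1 result"],
}


def get_function_pointer(name):
    """Hypothetical stand-in for pynvml._nvmlGetFunctionPointer."""
    try:
        return AVAILABLE_SYMBOLS[name]
    except KeyError:
        raise FunctionNotFound(name) from None


def get_first_available(base_name, suffixes=("_v3", "_v2", "")):
    """Return the newest versioned function that the driver exports."""
    for suffix in suffixes:
        try:
            return get_function_pointer(base_name + suffix)
        except FunctionNotFound:
            continue  # symbol missing on this driver, try an older version
    raise FunctionNotFound(base_name)


fn = get_first_available("nvmlDeviceGetComputeRunningProcesses")
print(fn(None))
```

In real bindings the struct passed to each versioned function may differ as well, so the fallback would have to select the matching struct type together with the symbol.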

A v2 version of the memory struct, c_nvmlMemory_v2_t, is also appearing on the horizon (not found in the R510 driver yet). This causes issue #13.

class c_nvmlMemory_t(_PrintableStructure):
    _fields_ = [
        ('total', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

class c_nvmlMemory_v2_t(_PrintableStructure):
    _fields_ = [
        ('version', c_uint),
        ('total', c_ulonglong),
        ('reserved', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

nvmlMemory_v2 = 0x02000028

def nvmlDeviceGetMemoryInfo(handle, version=None):
    if not version:
        c_memory = c_nvmlMemory_t()
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
    else:
        c_memory = c_nvmlMemory_v2_t()
        c_memory.version = version
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
    ret = fn(handle, byref(c_memory))
    _nvmlCheckReturn(ret)
    return c_memory
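
The constant nvmlMemory_v2 = 0x02000028 appears to follow NVML's struct-version convention: the size of the struct in the low bits, combined with the version number shifted left by 24 bits. The sketch below checks this with ctypes. MemoryV2 and struct_version are hypothetical names used only for illustration, and the 40-byte size assumes a typical 64-bit platform where c_ulonglong is 8-byte aligned.

```python
from ctypes import Structure, c_uint, c_ulonglong, sizeof


class MemoryV2(Structure):
    """Same field layout as c_nvmlMemory_v2_t above (illustrative copy)."""

    _fields_ = [
        ("version", c_uint),
        ("total", c_ulonglong),
        ("reserved", c_ulonglong),
        ("free", c_ulonglong),
        ("used", c_ulonglong),
    ]


def struct_version(struct_type, version):
    """Pack the struct size and the version number into one integer."""
    return sizeof(struct_type) | (version << 24)


# 4 bytes (version) + 4 bytes padding + 4 * 8 bytes = 40 = 0x28 on a
# typical 64-bit platform; the version number 2 lands in the top byte.
print(hex(sizeof(MemoryV2)))
print(hex(struct_version(MemoryV2, 2)))
```

A caller of the quoted nvmlDeviceGetMemoryInfo would pass this value as the version argument, and would still need to fall back to the unversioned call on drivers that lack nvmlDeviceGetMemoryInfo_v2.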

Possible Solutions

1. Determine the best dependency version of nvidia-ml-py during installation.

   This requires the user to install the NVIDIA driver first, which may not be the case on a freshly installed system. Besides, it is hard to express this driver dependency in the package metadata.

2. Wait for the PyPI package nvidia-ml-py to become backward compatible.

   The package NVIDIA/go-nvml offers backward-compatible APIs:

   "The API is designed to be backwards compatible, so the latest bindings should work with any version of libnvidia-ml.so installed on your system."

   I posted this on the NVIDIA developer forums ([PyPI/nvidia-ml-py] Issue Reports for nvidia-ml-py) but have not received any official response yet.

3. Vendor nvidia-ml-py in nvitop. (Note: nvidia-ml-py is released under the BSD License.)

   This requires bumping the vendored version and making a minor release of nvitop each time a new version of nvidia-ml-py comes out.

4. Automatically patch the pynvml module when the first call to a versioned API fails. This can be achieved by manipulating the module's __dict__ attribute or its __class__ attribute.

   The goal of this solution is not to make fully backward-compatible Python bindings. That may be out of the scope of nvitop, e.g. ExcludedDeviceInfo -> BlacklistDeviceInfo. Also, note that this solution may cause performance issues due to a much deeper call stack.

Jul 23 '22 13:07 XuehaiPan