process_utilization_stats failed with NOT_FOUND error, Ubuntu 22.04
```rust
use nvml_wrapper::Nvml;

fn main() {
    let nvml = Nvml::init().unwrap();
    let device = nvml.device_by_index(0).unwrap();
    // Panics with NotFound even though nvidia-smi shows no running processes.
    let _stats = device.process_utilization_stats(None).unwrap();
}
```
`cargo run` fails with:
```
thread 'main' panicked at src/main.rs:7:53:
called `Result::unwrap()` on an `Err` value: NotFound
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
My device:
```
Fri Mar 15 07:01:16 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3080 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 43C P8 24W / 350W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```
It's quite strange: the first call to nvmlDeviceGetProcessUtilization, which is made only to retrieve the process count, returned 79 in my situation when it should have been 0.
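For context, here's a minimal hand-rolled sketch of that two-call pattern (FFI declared by hand here rather than through nvml-wrapper-sys, which is what the crate actually uses; the struct layout and return-code handling follow the NVML docs, with the error-code constants inlined):

```rust
use std::os::raw::{c_uint, c_ulonglong, c_void};
use std::ptr;

type NvmlDevice = *mut c_void;

// Field layout per the NVML docs for nvmlProcessUtilizationSample_t.
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct NvmlProcessUtilizationSample {
    pid: c_uint,
    time_stamp: c_ulonglong,
    sm_util: c_uint,
    mem_util: c_uint,
    enc_util: c_uint,
    dec_util: c_uint,
}

#[link(name = "nvidia-ml")]
extern "C" {
    fn nvmlDeviceGetProcessUtilization(
        device: NvmlDevice,
        utilization: *mut NvmlProcessUtilizationSample,
        process_samples_count: *mut c_uint,
        last_seen_time_stamp: c_ulonglong,
    ) -> c_uint; // nvmlReturn_t
}

unsafe fn process_utilization(
    device: NvmlDevice,
    last_seen: u64,
) -> Result<Vec<NvmlProcessUtilizationSample>, c_uint> {
    // Call 1: null buffer. NVML writes the required sample count and returns
    // NVML_ERROR_INSUFFICIENT_SIZE (7), or NVML_SUCCESS (0) if there are no
    // samples. This is the call that reported a count of 79 for me.
    let mut count: c_uint = 0;
    match nvmlDeviceGetProcessUtilization(device, ptr::null_mut(), &mut count, last_seen) {
        0 => return Ok(Vec::new()),
        7 => {}
        err => return Err(err),
    }
    // Call 2: buffer sized from call 1. NVML_ERROR_NOT_FOUND (6) is what
    // surfaces here as NvmlError::NotFound in nvml-wrapper.
    let mut buf = vec![NvmlProcessUtilizationSample::default(); count as usize];
    match nvmlDeviceGetProcessUtilization(device, buf.as_mut_ptr(), &mut count, last_seen) {
        0 => {
            buf.truncate(count as usize);
            Ok(buf)
        }
        err => Err(err),
    }
}
```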
Some observations:
- The problem persists between restarts of the NVML-using program.
- If there is only a single compute process running, and I restart it, then the problem temporarily disappears.
- Passing in a timestamp makes it far more likely to break. If I always pass in `None`, then it'll usually keep working for half a minute or so, polling every 2s (see the polling sketch after this list).
- nvtop doesn't have the issue. What do they do differently?
- Feels like a driver bug. Does this happen on every GPU?
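For reference, the polling loop is roughly this (a minimal sketch; I'm assuming the sample's timestamp field is named `timestamp`, as in nvml-wrapper's ProcessUtilizationSample):

```rust
use std::{thread, time::Duration};

use nvml_wrapper::Nvml;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let nvml = Nvml::init()?;
    let device = nvml.device_by_index(0)?;

    // Feed the newest sample timestamp back in, so each call only returns
    // samples newer than the previous poll. With `last_seen = None` on every
    // call it survives ~30s; with the timestamp it breaks much sooner.
    let mut last_seen: Option<u64> = None;
    loop {
        match device.process_utilization_stats(last_seen) {
            Ok(samples) => {
                last_seen = samples.iter().map(|s| s.timestamp).max().or(last_seen);
                println!("{} samples", samples.len());
            }
            Err(e) => println!("error: {e:?}"),
        }
        thread::sleep(Duration::from_secs(2));
    }
}
```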
Here's the relevant nvtop code. Looks pretty different: https://github.com/Syllo/nvtop/blob/0316ce19581c3d8543cf6aa312d1569c56ca754f/src/extract_gpuinfo_nvidia.c#L761
Another observation: processes appear only to be returned if they are running. An idle process doesn't end up in the array unless it was non-idle very recently. That accounts for what happens when I set the timestamp: it reduces the horizon.
It also means that swallowing the error (and returning `[]`) should be a valid workaround.
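Roughly like this (a sketch against nvml-wrapper; assuming ProcessUtilizationSample is at struct_wrappers::device and the error variant is NvmlError::NotFound):

```rust
use nvml_wrapper::error::NvmlError;
use nvml_wrapper::struct_wrappers::device::ProcessUtilizationSample;
use nvml_wrapper::Device;

/// Treat NOT_FOUND as "no recently active processes" instead of an error.
fn process_utilization_or_empty(
    device: &Device,
    last_seen_timestamp: Option<u64>,
) -> Result<Vec<ProcessUtilizationSample>, NvmlError> {
    match device.process_utilization_stats(last_seen_timestamp) {
        Err(NvmlError::NotFound) => Ok(Vec::new()),
        other => other,
    }
}
```

The downside is that this also masks a genuine NOT_FOUND from some unrelated cause, but for periodic polling that seems acceptable.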