process_utilization_stats failed with NOT_FOUND error, Ubuntu 22.04
```rust
use nvml_wrapper::Nvml;

fn main() {
    let nvml = Nvml::init().unwrap();
    let device = nvml.device_by_index(0).unwrap();
    // Panics with NotFound even though nvidia-smi shows no running processes.
    let _stats = device.process_utilization_stats(None).unwrap();
}
```
`cargo run` fails with:
```
thread 'main' panicked at src/main.rs:7:53:
called `Result::unwrap()` on an `Err` value: NotFound
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
My device:
```
Fri Mar 15 07:01:16 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3080 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 43C P8 24W / 350W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```
It's quite strange: the first call to nvmlDeviceGetProcessUtilization, which is made only to retrieve the process count, returned 79 in my situation when it should have been 0.
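For context, here's a minimal hand-rolled sketch of that two-call pattern (FFI declared by hand here rather than through nvml-wrapper-sys, which is what the crate actually uses; the struct layout and return-code handling follow the NVML docs, with the error-code constants inlined):

```rust
use std::os::raw::{c_uint, c_ulonglong, c_void};
use std::ptr;

type NvmlDevice = *mut c_void;

// Field layout per the NVML docs for nvmlProcessUtilizationSample_t.
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct NvmlProcessUtilizationSample {
    pid: c_uint,
    time_stamp: c_ulonglong,
    sm_util: c_uint,
    mem_util: c_uint,
    enc_util: c_uint,
    dec_util: c_uint,
}

#[link(name = "nvidia-ml")]
extern "C" {
    fn nvmlDeviceGetProcessUtilization(
        device: NvmlDevice,
        utilization: *mut NvmlProcessUtilizationSample,
        process_samples_count: *mut c_uint,
        last_seen_time_stamp: c_ulonglong,
    ) -> c_uint; // nvmlReturn_t
}

unsafe fn process_utilization(
    device: NvmlDevice,
    last_seen: u64,
) -> Result<Vec<NvmlProcessUtilizationSample>, c_uint> {
    // Call 1: null buffer. NVML writes the required sample count and returns
    // NVML_ERROR_INSUFFICIENT_SIZE (7), or NVML_SUCCESS (0) if there are no
    // samples. This is the call that reported a count of 79 for me.
    let mut count: c_uint = 0;
    match nvmlDeviceGetProcessUtilization(device, ptr::null_mut(), &mut count, last_seen) {
        0 => return Ok(Vec::new()),
        7 => {}
        err => return Err(err),
    }
    // Call 2: buffer sized from call 1. NVML_ERROR_NOT_FOUND (6) is what
    // surfaces here as NvmlError::NotFound in nvml-wrapper.
    let mut buf = vec![NvmlProcessUtilizationSample::default(); count as usize];
    match nvmlDeviceGetProcessUtilization(device, buf.as_mut_ptr(), &mut count, last_seen) {
        0 => {
            buf.truncate(count as usize);
            Ok(buf)
        }
        err => Err(err),
    }
}
```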
Some observations:
- The problem persists between restarts of the NVML-using program.
- If there is only a single compute process running, and I restart it, then the problem temporarily disappears.
- Passing in a timestamp makes it far more likely to break. If I always pass in `None`, then it'll usually keep working for half a minute or so, polling every 2s (see the polling sketch after this list).
- nvtop doesn't have the issue. What do they do differently?
- Feels like a driver bug. Does this happen on every GPU?
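For reference, the polling loop is roughly this (a minimal sketch; I'm assuming the sample's timestamp field is named `timestamp`, as in nvml-wrapper's ProcessUtilizationSample):

```rust
use std::{thread, time::Duration};

use nvml_wrapper::Nvml;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let nvml = Nvml::init()?;
    let device = nvml.device_by_index(0)?;

    // Feed the newest sample timestamp back in, so each call only returns
    // samples newer than the previous poll. With `last_seen = None` on every
    // call it survives ~30s; with the timestamp it breaks much sooner.
    let mut last_seen: Option<u64> = None;
    loop {
        match device.process_utilization_stats(last_seen) {
            Ok(samples) => {
                last_seen = samples.iter().map(|s| s.timestamp).max().or(last_seen);
                println!("{} samples", samples.len());
            }
            Err(e) => println!("error: {e:?}"),
        }
        thread::sleep(Duration::from_secs(2));
    }
}
```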
Here's the relevant nvtop code. Looks pretty different: https://github.com/Syllo/nvtop/blob/0316ce19581c3d8543cf6aa312d1569c56ca754f/src/extract_gpuinfo_nvidia.c#L761
Another observation: processes appear only to be returned if they are running. An idle process doesn't end up in the array unless it was non-idle very recently. That accounts for what happens when I set the timestamp: it reduces the horizon.
It also means that swallowing the error (and returning `[]`) should be a valid workaround.
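Roughly like this (a sketch against nvml-wrapper; assuming ProcessUtilizationSample is at struct_wrappers::device and the error variant is NvmlError::NotFound):

```rust
use nvml_wrapper::error::NvmlError;
use nvml_wrapper::struct_wrappers::device::ProcessUtilizationSample;
use nvml_wrapper::Device;

/// Treat NOT_FOUND as "no recently active processes" instead of an error.
fn process_utilization_or_empty(
    device: &Device,
    last_seen_timestamp: Option<u64>,
) -> Result<Vec<ProcessUtilizationSample>, NvmlError> {
    match device.process_utilization_stats(last_seen_timestamp) {
        Err(NvmlError::NotFound) => Ok(Vec::new()),
        other => other,
    }
}
```

The downside is that this also masks a genuine NOT_FOUND from some unrelated cause, but for periodic polling that seems acceptable.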