
Level-zero does not provide file name for device

Open eero-t opened this issue 2 years ago • 11 comments

Neither the core nor the sysman device structures provide the device file name that corresponds to the L0 device.

This information would be useful e.g. with GPU metrics providers running under Kubernetes because:

  • Kubernetes device plugins use device file paths/names in device resource mapping, because CRI API uses those to specify devices to container runtimes: https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1/api.pb.go#L4041
  • Kubernetes scheduler custom rules could then easily map these metrics to the correct devices (doing that instead based on e.g. the device PCI address provided by Sysman would be fragile)

Currently one needs to scan devfs & sysfs and map that information to e.g. the L0 device BDF information [1], which is awkward, as L0 otherwise abstracts the HW from the application.

[1] In shell, device file name <-> BDF mapping would work like this for Intel GPUs:

$ for path in /dev/dri/card*; do
  card=$(basename $path);
  echo $card:;
  grep ^PCI_SLOT_NAME /sys/class/drm/$card/device/uevent;
done

card0:
PCI_SLOT_NAME=0000:07:00.0

(Kubernetes exposes only a subset of the devices available on the host to the container devfs, so this starts from that.)

eero-t avatar Nov 03 '21 16:11 eero-t

@eero-t we return the device name here: https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=zedevicegetproperties#_CPPv4N22ze_device_properties_t4nameE

char name[ZE_MAX_DEVICE_NAME]
[out] Device name
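
For reference, a minimal sketch of reading that field through the core API (assuming zeInit() has already been called and dev is a valid ze_device_handle_t):

#include <stdio.h>
#include <level_zero/ze_api.h>

/* Print the device name reported by the core API.
 * 'dev' is assumed to be a valid, already enumerated device handle. */
void print_device_name(ze_device_handle_t dev)
{
    ze_device_properties_t props = {0};
    props.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
    if (zeDeviceGetProperties(dev, &props) == ZE_RESULT_SUCCESS) {
        printf("name: %s\n", props.name);
    }
}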

jandres742 avatar Nov 03 '21 18:11 jandres742

At least the Intel compute-runtime level-zero GPU backend provides there [1] strings like "Intel(R) Iris(R) Xe MAX Graphics [0x4905]", whereas the corresponding device file name would look like "/dev/dri/card<X>".

[1] Code:

zes_device_properties_t props = {0};
props.stype = ZES_STRUCTURE_TYPE_DEVICE_PROPERTIES;
ze_result_t ret;

if (ret = zesDeviceGetProperties(dev, &props), ret == ZE_RESULT_SUCCESS) {
    const ze_device_properties_t *core = &props.core;
    printf("- name: %s\n", core->name);
}

eero-t avatar Nov 04 '21 09:11 eero-t

@eero-t Ah, you want the sysfs file entry. Yes, we are not returning it, but I guess we could look into doing it. Will start an internal discussion.

jandres742 avatar Nov 04 '21 14:11 jandres742

An additional complication is that there are (typically) two device files associated with each GPU device: "cardX" (starting from X=0) and "renderDXXX" (starting from XXX=128). Their numbering does not always stay in sync, because PCI slots may also have devices that provide only one of those device files.

=> The potential new API should probably provide both of these device file names.

Any idea what the device file names look like (and how many of them there are) on kernels other than Linux (e.g. Windows and BSD / Mach, i.e. Mac)?

eero-t avatar Nov 08 '21 15:11 eero-t

Instead of the device file name, it could also be the device file index (card: 0, render: 128).
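
For illustration, a minimal sketch of reading that index on Linux (assuming the index means the DRM minor number of the device node, as in the values above):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Print the DRM minor number ("index") of a device node,
 * e.g. 0 for /dev/dri/card0 and 128 for /dev/dri/renderD128. */
int print_drm_index(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0 || !S_ISCHR(st.st_mode))
        return -1;
    printf("%s: index %u\n", path, minor(st.st_rdev));
    return 0;
}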

eero-t avatar Nov 08 '21 16:11 eero-t

@eero-t For Linux, renderDXXX and cardX may be associated with the same GPU, but AFAIK each of them has different privileges. We use renderDXXX as UMD shouldn't use cardX. I don't remember what exactly the reason was here, but renderDXXX should be good enough for UMD. For Windows, we open an adapter which is queried from DXGI/DXCORE. There is (typically) only one entry associated with a given GPU.

BTW. could you explain why BDF is not enough, what is missing?

JablonskiMateusz avatar Nov 08 '21 16:11 JablonskiMateusz

We use renderDXXX as UMD shouldn't use cardX.

OK, so from the Sysman point of view it makes less sense to report the "cardX" device file name than "renderDXXX", although both are user-space visible properties of the same GPU device.

BTW. could you explain why BDF is not enough, what is missing?

Because the Intel Kubernetes GPU device plugin uses (cardX) device file names as GPU device identifiers, and BDFs do not reliably map to those across multiple machines (GPU devices of a specific type are not necessarily in the same slots on all machines). Therefore Intel GPU metrics (used e.g. by GPU-aware job scheduling in Kubernetes) need to be labeled with device file names.

  • Nvidia plugin seems to use UUID:
    • https://github.com/NVIDIA/k8s-device-plugin/blob/master/cmd/nvidia-device-plugin/server.go#L270
  • And AMD plugin BDF:
    • https://github.com/RadeonOpenCompute/k8s-device-plugin/blob/master/cmd/k8s-device-plugin/main.go#L128
    • https://github.com/RadeonOpenCompute/k8s-device-plugin/blob/master/internal/pkg/amdgpu/amdgpu.go#L87

But basically the Kubernetes device plugin device ID can be anything (to help debugging, it's better for it to be some externally visible property of the GPU device, so one knows which device is in question).

PS. The device file name used by the Intel GPU plugin is a more "user friendly" device identifier than a BDF or UUID, and it requires writing fewer custom rules for things that need to cover all GPU devices in the whole cluster while still uniquely identifying each GPU within each node (because the set of possible values for X in cardX is smaller than the set of possible BDFs and UUIDs).

eero-t avatar Nov 08 '21 18:11 eero-t

I see that the AMD plugin's BDF implementation is strongly vendor specific, as it is based on /sys/module/amdgpu/drivers. However, isn't BDF the most generic mechanism, independent of the vendor? E.g. for an Intel integrated GPU whose BDF is 00:02.0, I can query the card number by listing symlinks in /sys/dev/char/ in this way:

$ ll /sys/dev/char/ | grep 0000:00:02.0 | grep drm/card
lrwxrwxrwx 1 root root 0 lis  8 21:07 226:0 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0

Could you check whether this method also applies to other vendors that expose the device's BDF, and to Kubernetes scenarios?
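
For example, a minimal sketch of that lookup in C (assuming the BDF is given in the full "0000:00:02.0" form used by the symlink targets):

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <limits.h>
#include <unistd.h>

/* Find the DRM "cardX" name for a given PCI BDF by scanning the
 * /sys/dev/char symlinks, like the shell one-liner above.
 * Returns 0 and fills 'card' on success, -1 if not found. */
int bdf_to_card(const char *bdf, char *card, size_t len)
{
    DIR *dir = opendir("/sys/dev/char");
    if (!dir)
        return -1;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        char path[PATH_MAX], target[PATH_MAX];
        snprintf(path, sizeof(path), "/sys/dev/char/%s", entry->d_name);
        ssize_t n = readlink(path, target, sizeof(target) - 1);
        if (n < 0)
            continue;
        target[n] = '\0';
        const char *drm = strstr(target, "/drm/card");
        if (!strstr(target, bdf) || !drm)
            continue;
        /* Skip nested entries such as connector nodes under cardX/. */
        if (strchr(drm + strlen("/drm/"), '/'))
            continue;
        snprintf(card, len, "%s", drm + strlen("/drm/"));
        closedir(dir);
        return 0;
    }
    closedir(dir);
    return -1;
}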

JablonskiMateusz avatar Nov 08 '21 20:11 JablonskiMateusz

/sys/dev/char/ seems to work also for amdgpu.

This is from an Intel Hades Canyon NUC, which includes an AMD Vega in the same package as the KBL iGPU:

$ ls -l /sys/dev/char/ | grep drm
lrwxrwxrwx 1 root root 0 marras  9 08:35 226:0 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0
lrwxrwxrwx 1 root root 0 marras  9 08:35 226:1 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/drm/card1
lrwxrwxrwx 1 root root 0 marras  9 08:35 226:128 -> ../../devices/pci0000:00/0000:00:02.0/drm/renderD128
lrwxrwxrwx 1 root root 0 marras  9 08:35 226:129 -> ../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/drm/renderD129

$ head /sys/class/drm/{card[0-9],renderD*}/device/uevent
==> /sys/class/drm/card0/device/uevent <==
DRIVER=i915
PCI_CLASS=38000
PCI_ID=8086:591B
PCI_SUBSYS_ID=8086:2073
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d0000591Bsv00008086sd00002073bc03sc80i00

==> /sys/class/drm/card1/device/uevent <==
DRIVER=amdgpu
PCI_CLASS=30000
PCI_ID=1002:694C
PCI_SUBSYS_ID=8086:2073
PCI_SLOT_NAME=0000:01:00.0
MODALIAS=pci:v00001002d0000694Csv00008086sd00002073bc03sc00i00

==> /sys/class/drm/renderD128/device/uevent <==
DRIVER=i915
PCI_CLASS=38000
PCI_ID=8086:591B
PCI_SUBSYS_ID=8086:2073
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d0000591Bsv00008086sd00002073bc03sc80i00

==> /sys/class/drm/renderD129/device/uevent <==
DRIVER=amdgpu
PCI_CLASS=30000
PCI_ID=1002:694C
PCI_SUBSYS_ID=8086:2073
PCI_SLOT_NAME=0000:01:00.0
MODALIAS=pci:v00001002d0000694Csv00008086sd00002073bc03sc00i00

The above was with yesterday's "drm-tip" kernel.

Note that e.g. with older Ubuntu kernels (5.11 from 20.04), /sys/dev/char/ can apparently also contain the display connection items from /sys/class/drm/:

bxt-nuc:~$ ls -l /sys/dev/char/ | grep drm
lrwxrwxrwx 1 root root 0 marras  9 15:55 226:0 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0
lrwxrwxrwx 1 root root 0 marras  9 15:55 226:128 -> ../../devices/pci0000:00/0000:00:02.0/drm/renderD128
lrwxrwxrwx 1 root root 0 marras  9 15:55 235:0 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/card0-DP-1/drm_dp_aux0
lrwxrwxrwx 1 root root 0 marras  9 15:55 235:1 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/card0-DP-2/drm_dp_aux1
lrwxrwxrwx 1 root root 0 marras  9 15:55 89:5 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/card0-DP-1/i2c-5/i2c-dev/i2c-5
lrwxrwxrwx 1 root root 0 marras  9 15:55 89:6 -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/card0-DP-2/i2c-6/i2c-dev/i2c-6

eero-t avatar Nov 09 '21 14:11 eero-t

Since we already expose the BDF, would that mechanism work for you?

JablonskiMateusz avatar Nov 10 '21 17:11 JablonskiMateusz

Currently I am mapping the device file name to the Sysman device by using the Sysman-reported device BDF [1]. This ticket was just to find out whether the device file name could be provided by Sysman, but I guess the answer is becoming "no"?

[1] I scan devfs for device names and index that data by the BDF listed for the device in sysfs, then use that mapping to add the device file name (index) to the metrics data provided by Sysman.
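
In case it is useful to others, a rough sketch of that combination (assuming a sysfs lookup helper like the hypothetical bdf_to_card() sketched earlier in this thread, and a valid zes_device_handle_t):

#include <stdio.h>
#include <level_zero/zes_api.h>

/* Hypothetical helper from the earlier sketch: maps a BDF string to "cardX". */
int bdf_to_card(const char *bdf, char *card, size_t len);

/* Label Sysman data with the device file name by going through
 * the Sysman-reported BDF. */
int print_card_for_device(zes_device_handle_t dev)
{
    zes_pci_properties_t pci = {0};
    pci.stype = ZES_STRUCTURE_TYPE_PCI_PROPERTIES;
    if (zesDevicePciGetProperties(dev, &pci) != ZE_RESULT_SUCCESS)
        return -1;

    char bdf[64], card[32];
    snprintf(bdf, sizeof(bdf), "%04x:%02x:%02x.%x",
             (unsigned)pci.address.domain, (unsigned)pci.address.bus,
             (unsigned)pci.address.device, (unsigned)pci.address.function);

    if (bdf_to_card(bdf, card, sizeof(card)) != 0)
        return -1;

    printf("%s -> %s\n", bdf, card);
    return 0;
}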

eero-t avatar Nov 10 '21 17:11 eero-t

Closing as L0 is not going to provide this info.

eero-t avatar Nov 21 '23 10:11 eero-t