
The pod for a given GPU in k8s mode cannot be captured

Open rokkiter opened this issue 3 months ago • 5 comments

What happened?

Unable to collect GPU metrics for relevant pods when using passthrough mode. For example, dcgm-exporter does not collect metrics when a VM created with kubevirt mounts a GPU in passthrough mode.

The kubevirt VMI YAML for attaching the GPU:

spec:
  domain:
    devices:
      ...
      gpus:
      - deviceName: nvidia.com/GP104GL_TESLA_P4
        name: gpu1

The resource request on the kubevirt launcher pod that needs to be monitored:

resources:
  ...
  requests:
    ...
    nvidia.com/GP104GL_TESLA_P4: "1"

I have several GPU cards attached in my cluster, and kubectl describe node shows the following:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests           Limits
  --------                       --------           ------
  ...
  nvidia.com/GP104GL_TESLA_P4    2                  2
  nvidia.com/GRID_P4-1Q          0                  0
  nvidia.com/GRID_P4-4Q          0                  0

In this case, the GPU cards are assigned to pods whose GPU metrics dcgm-exporter will not be able to capture.

In the following code, we can see the rule: resourceName == nvidiaResourceName or strings.HasPrefix(resourceName, nvidiaMigResourcePrefix), where nvidiaResourceName is "nvidia.com/gpu". This filters out devices mounted under any other resource name. https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/kubernetes.go#L142

func (p *PodMapper) toDeviceToPod(
	devicePods *podresourcesapi.ListPodResourcesResponse, sysInfo SystemInfo,
) map[string]PodInfo {
	deviceToPodMap := make(map[string]PodInfo)

	for _, pod := range devicePods.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, device := range container.GetDevices() {

				resourceName := device.GetResourceName()
				if resourceName != nvidiaResourceName {
					// Mig resources appear differently than GPU resources
					if !strings.HasPrefix(resourceName, nvidiaMigResourcePrefix) {
						continue
					}
				}
				...
			}
		}
	}

	return deviceToPodMap
}

This appears to be because dcgm-exporter strictly follows the Kubernetes specification for identifying GPU resources (refer to the k8s device plugin documentation), but that cannot cover all scenarios.

From the device plugin documentation: the ResourceName a plugin advertises needs to follow the extended resource naming scheme vendor-domain/resourcetype (for example, an NVIDIA GPU is advertised as nvidia.com/gpu).
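
A possible direction, shown here only as a minimal sketch: treat any resource name under the nvidia.com/ vendor domain as a GPU device instead of matching only nvidia.com/gpu and MIG resources. The helper name and the broad vendor-domain prefix check below are assumptions for illustration, not current dcgm-exporter behavior.

package main

import (
	"fmt"
	"strings"
)

const (
	nvidiaResourceName      = "nvidia.com/gpu"
	nvidiaMigResourcePrefix = "nvidia.com/mig-"
	// Hypothetical broader match covering passthrough / vGPU resource names.
	nvidiaVendorDomainPrefix = "nvidia.com/"
)

// isNvidiaDeviceResource reports whether a pod-resources resource name
// should be treated as an NVIDIA GPU device.
func isNvidiaDeviceResource(resourceName string) bool {
	if resourceName == nvidiaResourceName {
		return true
	}
	if strings.HasPrefix(resourceName, nvidiaMigResourcePrefix) {
		return true
	}
	// Assumption: also accept other resources advertised under the
	// nvidia.com vendor domain, e.g. nvidia.com/GP104GL_TESLA_P4 or
	// nvidia.com/GRID_P4-1Q.
	return strings.HasPrefix(resourceName, nvidiaVendorDomainPrefix)
}

func main() {
	for _, name := range []string{
		"nvidia.com/gpu",
		"nvidia.com/mig-1g.5gb",
		"nvidia.com/GP104GL_TESLA_P4",
		"nvidia.com/GRID_P4-1Q",
		"cpu",
	} {
		fmt.Printf("%-30s -> %v\n", name, isNvidiaDeviceResource(name))
	}
}

With a check like this, toDeviceToPod could keep its existing special cases and simply stop skipping the other nvidia.com/ resource names; whether that trade-off is acceptable (it would also match resources that are not plain GPUs) is the design question for the maintainers.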

What did you expect to happen?

GPU metrics can be collected when attaching a GPU card using kubevirt passthrough mode.

What is the GPU model?

What is the environment?

pod

How did you deploy the dcgm-exporter and what is the configuration?

GPU Operator

How can we reproduce the issue?

Mounting a GPU card using kubevirt passthrough mode.

What is the version?

Latest

Anything else we need to know?

There is some discussion in the kubevirt community: https://github.com/kubevirt/kubevirt/issues/11660

rokkiter avatar Apr 12 '24 03:04 rokkiter

@rokkiter , dcgm-exporter depends on https://github.com/NVIDIA/k8s-device-plugin and uses the pod-resources API to read the mapping between pods and devices: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/ .
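
For reference, reading that mapping directly from the kubelet pod-resources socket looks roughly like the sketch below. It is illustrative only (not the exporter's actual code); the socket path is the kubelet default, and the snippet just prints which resource name each container's devices are reported under, which is a quick way to check what the kubevirt launcher pod's GPU shows up as.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Default kubelet pod-resources socket; adjust if your kubelet uses a
	// different root directory.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to connect to kubelet: %v", err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("pod-resources List failed: %v", err)
	}

	// Print every device reported by the kubelet, grouped by pod/container,
	// together with the resource name it is advertised under.
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, device := range container.GetDevices() {
				fmt.Printf("%s/%s container=%s resource=%s devices=%v\n",
					pod.GetNamespace(), pod.GetName(), container.GetName(),
					device.GetResourceName(), device.GetDeviceIds())
			}
		}
	}
}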

KubeVirt support is a new environment for us. Can you give us details on setting up an environment to reproduce the issue?

Also, please explain your use case to justify the feature.

nvvfedorov avatar Apr 15 '24 14:04 nvvfedorov

For the installation environment, refer to https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-kubevirt.html

For configuring GPUs in kubevirt, refer to https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-kubevirt.html#add-gpu-resources-to-kubevirt-cr

Using Prometheus, I was able to get monitoring information for the pods (not created by kubevirt) in my environment that use nvidia.com/gpu resources, but not for the pods (created by kubevirt) that use nvidia.com/GRID_P4-1Q.

environment

  • kubevirt: v1.0.0
  • k8s: v1.25.6
  • gpu-operator: v23.9.0

rokkiter avatar Apr 18 '24 03:04 rokkiter

Node configuration to support pass-through mode:

  1. Enable IOMMU on the node. Refer to https://www.server-world.info/en/note?os=CentOS_7&p=kvm&f=10
  2. Add the label gpu.workload.config=vm-passthrough to the node.
  3. Update the gpu-operator config:
gpu-operator.sandboxWorkloads.enabled=true
gpu-operator.vfioManager.enabled=true
gpu-operator.sandboxDevicePlugin.enabled=true
gpu-operator.sandboxDevicePlugin.version=v1.2.4
gpu-operator.toolkit.version=v1.14.3-ubuntu20.04

rokkiter avatar Apr 22 '24 08:04 rokkiter

@rokkiter , thank you for the update and provided details.

nvvfedorov avatar Apr 24 '24 20:04 nvvfedorov

Thanks for focusing on this issue. I recently realized that nodes configured for pass-through mode do not get dcgm-exporter installed. Even if I manually label the node with nvidia.com/gpu.deploy.dcgm-exporter=true, the label is automatically removed! Although it doesn't seem possible to monitor kubevirt VM GPU usage at the moment, it would be nice to have a solution for this!

rokkiter avatar Apr 28 '24 09:04 rokkiter