
[feature request] Add a way to set pod annotations for dcgm exporter

Open landorg opened this issue 3 years ago • 1 comments

Our monitoring system (Datadog) requires us to set pod annotations on the exporter pods. It would be great if you could add a way to set spec.template.metadata.annotations of the daemonset. Thanks

landorg avatar Apr 21 '22 16:04 landorg
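(For context on the Datadog use case: Datadog's Autodiscovery feature reads per-container pod annotations under the ad.datadoghq.com/ prefix. A hedged sketch of what such annotations could look like for the exporter; the container name nvidia-dcgm-exporter is an assumption, and the exact instance config keys vary by Agent version:)

# Hypothetical Autodiscovery annotations; container name, port, and
# check config are illustrative, not confirmed by this thread.
ad.datadoghq.com/nvidia-dcgm-exporter.check_names: '["openmetrics"]'
ad.datadoghq.com/nvidia-dcgm-exporter.init_configs: '[{}]'
ad.datadoghq.com/nvidia-dcgm-exporter.instances: |
  [{"prometheus_url": "http://%%host%%:9400/metrics", "namespace": "dcgm", "metrics": ["*"]}]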

We will look into adding this in a future release.

shivamerla avatar Apr 22 '22 18:04 shivamerla

+1 This would be super useful for us too

syandroo avatar Nov 09 '22 21:11 syandroo

I can see daemonsets.annotations in the output of helm -n gpu-operator get values gpu-operator --all (app version v22.9.2). Is it intended for this issue's use case?

When I declare these annotations in the chart values


daemonsets:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "9400"
    prometheus.io/scrape: "true"

the chart deploys successfully, but the gpu-operator pod crashes with this error:

{"level":"info","ts":1678700104.895753,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1678700104.8987215,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1678700104.903535,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1678700104.9083395,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1678700104.9132628,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1678700104.9150171,"msg":"Observed a panic in reconciler: assignment to entry in nil map","controller":"clusterpolicy-controller","object":{"name":"cluster-policy"},"namespace":"","name":"cluster-policy","reconcileID":"558a2f1a-5f56-41fe-a896-23a7b965c55b"}
panic: assignment to entry in nil map [recovered]
	panic: assignment to entry in nil map

goroutine 893 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118 +0x1f4
panic({0x1902300, 0x1df3cf0})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/NVIDIA/gpu-operator/controllers.applyCommonDaemonsetMetadata(...)
	/workspace/controllers/object_controls.go:589
github.com/NVIDIA/gpu-operator/controllers.preProcessDaemonSet(0xc002288480, {{0x1e0e8f8, 0xc0011953e0}, 0xc000a34000, {0xc00004a053, 0xc}, {0xc00159a000, 0x10, 0x10}, {0xc0003c5680, ...}, ...})
	/workspace/controllers/object_controls.go:567 +0xab8
github.com/NVIDIA/gpu-operator/controllers.DaemonSet({{0x1e0e8f8, 0xc0011953e0}, 0xc000a34000, {0xc00004a053, 0xc}, {0xc00159a000, 0x10, 0x10}, {0xc0003c5680, 0x10, ...}, ...})
	/workspace/controllers/object_controls.go:3099 +0x4a5
github.com/NVIDIA/gpu-operator/controllers.(*ClusterPolicyController).step(0x2b80c40)
	/workspace/controllers/state_manager.go:885 +0x136
github.com/NVIDIA/gpu-operator/controllers.(*ClusterPolicyReconciler).Reconcile(0xc0003e90e0, {0x1e0e8f8, 0xc0011953e0}, {{{0x0, 0x0}, {0xc000881d80, 0xe}}})
	/workspace/controllers/clusterpolicy_controller.go:135 +0x4e5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1e0e850?, {0x1e0e8f8?, 0xc0011953e0?}, {{{0x0?, 0x1a78ee0?}, {0xc000881d80?, 0xc0013a35d0?}}})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00022c8c0, {0x1e0e850, 0xc000b33080}, {0x1982860?, 0xc0009ace80?})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320 +0x33c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00022c8c0, {0x1e0e850, 0xc000b33080})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:230 +0x333

This crash doesn't happen without daemonsets.annotations in the chart values.
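(The "assignment to entry in nil map" panic above is a standard Go failure mode: a declared-but-uninitialized map cannot be written to. A minimal self-contained sketch of the bug pattern and the kind of guard the operator's applyCommonDaemonsetMetadata presumably needed; the helper name ensureAnnotations is hypothetical:)

```go
package main

import "fmt"

// ensureAnnotations is a hypothetical helper illustrating the fix:
// initialize the map before writing into it. Writing into a nil map,
//     var m map[string]string
//     m["k"] = "v"
// panics with exactly the error in the log above.
func ensureAnnotations(annotations map[string]string) map[string]string {
	if annotations == nil {
		annotations = make(map[string]string)
	}
	annotations["prometheus.io/scrape"] = "true"
	return annotations
}

func main() {
	// Safe even when the DaemonSet spec carried no annotations (nil map).
	m := ensureAnnotations(nil)
	fmt.Println(m["prometheus.io/scrape"]) // prints "true"
}
```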

kosyak avatar Mar 13 '23 09:03 kosyak

Hi all! Does this work? Did you find anything that works on any of the newer releases?

dcgmExporter:
  podAnnotations:

alep avatar Jun 21 '23 18:06 alep

The reported issue should be fixed in later releases. Please try out the latest version. Setting the daemonsets.annotations Helm parameter should be reflected on all DaemonSets that we create.

shivamerla avatar Sep 08 '23 21:09 shivamerla

Closing this issue as GPU Operator v23.3.0+ supports the daemonsets.annotations field for configuring custom annotations for all DaemonSets that GPU Operator manages.

cdesiniotis avatar Jan 31 '24 00:01 cdesiniotis
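(For anyone landing here later: per the closing comment, on GPU Operator v23.3.0+ the values that previously crashed the operator should now work. A sketch of the chart values, matching the daemonsets.annotations field named in this thread; the specific Prometheus annotations are the ones kosyak tried:)

daemonsets:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "9400"
    prometheus.io/scrape: "true"

These annotations should then be applied to spec.template.metadata.annotations of every DaemonSet the operator manages, including the dcgm-exporter one.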