[feature request] Add a way to set pod annotations for the DCGM exporter
Our monitoring system (Datadog) requires us to set pod annotations on the exporter pods.
It would be great if you could add a way to set spec.template.metadata.annotations on the DaemonSet.
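For context, this is roughly what we would like the rendered DaemonSet to end up with. The annotation keys below are the generic Prometheus-style example rather than our actual Datadog configuration, and the DaemonSet name is only illustrative; any chart value that exposes this field would work for us:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter   # illustrative name, whatever the operator creates
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9400"
        prometheus.io/path: /metrics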
Thanks
We will look into adding this in a future release.
+1 This would be super useful for us too
I can see daemonsets.annotations in the output of helm -n gpu-operator get values gpu-operator --all (app version v22.9.2). Is it intended for this use case?
When I declare these annotations in the chart values:
daemonsets:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "9400"
    prometheus.io/scrape: "true"
the chart deploys successfully, but the gpu-operator pod crashes with this error:
{"level":"info","ts":1678700104.895753,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1678700104.8987215,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1678700104.903535,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1678700104.9083395,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1678700104.9132628,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1678700104.9150171,"msg":"Observed a panic in reconciler: assignment to entry in nil map","controller":"clusterpolicy-controller","object":{"name":"cluster-policy"},"namespace":"","name":"cluster-policy","reconcileID":"558a2f1a-5f56-41fe-a896-23a7b965c55b"}
panic: assignment to entry in nil map [recovered]
panic: assignment to entry in nil map
goroutine 893 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118 +0x1f4
panic({0x1902300, 0x1df3cf0})
/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/NVIDIA/gpu-operator/controllers.applyCommonDaemonsetMetadata(...)
/workspace/controllers/object_controls.go:589
github.com/NVIDIA/gpu-operator/controllers.preProcessDaemonSet(0xc002288480, {{0x1e0e8f8, 0xc0011953e0}, 0xc000a34000, {0xc00004a053, 0xc}, {0xc00159a000, 0x10, 0x10}, {0xc0003c5680, ...}, ...})
/workspace/controllers/object_controls.go:567 +0xab8
github.com/NVIDIA/gpu-operator/controllers.DaemonSet({{0x1e0e8f8, 0xc0011953e0}, 0xc000a34000, {0xc00004a053, 0xc}, {0xc00159a000, 0x10, 0x10}, {0xc0003c5680, 0x10, ...}, ...})
/workspace/controllers/object_controls.go:3099 +0x4a5
github.com/NVIDIA/gpu-operator/controllers.(*ClusterPolicyController).step(0x2b80c40)
/workspace/controllers/state_manager.go:885 +0x136
github.com/NVIDIA/gpu-operator/controllers.(*ClusterPolicyReconciler).Reconcile(0xc0003e90e0, {0x1e0e8f8, 0xc0011953e0}, {{{0x0, 0x0}, {0xc000881d80, 0xe}}})
/workspace/controllers/clusterpolicy_controller.go:135 +0x4e5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1e0e850?, {0x1e0e8f8?, 0xc0011953e0?}, {{{0x0?, 0x1a78ee0?}, {0xc000881d80?, 0xc0013a35d0?}}})
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00022c8c0, {0x1e0e850, 0xc000b33080}, {0x1982860?, 0xc0009ace80?})
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320 +0x33c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00022c8c0, {0x1e0e850, 0xc000b33080})
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:230 +0x333
This crash does not happen without daemonsets.annotations in the chart values.
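For what it's worth, the panic ("assignment to entry in nil map") is the standard Go failure mode of writing into a map that was never initialized, presumably the pod template's annotations map when the template ships without any annotations. A minimal, hypothetical Go sketch of the pattern and the usual guard (none of these names are from the gpu-operator source):

package main

// applyAnnotations copies extra annotations into existing, guarding against
// the nil-map write that causes "assignment to entry in nil map".
// Hypothetical helper for illustration only; not the gpu-operator code.
func applyAnnotations(existing, extra map[string]string) map[string]string {
	// A nil map can be read from, but writing to it panics.
	if existing == nil {
		existing = make(map[string]string)
	}
	for k, v := range extra {
		existing[k] = v
	}
	return existing
}

func main() {
	var podAnnotations map[string]string // nil, as when the template has no annotations
	podAnnotations = applyAnnotations(podAnnotations, map[string]string{
		"prometheus.io/scrape": "true",
	})
	_ = podAnnotations
}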
Hi all! Does this work? Did you find anything that works on any of the newer releases?
dcgmExporter:
  podAnnotations:
The reported issue should be fixed in later releases. Please try out the latest version. Setting the daemonsets.annotations Helm parameter should be reflected on all DaemonSets that we create.
Closing this issue, as GPU Operator v23.3.0+ supports the daemonsets.annotations field for configuring custom annotations on all DaemonSets that the GPU Operator manages.
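For anyone landing here later, a minimal sketch of the values that should apply the annotations on v23.3.0+ (same Prometheus-style keys as the example above; note that daemonsets.annotations is applied to every DaemonSet the operator manages, not only the DCGM exporter):

daemonsets:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "9400"
    prometheus.io/scrape: "true"

After upgrading, you can confirm the annotations landed with something like kubectl -n gpu-operator get ds -o jsonpath='{.items[*].spec.template.metadata.annotations}' (adjust the namespace to wherever you installed the operator).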