datadog-agent
[CONTP-283] Should require also API Group (in addition to resource name) for generic metadata collection
What does this PR do?
This PR includes the resource api group in the configuration parameter for generic metadata collection.
In other words, instead of having DD_CLUSTER_AGENT_KUBE_METADATA_COLLECTION_RESOURCES = [deployments statefulsets nodes], we will now have DD_CLUSTER_AGENT_KUBE_METADATA_COLLECTION_RESOURCES = [apps/deployments apps/statefulsets /nodes]
Motivation
Avoid collisions in cases where the same resource name exists under different API groups. GKE is one example:
On GKE, the nodes resource exists under two different API groups:
- metrics.k8s.io
- "" (the empty API group, corresponding to the default core group in Kubernetes)
In this case, if the user asks to collect metadata of nodes, there is no way to know whether to collect metadata of nodes.metrics.k8s.io or of nodes.
This results in a conflict.
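The ambiguity can be illustrated with a minimal sketch. It is written in Python for brevity and is not the agent's Go code; the `DISCOVERED` list and the `groups_by_resource` and `resolve` helpers are hypothetical names introduced only for this example.

```python
from collections import defaultdict

# Hypothetical discovery snapshot: (api_group, resource) pairs such as a
# GKE cluster might serve. The empty string stands for the core API group.
DISCOVERED = [
    ("", "nodes"),
    ("metrics.k8s.io", "nodes"),
    ("apps", "deployments"),
]


def groups_by_resource(discovered):
    """Index the API groups that serve each resource name."""
    index = defaultdict(list)
    for group, resource in discovered:
        index[resource].append(group)
    return dict(index)


def resolve(resource, index):
    """Return the only group serving `resource`, or raise if there is not exactly one."""
    groups = index.get(resource, [])
    if len(groups) != 1:
        raise LookupError(
            f"cannot resolve {resource!r} from its name alone; candidate groups: {groups}"
        )
    return groups[0]


index = groups_by_resource(DISCOVERED)
print(resolve("deployments", index))  # served by a single group
try:
    resolve("nodes", index)
except LookupError as err:
    print(err)  # a bare "nodes" matches two groups
```

A bare `deployments` resolves because only one group serves it in this snapshot, while a bare `nodes` does not, which is why the group has to be part of the configuration value.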
Additional Notes
- With this change, the user can also specify the group version, if they wish, by using the format {group}/{version}/{resource}, for example apps/v1/deployments. When this format is used, the discovery client is not used to fill in the version, and the specified version is used as is.
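As a rough illustration of the two accepted shapes, the following Python sketch splits a configured entry into its group, optional version, and resource. It is not the agent's implementation: `ResourceSpec`, `parse_resource_spec`, and `parse_resource_list` are hypothetical names, and rejecting a bare resource name with an error is an assumption about how the legacy format might be handled.

```python
from typing import List, NamedTuple, Optional


class ResourceSpec(NamedTuple):
    group: str              # "" means the core (empty) API group
    version: Optional[str]  # None means "let discovery fill in the version"
    resource: str


def parse_resource_spec(spec: str) -> ResourceSpec:
    """Parse '{group}/{resource}' or '{group}/{version}/{resource}'."""
    parts = spec.split("/")
    if len(parts) == 2:
        group, resource = parts
        version = None
    elif len(parts) == 3:
        group, version, resource = parts
        if not version:
            raise ValueError(f"empty version in {spec!r}")
    else:
        # Assumption: a bare name such as "nodes" is rejected rather than guessed.
        raise ValueError(f"expected group/resource or group/version/resource, got {spec!r}")
    if not resource:
        raise ValueError(f"empty resource name in {spec!r}")
    return ResourceSpec(group, version, resource)


def parse_resource_list(value: str) -> List[ResourceSpec]:
    """Parse a space-separated list such as 'apps/deployments /nodes'."""
    return [parse_resource_spec(item) for item in value.split()]


for spec in parse_resource_list("apps/deployments apps/v1/statefulsets /nodes"):
    print(spec)
```

Note that `/nodes` parses to an empty group, which matches the convention of writing the core group as a leading slash.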
Possible Drawbacks / Trade-offs
Describe how to test/QA your changes
❗ For better validation, do this QA on GKE, because the issue was initially discovered there: GKE serves the same resource name under different API groups (see the Motivation section for more information) ❗
Deploy the cluster agent with the following helm file:
datadog:
  apiKeyExistingSecret: datadog-secret
  appKeyExistingSecret: datadog-secret
  kubelet:
    tlsVerify: false
clusterAgent:
  enabled: true
  replicas: 1
  env:
    - name: DD_CLUSTER_AGENT_KUBE_METADATA_COLLECTION_ENABLED
      value: "true"
    - name: DD_CLUSTER_AGENT_KUBE_METADATA_COLLECTION_RESOURCES
      value: "apps/deployments apps/daemonsets /nodes"
Ensure that metadata is collected successfully for deployments, daemonsets, and nodes.
kubectl exec <cluster-agent-pod> -- agent workload-list -v
=== Entity kubernetes_metadata sources(merged):[kubeapiserver] id: deployments/kube-system/kube-dns-autoscaler ===
----------- Entity ID -----------
Kind: kubernetes_metadata ID: deployments/kube-system/kube-dns-autoscaler
----------- Entity Meta -----------
Name: kube-dns-autoscaler
Namespace: kube-system
Annotations: deployment.kubernetes.io/revision:1
Labels: addonmanager.kubernetes.io/mode:Reconcile k8s-app:kube-dns-autoscaler kubernetes.io/cluster-service:true
----------- Resource -----------
apps/v1, Resource=deployments
===
=== Entity kubernetes_metadata sources(merged):[kubeapiserver] id: nodes//gke-adelhajhassan-default-pool-14a7bd1d-jnf2 ===
----------- Entity ID -----------
Kind: kubernetes_metadata ID: nodes//gke-adelhajhassan-default-pool-14a7bd1d-jnf2
----------- Entity Meta -----------
Name: gke-adelhajhassan-default-pool-14a7bd1d-jnf2
Namespace:
Annotations: node.gke.io/last-applied-node-taints: volumes.kubernetes.io/controller-managed-attach-detach:true container.googleapis.com/instance_id:3216393220270216000 csi.volume.kubernetes.io/nodeid:{"pd.csi.storage.gke.io":"projects/datadog-sandbox/zones/us-central1-c/instances/gke-adelhajhassan-default-pool-14a7bd1d-jnf2"} node.alpha.kubernetes.io/ttl:0 node.gke.io/last-applied-node-labels:cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=containerd,cloud.google.com/gke-cpu-scaling-level=2,cloud.google.com/gke-logging-variant=DEFAULT,cloud.google.com/gke-max-pods-per-node=110,cloud.google.com/gke-nodepool=default-pool,cloud.google.com/gke-os-distribution=cos,cloud.google.com/gke-provisioning=standard,cloud.google.com/gke-stack-type=IPV4,cloud.google.com/machine-family=e2,cloud.google.com/private-node=false
Labels: beta.kubernetes.io/arch:amd64 cloud.google.com/gke-boot-disk:pd-balanced cloud.google.com/gke-cpu-scaling-level:2 kubernetes.io/arch:amd64 topology.gke.io/zone:us-central1-c cloud.google.com/gke-max-pods-per-node:110 cloud.google.com/gke-nodepool:default-pool cloud.google.com/gke-provisioning:standard failure-domain.beta.kubernetes.io/region:us-central1 topology.kubernetes.io/zone:us-central1-c cloud.google.com/gke-container-runtime:containerd cloud.google.com/gke-logging-variant:DEFAULT cloud.google.com/gke-os-distribution:cos failure-domain.beta.kubernetes.io/zone:us-central1-c kubernetes.io/os:linux topology.kubernetes.io/region:us-central1 node.kubernetes.io/instance-type:e2-medium beta.kubernetes.io/instance-type:e2-medium beta.kubernetes.io/os:linux cloud.google.com/gke-stack-type:IPV4 cloud.google.com/machine-family:e2 cloud.google.com/private-node:false kubernetes.io/hostname:gke-adelhajhassan-default-pool-14a7bd1d-jnf2
----------- Resource -----------
/v1, Resource=nodes
===
=== Entity kubernetes_metadata sources(merged):[kubeapiserver] id: daemonsets/gmp-system/collector ===
----------- Entity ID -----------
Kind: kubernetes_metadata ID: daemonsets/gmp-system/collector
----------- Entity Meta -----------
Name: collector
Namespace: gmp-system
Annotations: components.gke.io/layer:addon
Labels: addonmanager.kubernetes.io/mode:Reconcile
----------- Resource -----------
apps/v1, Resource=daemonsets
===
Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM:
inv create-vm --pipeline-id=38263141 --os-family=ubuntu
Note: This applies to commit bfe0aefb
Regression Detector
Regression Detector Results
Run ID: e517c00c-f6f1-4afd-9c58-aedd17962133 Metrics dashboard Target profiles
Baseline: f350ef14a5ecfaf059e313b7d87460aa24460a81 Comparison: bfe0aefbf1d1406d960e58140502d2d705d74d82
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
No significant changes in experiment optimization goals
Confidence level: 90.00% Effect size tolerance: |Δ mean %| ≥ 5.00%
There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|---|---|---|---|---|---|
| ➖ | basic_py_check | % cpu utilization | +0.10 | [-2.55, +2.76] | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.00, +0.00] | Logs |
| ➖ | idle | memory utilization | -0.08 | [-0.11, -0.05] | Logs |
| ➖ | file_tree | memory utilization | -0.12 | [-0.20, -0.04] | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.60 | [-13.42, +12.23] | Logs |
| ➖ | pycheck_1000_100byte_tags | % cpu utilization | -0.62 | [-5.32, +4.08] | Logs |
| ➖ | otel_to_otel_logs | ingress throughput | -1.07 | [-1.88, -0.26] | Logs |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -1.23 | [-2.11, -0.35] | Logs |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
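The three criteria above can be sketched as a small predicate. This is an illustrative Python reading of the stated rules, not the Regression Detector's actual code; the function name `is_regression` and its parameters are hypothetical.

```python
def is_regression(delta_mean_pct, ci_low, ci_high, erratic=False, tolerance=5.0):
    """Apply the three stated criteria to one experiment's results."""
    big_enough = abs(delta_mean_pct) >= tolerance      # |Δ mean %| ≥ 5.00%
    ci_excludes_zero = ci_low > 0 or ci_high < 0       # CI does not contain zero
    return big_enough and ci_excludes_zero and not erratic


# Rows taken from the table above.
print(is_regression(-1.07, -1.88, -0.26))    # otel_to_otel_logs: effect too small
print(is_regression(-0.60, -13.42, 12.23))   # tcp_syslog_to_blackhole: CI contains zero
```

Both rows evaluate to `False`: the first has a confidence interval that excludes zero but an effect below the 5.00% tolerance, and the second fails both the size and the interval checks.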
Is this change backward compatible? That is, if I had DD_CLUSTER_AGENT_KUBE_METADATA_COLLECTION_RESOURCES = [deployments statefulsets nodes], would that still work as expected with the new code?
No, it is not backward compatible. However, this config option is not publicly documented and is not used in the Helm chart or in the operator, so nothing should break.
/merge
:steam_locomotive: MergeQueue: pull request added to the queue
The median merge time in main is 25m.
Use /merge -c to cancel this operation!