datadog-agent
[CONTINT-3439][language-detection] implement cleanup mechanism for language detection
ATTENTION
- This PR is better reviewed commit-by-commit
- Documentation is not yet updated. Will update it after review.
What does this PR do?
This PR implements a cleanup mechanism for language detection.
Motivation
Be able to clear languages when they have not been detected for a sufficient period of time.
Additional Notes
In order to perform cleanup, we need to clear expired languages from every component that stores them.
Detected languages are mainly found in 3 components:
- In the DCA's workloadmeta store (Injectable Languages and Detectable Languages)
- As annotations on Kubernetes resources (currently only on deployments)
- Locally in the DCA language detection handler, where they are stored in memory before being pushed to workloadmeta.
This is how cleanup works:
- The language detection client keeps a batch of languages detected on its node, aggregated by pod.
- It sends this batch to the DCA PLD API handler every `language_detection.cleanup.ttl_refresh_period`.
- When the handler receives the request from the PLD client, it refreshes the TTL of every language it keeps in memory by setting it to `time.Now() + language_detection.cleanup.language_ttl`.
Clean up in workloadmeta and in the PLD handler:
- Asynchronously, a goroutine runs every `language_detection.cleanup.period`. It scans the TTLs of the languages and removes (locally, in memory) any language that has expired. It then flushes the languages to workloadmeta by sending Push events that update the set of Detected Languages in workloadmeta deployment entities (a set or unset event depending on whether the deployment still has unexpired languages).
- Also asynchronously, another goroutine watches unset workloadmeta events of deployments coming from the `kubeapiserver` source. Whenever an unset event is received, it deletes the corresponding owner locally and pushes an event to workloadmeta to unset the detected languages of the deployment, ensuring that the whole entity is removed from workloadmeta after it has been deleted.
Clean up of language annotations:
- Annotations are cleaned up via the patcher. The patcher listens to all types of deployment workloadmeta events having the language detection handler as a source.
- When an unset event is received, it means that the detected languages have been cleared. Therefore, the patcher checks if the deployment still has some annotations and patches the deployment to remove the language annotations that are still attached to the deployment.
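The patcher's annotation check can be illustrated with a small sketch. The helper below is hypothetical (it is not the agent's actual implementation), assuming the `internal.dd.datadoghq.com/<container>.detected_langs` annotation format shown in the QA output:

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation keys look like:
// internal.dd.datadoghq.com/<container>.detected_langs: <languages>
const annotationPrefix = "internal.dd.datadoghq.com/"
const detectedLangsSuffix = ".detected_langs"

// languageAnnotationsToRemove returns the keys a patcher would strip
// from a deployment once its detected languages are unset, leaving
// unrelated annotations untouched.
func languageAnnotationsToRemove(annotations map[string]string) []string {
	var keys []string
	for k := range annotations {
		if strings.HasPrefix(k, annotationPrefix) && strings.HasSuffix(k, detectedLangsSuffix) {
			keys = append(keys, k)
		}
	}
	return keys
}

func main() {
	annotations := map[string]string{
		"deployment.kubernetes.io/revision":                             "1",
		"internal.dd.datadoghq.com/dummy-ruby-container.detected_langs": "ruby",
	}
	// Only the language annotation is selected for removal.
	fmt.Println(languageAnnotationsToRemove(annotations))
}
```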
Possible Drawbacks / Trade-offs
- Potential stress on the api server.
- We can improve telemetry by including metrics related to the cleanup process. We can do this in a separate PR once the algorithm is agreed upon and approved.
- For the cleanup mechanism to work, `language_detection.cleanup.language_ttl` should be sufficiently larger than `language_detection.cleanup.ttl_refresh_period`. If these values are close, cleanup might be flaky; keeping a sufficient gap between the two configs (a factor of 3 should be enough) avoids this.
- The set of cleanup parameters might require some tuning based on testing on larger clusters.
Describe how to test/QA your changes
To QA these changes, deploy the agent and cluster agent with the following:
- language detection enabled on both the agent and cluster agent
- process agent enabled and process collection enabled
- sufficient RBAC for the cluster agent to patch deployments
- admission controller enabled and `mutateUnlabelled` set to `true`
- the config params of the cleanup process set (you can use low values for a quicker QA)
Then create a deployment with some containers running dummy processes with supported languages (e.g. ruby, python, java, go).
Check for the following while randomly adding/removing containers from the deployment podspec and launching a rollout:
- The deployment is patched with the correct languages
- The deployment content of workloadmeta is always correct (shows the detected and injectable languages)
You can also verify that when a deployment is deleted, it is fully cleared from workloadmeta, and when recreated, language detection still works and restores the correct state after some time.
Here are the steps:
1- Deploy the agent and cluster agent using helm:

```yaml
datadog:
  apiKeyExistingSecret: datadog-secret
  appKeyExistingSecret: datadog-secret
  kubelet:
    tlsVerify: false
  processAgent:
    enabled: true
    processCollection: true
  env:
    - name: DD_LANGUAGE_DETECTION_ENABLED
      value: "true"
agents:
  telemetry: enabled
  containers:
    agent:
      env:
        - name: DD_LANGUAGE_DETECTION_CLEANUP_TTL_REFRESH_PERIOD
          value: "5s"
clusterAgent:
  enabled: true
  replicas: 1
  admissionController:
    enabled: true
    mutateUnlabelled: true
  env:
    - name: DD_LANGUAGE_DETECTION_ENABLED
      value: "true"
    - name: DD_LANGUAGE_DETECTION_CLEANUP_PERIOD
      value: "10s"
    - name: DD_LANGUAGE_DETECTION_CLEANUP_LANGUAGE_TTL
      value: "40s"
```
2- Allow the cluster agent to patch deployments by updating the cluster role: run `kubectl edit clusterrole datadog-agent-cluster-agent`, then add the `patch` action to the list of verbs for any `apiGroup` containing deployments as a resource.
3- Create a deployment with 2 ruby containers, 1 python container and 1 ubuntu container.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy-user-deployment-new
  labels:
    app: user-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-app
  template:
    metadata:
      labels:
        app: user-app
    spec:
      containers:
        - name: dummy-ruby-container-extra
          image: ruby:2.7-slim # You can replace this with the desired Ruby base image
          command: ["ruby", "-e", "loop { sleep 1000 }"] # Ruby script to sleep forever
        - name: dummy-ruby-container
          image: ruby:2.7-slim # You can replace this with the desired Ruby base image
          command: ["ruby", "-e", "loop { sleep 1000 }"] # Ruby script to sleep forever
        - name: python-process-container
          image: python:3.7 # Replace with your Python image
          command: ["python3", "-u", "your-python-script.py"]
          volumeMounts:
            - name: python-script-volume
              mountPath: /your-python-script.py
              subPath: your-python-script.py
        - name: ubuntu
          image: ubuntu
          command:
            - sleep
            - infinity
      volumes:
        - name: python-script-volume
          configMap:
            name: python-script-configmap
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: python-script-configmap
data:
  your-python-script.py: |
    while True:
      pass
```
4- Wait some time, and then verify that the languages are correctly set in workloadmeta and that the language annotations are patched on top of the deployment:
```shell
kubectl get deployment dummy-user-deployment-new -o yaml
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    internal.dd.datadoghq.com/dummy-ruby-container-extra.detected_langs: ruby
    internal.dd.datadoghq.com/dummy-ruby-container.detected_langs: ruby
    internal.dd.datadoghq.com/python-process-container.detected_langs: python
....
```
```shell
kubectl exec <cluster-agent-pod> -- agent workload-list
```
```
...
=== Entity kubernetes_deployment sources(merged):[kubeapiserver language_detection_server] id: default/dummy-user-deployment-new ===
----------- Entity ID -----------
Kind: kubernetes_deployment ID: default/dummy-user-deployment-new
----------- Unified Service Tagging -----------
Env :
Service :
Version :
----------- Injectable Languages -----------
Container python-process-container=>[python]
Container dummy-ruby-container-extra=>[ruby]
Container dummy-ruby-container=>[ruby]
----------- Detected Languages -----------
Container dummy-ruby-container=>[ruby]
Container dummy-ruby-container-extra=>[ruby]
Container python-process-container=>[python]
===
...
```
5- Now remove the python container and perform a rollout of the deployment. Wait some time and check the languages again; the Python language should disappear:
```shell
kubectl get deployment dummy-user-deployment-new -o yaml
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    internal.dd.datadoghq.com/dummy-ruby-container-extra.detected_langs: ruby
    internal.dd.datadoghq.com/dummy-ruby-container.detected_langs: ruby
....
```
```shell
kubectl exec <cluster-agent-pod> -- agent workload-list
```
```
...
=== Entity kubernetes_deployment sources(merged):[kubeapiserver language_detection_server] id: default/dummy-user-deployment-new ===
----------- Entity ID -----------
Kind: kubernetes_deployment ID: default/dummy-user-deployment-new
----------- Unified Service Tagging -----------
Env :
Service :
Version :
----------- Injectable Languages -----------
Container dummy-ruby-container-extra=>[ruby]
Container dummy-ruby-container=>[ruby]
----------- Detected Languages -----------
Container dummy-ruby-container=>[ruby]
Container dummy-ruby-container-extra=>[ruby]
===
...
```
6- Now remove the 2 ruby containers, and launch a rollout. Check the languages after some time, they should be fully cleared:
```shell
kubectl get deployment dummy-user-deployment-new -o yaml
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
....
```
```shell
kubectl exec <cluster-agent-pod> -- agent workload-list
```
```
...
=== Entity kubernetes_deployment sources(merged):[kubeapiserver language_detection_server] id: default/dummy-user-deployment-new ===
----------- Entity ID -----------
Kind: kubernetes_deployment ID: default/dummy-user-deployment-new
----------- Unified Service Tagging -----------
Env :
Service :
Version :
----------- Injectable Languages -----------
----------- Detected Languages -----------
===
...
```
7- Finally, delete the deployment and verify that it has been removed from workloadmeta. Then recreate it with all 4 containers, and ensure that the language annotations and the languages in workloadmeta are correctly set after some time.
Reviewer's Checklist
- [ ] If known, an appropriate milestone has been selected; otherwise the `Triage` milestone is set.
- [ ] Use the `major_change` label if your change either has a major impact on the code base, is impacting multiple teams or is changing important well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer facing change, use a release note.
- [ ] A release note has been added or the `changelog/no-changelog` label has been applied.
- [ ] Changed code has automated tests for its functionality.
- [ ] Adequate QA/testing plan information is provided, except if the `qa/skip-qa` label (with the required `qa/done` or `qa/no-code-change` labels) is applied.
- [ ] At least one `team/..` label has been applied, indicating the team(s) that should QA this change.
- [ ] If applicable, the docs team has been notified or an issue has been opened on the documentation repo.
- [ ] If applicable, the `need-change/operator` and `need-change/helm` labels have been applied.
- [ ] If applicable, the `k8s/<min-version>` label has been applied, indicating the lowest Kubernetes version compatible with this feature.
- [ ] If applicable, the config template has been updated.
Bloop Bleep... Dogbot Here
Regression Detector Results
Run ID: d96f8916-3066-4a02-b7a5-208ab042d6b6
Baseline: 294ae46153c39545b7e9517a65fed146f50827fb
Comparison: f8f9827d18d5ef7e12a044a8a8cf897e40cd78e8
Total CPUs: 7
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
No significant changes in experiment optimization goals
Confidence level: 90.00% Effect size tolerance: |Δ mean %| ≥ 5.00%
There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
Experiments ignored for regressions
Regressions in experiments with settings containing `erratic: true` are ignored.
| perf | experiment | goal | Δ mean % | Δ mean % CI |
|---|---|---|---|---|
| ➖ | file_to_blackhole | % cpu utilization | -0.72 | [-7.24, +5.81] |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI |
|---|---|---|---|---|
| ➖ | otel_to_otel_logs | ingress throughput | +1.96 | [+1.32, +2.60] |
| ➖ | process_agent_standard_check | memory utilization | +0.64 | [+0.60, +0.68] |
| ➖ | process_agent_standard_check_with_stats | memory utilization | +0.60 | [+0.56, +0.63] |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +0.26 | [+0.18, +0.34] |
| ➖ | process_agent_real_time_mode | memory utilization | +0.05 | [+0.02, +0.09] |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.00, +0.00] |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.00, +0.00] |
| ➖ | trace_agent_json | ingress throughput | -0.01 | [-0.05, +0.03] |
| ➖ | trace_agent_msgpack | ingress throughput | -0.04 | [-0.06, -0.02] |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -0.17 | [-1.60, +1.25] |
| ➖ | file_tree | memory utilization | -0.22 | [-0.30, -0.14] |
| ➖ | idle | memory utilization | -0.55 | [-0.58, -0.53] |
| ➖ | file_to_blackhole | % cpu utilization | -0.72 | [-7.24, +5.81] |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
Thanks @kkhor-datadog and @davidor for your comments.
My apologies for the dangling `fmt.Println`'s. I had them for debugging purposes and missed removing them. Will remove them.
/merge
:steam_locomotive: MergeQueue
Pull request added to the queue.
This build is next! (estimated merge in less than 49m)
Use `/merge -c` to cancel this operation!