
[CONTINT-3439][language-detection] implement cleanup mechanism for language detection

Open · adel121 opened this pull request 1 year ago · 2 comments

ATTENTION

  • This PR is better reviewed commit-by-commit
  • Documentation is not yet updated. Will update it after review.

What does this PR do?

This PR implements a cleanup mechanism for language detection.

Motivation

Be able to clear languages when they have not been detected for a sufficient period of time.

Additional Notes

To perform cleanup, we need to remove expired languages from every component that stores them.

Detected languages are mainly found in 3 components:

  • In DCA's workloadmeta (Injectable Languages and Detected Languages)
  • As annotations on Kubernetes resources (currently only on deployments)
  • Locally in the DCA language detection handler, where they are stored in memory before being pushed to workloadmeta.

This is how cleanup works:

  • The language detection client keeps a batch of the languages detected on its node, aggregated by pod.
  • It sends this batch to the DCA PLD API handler every language_detection.cleanup.ttl_refresh_period.
  • When the handler receives the request from the PLD client, it updates the TTL of all the languages it keeps in memory, setting it to time.Now() + language_detection.cleanup.language_ttl (a sketch of this bookkeeping follows this list).
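To make the TTL bookkeeping concrete, here is a minimal Go sketch of the idea. All names here are hypothetical and the real handler differs (for example in how owners, containers and batches are modelled); it only illustrates the deadline refresh described above.

package languagedetection

import (
    "sync"
    "time"
)

// ttlStore is a hypothetical stand-in for the PLD handler's in-memory state:
// for each owner (e.g. a deployment) it tracks, per container, the detected
// languages together with the deadline after which they expire.
type ttlStore struct {
    mu        sync.Mutex
    ttl       time.Duration                              // language_detection.cleanup.language_ttl
    deadlines map[string]map[string]map[string]time.Time // owner -> container -> language -> deadline
}

// refresh is called when a batch arrives from a node agent's PLD client:
// every language in the batch gets its deadline reset to time.Now() + TTL.
func (s *ttlStore) refresh(owner, container string, languages []string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.deadlines == nil {
        s.deadlines = map[string]map[string]map[string]time.Time{}
    }
    if s.deadlines[owner] == nil {
        s.deadlines[owner] = map[string]map[string]time.Time{}
    }
    if s.deadlines[owner][container] == nil {
        s.deadlines[owner][container] = map[string]time.Time{}
    }
    deadline := time.Now().Add(s.ttl)
    for _, lang := range languages {
        s.deadlines[owner][container][lang] = deadline
    }
}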

Cleanup in workloadmeta and in the PLD handler:

  • Asynchronously, a goroutine runs every language_detection.cleanup.period. It scans the TTLs of the stored languages and removes (locally, in memory) any language that has expired. It then flushes the result to workloadmeta by pushing events that update the set of Detected Languages on the deployment entities (a set or unset event, depending on whether the deployment still has unexpired languages). A sketch of this scan-and-flush loop follows this list.
  • Also asynchronously, another goroutine watches unset workloadmeta events for deployments coming from the kubeapiserver source. Whenever an unset event is received, it deletes the corresponding owner locally and pushes an event to workloadmeta that unsets the detected languages of the deployment, ensuring that the whole entity is removed from workloadmeta after the deployment has been deleted.
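Continuing the hypothetical ttlStore sketch above, the scan-and-flush loop of the first goroutine could look roughly like this. The flush callback stands in for building and pushing the actual workloadmeta set/unset events; again, this is an illustration rather than the real cleanup code.

// cleanupExpired drops every language whose deadline has passed and returns
// the owners whose language set changed, so the caller can emit a set event
// (the owner still has live languages) or an unset event (it is now empty).
func (s *ttlStore) cleanupExpired(now time.Time) (dirty []string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    for owner, containers := range s.deadlines {
        changed := false
        for container, langs := range containers {
            for lang, deadline := range langs {
                if now.After(deadline) {
                    delete(langs, lang)
                    changed = true
                }
            }
            if len(langs) == 0 {
                delete(containers, container)
            }
        }
        if len(containers) == 0 {
            delete(s.deadlines, owner)
        }
        if changed {
            dirty = append(dirty, owner)
        }
    }
    return dirty
}

// startCleanup runs the scan every language_detection.cleanup.period until
// stop is closed; flush is responsible for pushing the workloadmeta events.
func (s *ttlStore) startCleanup(period time.Duration, stop <-chan struct{}, flush func([]string)) {
    go func() {
        ticker := time.NewTicker(period)
        defer ticker.Stop()
        for {
            select {
            case now := <-ticker.C:
                flush(s.cleanupExpired(now))
            case <-stop:
                return
            }
        }
    }()
}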

Cleanup of language annotations:

  • Annotations are cleaned up via the patcher. The patcher listens to all types of deployment workloadmeta events that have the language detection handler as a source.
  • When an unset event is received, it means that the detected languages have been cleared. The patcher therefore checks whether the deployment still carries language annotations and, if so, patches the deployment to remove them (see the sketch after this list).
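The annotation removal itself can be pictured as follows. The annotation key format matches what the QA output further down shows (internal.dd.datadoghq.com/<container>.detected_langs); the helper names are hypothetical, and a strategic merge patch with null values is just one way to express the removal, not necessarily what the patcher actually uses.

package patcher

import (
    "encoding/json"
    "strings"
)

// isDetectedLangsAnnotation reports whether an annotation key looks like a
// language annotation, e.g. "internal.dd.datadoghq.com/my-container.detected_langs".
func isDetectedLangsAnnotation(key string) bool {
    return strings.HasPrefix(key, "internal.dd.datadoghq.com/") &&
        strings.HasSuffix(key, ".detected_langs")
}

// buildRemovalPatch returns a strategic merge patch that clears every language
// annotation still present on the deployment (a null value deletes the key),
// or false if there is nothing to remove.
func buildRemovalPatch(annotations map[string]string) ([]byte, bool) {
    toRemove := map[string]interface{}{}
    for key := range annotations {
        if isDetectedLangsAnnotation(key) {
            toRemove[key] = nil
        }
    }
    if len(toRemove) == 0 {
        return nil, false
    }
    patch, err := json.Marshal(map[string]interface{}{
        "metadata": map[string]interface{}{"annotations": toRemove},
    })
    return patch, err == nil
}

Such a patch could then be applied with client-go, for example via clientset.AppsV1().Deployments(namespace).Patch(ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{}).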

Possible Drawbacks / Trade-offs

  • Potential stress on the API server.
  • Telemetry can be improved by adding metrics for the cleanup process; this can be done in a separate PR once the algorithm is agreed upon and approved.
  • For the cleanup mechanism to work, language_detection.cleanup.language_ttl must be sufficiently larger than language_detection.cleanup.ttl_refresh_period. If the two values are close, the behavior can be flaky; keeping a sufficient gap between the two settings avoids this, and a factor of 3 should be enough.
  • The cleanup parameters might require some tuning based on testing on larger clusters.

Describe how to test/QA your changes

To QA these changes, deploy the agent and cluster agent with the following:

  • language detection enabled on both the agent and the cluster agent
  • the process agent enabled, with process collection enabled
  • sufficient RBAC for the cluster agent to patch deployments
  • the admission controller enabled and mutateUnlabelled set to true
  • the cleanup config params set (you can use low values for quicker QA)

Then create a deployment with some containers running dummy processes with supported languages (e.g. ruby, python, java, go).

Check for the following while randomly adding/removing containers from the deployment podspec and launching a rollout:

  • The deployment is patched with the correct languages
  • The deployment entity in workloadmeta is always correct (it shows the detected and injectable languages)

You can also verify that when a deployment is deleted, it is fully cleared from workloadmeta, and when recreated, language detection still works and restores the correct state after some time.

Here are the steps:

1- Deploy the agent and cluster agent using helm:

datadog:
  apiKeyExistingSecret: datadog-secret
  appKeyExistingSecret: datadog-secret
  kubelet:
    tlsVerify: false
  processAgent:
    enabled: true
    processCollection: true
  env:
    - name: DD_LANGUAGE_DETECTION_ENABLED
      value: "true"

agents:
  telemetry: enabled
  containers:
    agent:
      env:
        - name: DD_LANGUAGE_DETECTION_CLEANUP_TTL_REFRESH_PERIOD
          value: "5s"

clusterAgent:
  enabled: true
  replicas: 1
  admissionController:
    enabled: true
    mutateUnlabelled: true
  env:
    - name: DD_LANGUAGE_DETECTION_ENABLED
      value: "true"
    - name: DD_LANGUAGE_DETECTION_CLEANUP_PERIOD
      value: "10s"
    - name: DD_LANGUAGE_DETECTION_CLEANUP_LANGUAGE_TTL
      value: "40s"

2- Allow the cluster agent to patch deployments by updating the clusterrole. Run kubectl edit clusterrole datadog-agent-cluster-agent, then add the patch verb to any rule whose apiGroups and resources include deployments.

3- Create a deployment with 2 ruby containers, 1 python container and 1 ubuntu container.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy-user-deployment-new
  labels:
    app: user-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-app
  template:
    metadata:
      labels:
        app: user-app
    spec:
      containers:
        - name: dummy-ruby-container-extra
          image: ruby:2.7-slim # You can replace this with the desired Ruby base image
          command: ["ruby", "-e", "loop { sleep 1000 }"] # Ruby script to sleep forever
        - name: dummy-ruby-container
          image: ruby:2.7-slim # You can replace this with the desired Ruby base image
          command: ["ruby", "-e", "loop { sleep 1000 }"] # Ruby script to sleep forever
        - name: python-process-container
          image: python:3.7 # Replace with your Python image
          command: ["python3", "-u", "your-python-script.py"]
          volumeMounts:
            - name: python-script-volume
              mountPath: /your-python-script.py
              subPath: your-python-script.py
        - name: ubuntu
          image: ubuntu
          command:
            - sleep
            - infinity
      volumes:
        - name: python-script-volume
          configMap:
            name: python-script-configmap
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: python-script-configmap
data:
  your-python-script.py: |
    while True:
      pass

4- Wait some time, then verify that the languages are correctly set in workloadmeta and that the language annotations have been patched onto the deployment:

kubectl get deployment dummy-user-deployment-new -o yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    internal.dd.datadoghq.com/dummy-ruby-container-extra.detected_langs: ruby
    internal.dd.datadoghq.com/dummy-ruby-container.detected_langs: ruby
    internal.dd.datadoghq.com/python-process-container.detected_langs: python
....
kubectl exec <cluster-agent-pod> -- agent workload-list

...
=== Entity kubernetes_deployment sources(merged):[kubeapiserver language_detection_server] id: default/dummy-user-deployment-new ===
----------- Entity ID -----------
Kind: kubernetes_deployment ID: default/dummy-user-deployment-new

----------- Unified Service Tagging -----------
Env : 
Service : 
Version : 
----------- Injectable Languages -----------
Container python-process-container=>[python]
Container dummy-ruby-container-extra=>[ruby]
Container dummy-ruby-container=>[ruby]
----------- Detected Languages -----------
Container dummy-ruby-container=>[ruby]
Container dummy-ruby-container-extra=>[ruby]
Container python-process-container=>[python]
===
...

5- Now remove the python container and perform a rollout of the deployment. Wait some time and check the languages again; the python language should disappear:

kubectl get deployment dummy-user-deployment-new -o yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    internal.dd.datadoghq.com/dummy-ruby-container-extra.detected_langs: ruby
    internal.dd.datadoghq.com/dummy-ruby-container.detected_langs: ruby
....
kubectl exec <cluster-agent-pod> -- agent workload-list

...
=== Entity kubernetes_deployment sources(merged):[kubeapiserver language_detection_server] id: default/dummy-user-deployment-new ===
----------- Entity ID -----------
Kind: kubernetes_deployment ID: default/dummy-user-deployment-new

----------- Unified Service Tagging -----------
Env : 
Service : 
Version : 
----------- Injectable Languages -----------
Container dummy-ruby-container-extra=>[ruby]
Container dummy-ruby-container=>[ruby]
----------- Detected Languages -----------
Container dummy-ruby-container=>[ruby]
Container dummy-ruby-container-extra=>[ruby]
===
...

6- Now remove the 2 ruby containers and launch a rollout. Check the languages after some time; they should be fully cleared:

kubectl get deployment dummy-user-deployment-new -o yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
....
kubectl exec <cluster-agent-pod> -- agent workload-list

...
=== Entity kubernetes_deployment sources(merged):[kubeapiserver language_detection_server] id: default/dummy-user-deployment-new ===
----------- Entity ID -----------
Kind: kubernetes_deployment ID: default/dummy-user-deployment-new

----------- Unified Service Tagging -----------
Env : 
Service : 
Version : 
----------- Injectable Languages -----------
----------- Detected Languages -----------
===
...

7- Finally, delete the deployment and verify that it has been removed from workloadmeta. Then recreate it with all 4 containers and ensure that the language annotations and the languages in workloadmeta are correctly set after some time.

Reviewer's Checklist

  • [ ] If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • [ ] Use the major_change label if your change either has a major impact on the code base, impacts multiple teams, or changes important well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a release note.
  • [ ] A release note has been added or the changelog/no-changelog label has been applied.
  • [ ] Changed code has automated tests for its functionality.
  • [ ] Adequate QA/testing plan information is provided, unless the qa/skip-qa label is applied together with the required qa/done or qa/no-code-change label.
  • [ ] At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • [ ] If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • [ ] If applicable, the need-change/operator and need-change/helm labels have been applied.
  • [ ] If applicable, the k8s/<min-version> label has been applied, indicating the lowest Kubernetes version compatible with this feature.
  • [ ] If applicable, the config template has been updated.

adel121 · Feb 13 '24 23:02

Bloop Bleep... Dogbot Here

Regression Detector Results

Run ID: d96f8916-3066-4a02-b7a5-208ab042d6b6
Baseline: 294ae46153c39545b7e9517a65fed146f50827fb
Comparison: f8f9827d18d5ef7e12a044a8a8cf897e40cd78e8
Total CPUs: 7

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI
file_to_blackhole % cpu utilization -0.72 [-7.24, +5.81]

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI
otel_to_otel_logs ingress throughput +1.96 [+1.32, +2.60]
process_agent_standard_check memory utilization +0.64 [+0.60, +0.68]
process_agent_standard_check_with_stats memory utilization +0.60 [+0.56, +0.63]
tcp_syslog_to_blackhole ingress throughput +0.26 [+0.18, +0.34]
process_agent_real_time_mode memory utilization +0.05 [+0.02, +0.09]
uds_dogstatsd_to_api ingress throughput +0.00 [-0.00, +0.00]
tcp_dd_logs_filter_exclude ingress throughput +0.00 [-0.00, +0.00]
trace_agent_json ingress throughput -0.01 [-0.05, +0.03]
trace_agent_msgpack ingress throughput -0.04 [-0.06, -0.02]
uds_dogstatsd_to_api_cpu % cpu utilization -0.17 [-1.60, +1.25]
file_tree memory utilization -0.22 [-0.30, -0.14]
idle memory utilization -0.55 [-0.58, -0.53]
file_to_blackhole % cpu utilization -0.72 [-7.24, +5.81]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

pr-commenter[bot] · Feb 14 '24 00:02

Thanks @kkhor-datadog and @davidor for your comments

My apologies for the dangling fmt.Println's. I had them for debugging purposes and missed removing them.

Will remove them

adel121 · Feb 15 '24 14:02

/merge

adel121 · Feb 16 '24 10:02

🚂 MergeQueue

Pull request added to the queue.

This build is next! (estimated merge in less than 49m)

Use /merge -c to cancel this operation!

dd-devflow[bot] · Feb 16 '24 10:02