[Backport 7.57.x] auto-instrumentation: increase default memory request to 100Mi

Open agent-platform-auto-pr[bot] opened this issue 1 year ago • 1 comments

Backport 023ef5945eaed5aea081f20a7788a68c0107f816 from #29392.


What does this PR do?

Increases the default memory request for the auto instrumentation container from 20Mi to 100Mi.

100Mi is closer to the recommended minimum memory requirements for Alpine.

Motivation

https://datadoghq.atlassian.net/browse/APMON-1472

Right now, the containers may get OOM-killed when copying lib injection files from the lib injection container to the application container. This memory usage comes from the cp command's use of sendfile and correlates with the total number of files being copied.
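Since the memory usage is tied to the number of files copied, it can help to check how many files a given lib injection image ships. A minimal sketch follows; the `count_files` helper is hypothetical, and the commented `docker run` usage assumes the image provides `sh`, `find`, and `wc` and that the files to copy live under `/datadog-init`.

```shell
#!/bin/sh
# Hypothetical helper: print the number of regular files under a directory.
# Assumes file names do not contain newlines (each would be counted twice).
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Possible usage against the image from this PR (assumes sh/find/wc exist in it):
#   docker run --rm --entrypoint sh gcr.io/datadoghq/dd-lib-python-init:2.12 \
#     -c 'find /datadog-init -type f | wc -l'
```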

From @levan-m:

From my perspective, making this change in Agent is cleaner. Memory increase will be tied to a specific Agent version upgrade. On the other hand, Operator config may change while Agent version is pinned (or vice versa). This could create a situation where Operator overrides default value in future Agent versions, or sets higher limit on older installations which work fine with 20mb.

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Testing limits manually.

Before:

$ docker run -it --rm  --memory=20Mib --memory-swap=20Mib -v "$(pwd):/out" gcr.io/datadoghq/dd-lib-python-init:2.12 /datadog-init/copy-lib.sh /out/
Killed

After:

$ docker run -it --rm  --memory=100Mib --memory-swap=100Mib -v "$(pwd):/out" gcr.io/datadoghq/dd-lib-python-init:2.12 /datadog-init/copy-lib.sh /out/
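The before/after check can be turned into a small sweep to find the lowest limit at which the copy succeeds. This is a sketch, not part of the PR: `try_limits` and `run_copy` are hypothetical helpers, `-it` is dropped so the command works without a TTY, and it assumes the container exits non-zero when the copy is killed.

```shell
#!/bin/sh
# try_limits CMD LIMIT...
# Run "CMD LIMIT" for each limit in order and print the first limit for
# which CMD exits 0. Returns 1 if none of the limits succeed.
try_limits() {
  cmd=$1
  shift
  for limit in "$@"; do
    if "$cmd" "$limit" >/dev/null 2>&1; then
      echo "$limit"
      return 0
    fi
  done
  return 1
}

# run_copy LIMIT
# Run the copy script from this PR under the given memory limit.
# Assumption: a killed copy makes the container exit with a non-zero status.
run_copy() {
  docker run --rm --memory="$1" --memory-swap="$1" -v "$(pwd):/out" \
    gcr.io/datadoghq/dd-lib-python-init:2.12 /datadog-init/copy-lib.sh /out/
}

# Possible usage:
#   try_limits run_copy 20m 40m 60m 80m 100m
```

Keeping the sweep logic separate from the `docker run` call makes it easy to reuse with the other lib injection images.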

This change can also be validated by manually updating the Cluster Agent configuration:

Manually setting the resource request:

datadog:
  apiKey: <API-KEY>
  site: datadoghq.com
  tags:
    - env:<ENV>
  apm:
    instrumentation:
      enabled: true
      libVersions:
        java: "1"
        dotnet: "3"
        python: "2"
        js: "5"
        ruby: "2"
clusterAgent:
  env:
    - name: DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_INIT_RESOURCES_MEMORY
      value: 100Mi

Using a custom built image with the changes:

datadog:
  apiKey: <API-KEY>
  site: datadoghq.com
  tags:
    - env:<ENV>
  apm:
    instrumentation:
      enabled: true
      libVersions:
        java: "1"
        dotnet: "3"
        python: "2"
        js: "5"
        ruby: "2"
clusterAgent:
  image:
    name: <user>/cluster_agent
    tag: master
    repository: docker.io/<user>/cluster_agent
    doNotCheckTag: true

helm upgrade datadog-agent -f datadog-values.yml datadog/datadog

Sample app configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kyle-django-app
  labels:
    app: python-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: python-app
  template:
    metadata:
      labels:
        app: python-app
    spec:
      containers:
      - name: app
        image: ddkverhoog/django-helloworld:v1.0.1
        env:
          - name: DD_TRACE_DEBUG
            value: "true"
        readinessProbe:
          timeoutSeconds: 1
          successThreshold: 1
          failureThreshold: 1
          httpGet:
            host:
            scheme: HTTP
            path: /
            port: 18080
          initialDelaySeconds: 30
          periodSeconds: 1
        ports:
          - containerPort: 18080
            protocol: TCP
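Once the sample app pod is running, one way to confirm the new default took effect is to compare the memory request on its init containers against the expected value. The sketch below is an assumption-laden illustration: `check_init_memory` is a hypothetical helper, `$POD` stands for the sample app's pod name, and it treats every init container on the pod as an injected one.

```shell
#!/bin/sh
# check_init_memory EXPECTED
# Read "name<TAB>memory" lines on stdin and print every init container whose
# memory request differs from EXPECTED. Exits non-zero if any differ.
check_init_memory() {
  awk -F '\t' -v want="$1" '
    NF == 0 { next }
    $2 != want {
      printf "%s: got %s, want %s\n", $1, ($2 == "" ? "<unset>" : $2), want
      bad = 1
    }
    END { exit bad + 0 }
  '
}

# Possible usage ($POD is a placeholder for the sample app pod name):
#   kubectl get pod "$POD" -o jsonpath='{range .spec.initContainers[*]}{.name}{"\t"}{.resources.requests.memory}{"\n"}{end}' \
#     | check_init_memory 100Mi
```

If the pod carries unrelated init containers, their lines would need to be filtered out before the check.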

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=44544697 --os-family=ubuntu

Note: This applies to commit 3726439e

pr-commenter[bot], Sep 17 '24 21:09

/merge

davidor, Sep 18 '24 07:09

:steam_locomotive: MergeQueue: pull request added to the queue

The median merge time in 7.57.x is 29m.

Use /merge -c to cancel this operation!

dd-devflow[bot], Sep 18 '24 07:09