
Datadog Agent Source Regression in v0.24.x

jonwinton opened this issue 3 years ago • 51 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We noticed the following issue when upgrading from v0.23.3 to v0.24.0:

(1) CPU became spiky/less consistent

The charts below show how CPU explodes, Datadog forwarder error rates climb, and HAProxy 5xx rates climb:

Screenshot 2022-11-17 at 3 24 52 PM

(2) 504 error codes coming from the Datadog Agents writing to Vector, only for the /api/beta/sketches endpoint

2022-11-15 22:52:12 UTC | CORE | ERROR | (pkg/forwarder/worker.go:184 in process) | Error while processing transaction: error "504 Gateway Time-out" while sending transaction to "http://vector-haproxy.vector.svc.cluster.local:6000/api/beta/sketches", rescheduling it: "<html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond in time.\n</body></html>\n"

This is also apparent in errors surfacing from HAProxy (deployed via the Vector Helm chart). HAProxy is using a leastconn balance strategy.

(3) Error logs coming from Vector for shutting down connections

{"host":"vector-599576bd9b-w2bq7","message":"error shutting down IO: Transport endpoint is not connected (os error 107)","metadata":{"kind":"event","level":"DEBUG","module_path":"hyper::proto::h1::conn","target":"hyper::proto::h1::conn"},"pid":1,"source_ty
pe":"internal_logs","timestamp":"2022-11-16T20:11:46.533762489Z"}
{"host":"vector-599576bd9b-w2bq7","message":"connection error: error shutting down connection: Transport endpoint is not connected (os error 107)","metadata":{"kind":"event","level":"DEBUG","module_path":"hyper::server::server::new_svc","target":"hyper::se
rver::server::new_svc"},"pid":1,"source_type":"internal_logs","timestamp":"2022-11-16T20:11:46.533777847Z"}

Configuration

data_dir: /vector-data-dir

api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false

sources:
  internal_logs:
    type: internal_logs

  # Datadog Agent telemetry
  datadog_agent:
    type: datadog_agent
    address: "0.0.0.0:6000"
    multiple_outputs: true # To automatically separate metrics and logs

sinks:
  console:
    type: console
    inputs:
      - internal_logs
    target: stdout
    encoding:
      codec: json

  # Datadog metrics output
  datadog_metrics:
    type: datadog_metrics
    inputs:
      - <inputs>...
    api_key: "${DATADOG_API_KEY}"

Version

0.24.0-distroless-libc

Debug Output

I can only recreate this issue in critical environments where I can't capture this output information :(

Example Data

I'm not sure what the Datadog Agent is sending to this endpoint

Additional Context

We're running in AWS EKS 1.21.

References

No response

jonwinton avatar Nov 17 '22 23:11 jonwinton

We're looking into this, will keep you posted, @jonwinton !

barieom avatar Nov 18 '22 22:11 barieom

Thanks @barieom! I wanted to add this screenshot of steady-state CPU usage. The orange line is the CPU request, the red line is the CPU limit. Looking at network traffic, there are no outliers in terms of higher network load on a single pod.

Screenshot 2022-11-19 at 8 37 09 AM

Let me know if I can supply anything else to help!

jonwinton avatar Nov 19 '22 16:11 jonwinton

Thanks for filing this issue @jonwinton.

It looks like the Vector Helm chart was used to configure Vector.

Would you be able to provide the complete helm chart values used, and the chart version you are using?

This would expedite the investigation and reproduction effort.

neuronull avatar Nov 21 '22 17:11 neuronull

Definitely @neuronull !

Chart Version: 0.14.0. One comment on this: we're blocked from upgrading because our current EKS version doesn't have the autoscaling APIs required by current versions of the chart. We're in the process of updating, but it'll be a while.

We use Argo to deploy the chart, but we have a fun Helm-args-combining pipeline, so I need some time to pull the full config together.

jonwinton avatar Nov 21 '22 17:11 jonwinton

@neuronull here we go:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - vector
          topologyKey: failure-domain.beta.kubernetes.io/zone
        weight: 100
autoscaling:
  enabled: true
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 60
commonLabels:
  tags.datadoghq.com/service: vector
  tags.datadoghq.com/version: "0.24.0"
env:
- name: DATADOG_API_KEY
  valueFrom:
    secretKeyRef:
      key: ...
      name: ...
- name: DD_HOSTNAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
- name: VECTOR_POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
- name: VECTOR_VERSION
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['tags.datadoghq.com/version']
extraVolumeMounts:
- mountPath: /mnt/secrets-store
  name: vector-secrets
  readOnly: true
extraVolumes:
- csi:
    driver: secrets-store.csi.k8s.io
    readOnly: true
    volumeAttributes:
      secretProviderClass: ...
  name: vector-secrets
podDisruptionBudget:
  enabled: true
  minAvailable: 10%
podLabels:
  tags.datadoghq.com/service: vector
podPriorityClassName: system-cluster-critical
role: Stateless-Aggregator
rollWorkload: true
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: <role id>
  create: true
updateStrategy:
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1
  type: RollingUpdate
haproxy:
  enabled: true
  image:
    tag: &version 2.6.6
  resources:
    requests:
      cpu: 1000m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 1Gi

  autoscaling:
    enabled: true
    minReplicas: 9
    maxReplicas: 50

  podLabels:
    tags.datadoghq.com/service: vector-haproxy
    tags.datadoghq.com/version: *version

  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - vector
            topologyKey: failure-domain.beta.kubernetes.io/zone
          weight: 100

  podAnnotations:
    ad.datadoghq.com/haproxy.checks: |
      {
        "haproxy": {
          "init_config": {},
          "instances": [{
            "use_openmetrics": true,
            "openmetrics_endpoint": "http://%%host%%:1024/metrics",
            "histogram_buckets_as_distributions": true,
            "collect_counters_with_distributions": true
          }]
        }
      }

  customConfig: |
    global
      log stdout format raw local0
      maxconn 4096
      stats socket /tmp/haproxy
      hard-stop-after {{ .Values.haproxy.terminationGracePeriodSeconds }}s

    defaults
      log     global
      option  dontlognull
      retries 10
      option  redispatch
      option  allbackups
      timeout client 10s
      timeout server 10s
      timeout connect 5s

    resolvers coredns
      nameserver dns1 kube-dns.kube-system.svc.cluster.local:53
      resolve_retries 3
      timeout resolve 2s
      timeout retry 1s
      accepted_payload_size 8192
      hold valid 10s
      hold obsolete 60s

    frontend stats
      mode http
      bind :::1024
      option httplog
      http-request use-service prometheus-exporter if { path /metrics }
      stats enable
      stats hide-version  # Hide HAProxy version
      stats realm Haproxy\ Statistics  # Title text for popup window
      stats uri /haproxy_stats  # Stats URI
      stats refresh 10s


    frontend datadog-agent
      mode http
      bind :::6000
      option httplog
      option dontlog-normal
      default_backend datadog-agent

    backend datadog-agent
      mode http
      balance leastconn
      option tcp-check
      server-template srv 10 _datadog-agent._tcp.{{ include "vector.fullname" $ }}.{{ $.Release.Namespace }}.svc.cluster.local resolvers coredns check

jonwinton avatar Nov 21 '22 18:11 jonwinton

Chart Version: 0.14.0. One comment on this: we're blocked from upgrading because our current EKS version doesn't have the autoscaling APIs required by current versions of the chart. We're in the process of updating, but it'll be a while.

Hey @jonwinton, IIRC there's nothing really stopping us from backporting the HPA options to older APIs. I think I stopped where I did because there was some amount of structural change between API versions, and I figured if someone needed the support we'd get an issue opened.

We'd be happy to get an issue opened for that support if that would help you upgrade.

spencergilbert avatar Nov 21 '22 19:11 spencergilbert

@neuronull here we go:

Thanks @jonwinton !

Forgot to ask as well: what version of the Datadog Agent are you using? Was it also upgraded or did its version stay the same? Can you share the DD Agent config as well?

Cheers~

neuronull avatar Nov 21 '22 21:11 neuronull

Still investigating this but just had an observation.

In v0.24.0, #13406 contained a fix in the datadog_metrics sink: the sink was not properly using the DD API key from the event metadata. Because your configuration uses the datadog_agent source and the datadog_metrics sink, and the datadog_agent source forwards the API key by default, v0.24.0 introduced a change in behavior for your config.

I'm wondering if the problem here is that the DD API key being sent to Vector from the Datadog Agent's configuration is different from the one configured in the Helm chart via the secrets (or is otherwise overriding it)?

By default, the datadog_agent source is configured to store the API key if it is found in the headers from the Datadog Agent requests.

To disable that, set store_api_key to false. https://vector.dev/docs/reference/configuration/sources/datadog_agent/#store_api_key
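
For reference, that would look roughly like this on the source defined in your config (a sketch, not a drop-in config):

sources:
  datadog_agent:
    type: datadog_agent
    address: "0.0.0.0:6000"
    multiple_outputs: true
    store_api_key: false # don't store/forward the Agent-supplied API key; the sink's api_key is used instead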

Would you mind giving that a try @jonwinton ? Thanks!

neuronull avatar Nov 21 '22 22:11 neuronull

Sorry for the delay @neuronull and @spencergilbert ! Had some holiday time off and everything. Let me get you that info.

@neuronull I did test setting store_api_key and it didn't help.

jonwinton avatar Nov 28 '22 16:11 jonwinton

Forgot to ask as well- what version of the Datadog Agent are you using? Was it also upgraded or did it's version stay the same? Can you share the DD Agent config as well?

@neuronull the version of the Datadog Agent has been locked to 7.39.1 for the duration of this test.

We'd be happy to get an issue opened for that support if that would help you upgrade.

@spencergilbert I think we're going to be stuck on the autoscaling/v1 API for the next 3-6 months, so if it's possible to support those versions that would be amazing 🙇

jonwinton avatar Nov 28 '22 17:11 jonwinton

@jonwinton I see you mentioned you tried setting store_api_key. To clarify, did you try setting it to false? That is: store_api_key: false on the datadog_agent source. As @neuronull pointed out, there was a fix in 0.24.0 to start using, by default, the API key that the Agent sent metrics with when forwarding to Datadog, to match the behavior of logs.

Also, is that the complete Helm configuration? I expected to see the Vector configuration in it, but I just see the HAProxy configuration.

jszwedko avatar Nov 29 '22 18:11 jszwedko

@jszwedko yeah, tried setting it to false and it didn't change anything.

That is the full configuration other than the Vector configuration in the original issue body, which I'll re-paste below:

data_dir: /vector-data-dir

api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false

sources:
  internal_logs:
    type: internal_logs

  # Datadog Agent telemetry
  datadog_agent:
    type: datadog_agent
    address: "0.0.0.0:6000"
    multiple_outputs: true # To automatically separate metrics and logs
    store_api_key: false

sinks:
  console:
    type: console
    inputs:
      - internal_logs
    target: stdout
    encoding:
      codec: json

  # Datadog metrics output
  datadog_metrics:
    type: datadog_metrics
    inputs:
      - <inputs>...
    api_key: "${DATADOG_API_KEY}"

jonwinton avatar Nov 29 '22 21:11 jonwinton

In v0.24.0 #13406 contained a fix in the datadog_metrics sink.

Just correcting myself: the change I was pointing out was included in v0.23.0, so that whole line of inquiry that followed is invalid.

neuronull avatar Dec 02 '22 23:12 neuronull

Hi @jonwinton,

I've been trying to reproduce this locally and am not having much luck so far. :disappointed: Will summarize what I've tried and some observations, and maybe something will jump out at you.

Additional questions for you:

  1. Do you have any additional log snippets you could provide? For example, is Vector logging any errors (not debug logs)? And perhaps the error logs from HAProxy?
  2. Since we're having trouble reproducing it, would you be open to trying some nightly builds between v0.23.3 and 0.24.0 and observing the first build that exhibits the issue (e.g. by pinning the chart's image tag, as sketched after this list)? This could give us hints about which commits might have caused the issue and additionally raise confidence in the issue.
  3. Have you attempted this upgrade from 0.23.3 to 0.24.0 multiple times or just once?
  4. Have you performed similar upgrades of Vector in this environment before?
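
Regarding (2): pinning a specific nightly in the Helm values would look roughly like this, assuming the chart's standard image keys (the tag below is just an example of the nightly naming scheme):

image:
  repository: timberio/vector
  tag: nightly-2022-08-17-distroless-libc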

General observations:

  • The "error shutting down IO" message from the Vector logs seems to be either benign (which I suspect is likely), not a new issue with 0.24.0, or not specifically tied to the 504 error observed in the Agent logs.

    • This is a debug level print coming from the hyper crate
    • I saw this message in every iteration I tried, regardless of whether I upgraded/restarted the Vector pod.
  • So far I haven't been able to get a 504 error to show in the Agent logs

    • I've seen a 503 error a couple of times between when the Vector pod was uninstalled and when it was re-installed, but it occurred only once in each case, went away after Vector was running again, and didn't show up every time. A 503 while Vector is down seems pretty expected.

My setup:

  • minikube
  • Agent using the helm values outlined below.
  • Vector using the helm values outlined below.
  • A dogstatsd client using the program outlined below.

Permutations I've tried:

  • vector 0.23.3 -> vector 0.24.0
  • vector 0.23.3 restart vector
  • vector 0.24.0 restart vector
  • all of the above with and without restarting the Agent pod
  • variations on the dogstatsd client to adjust the emission rate and number of distribution metrics.
  • variations on the delay between uninstalling vector and re-installing vector

dogstatsd program

from datadog import initialize, statsd
import time
import random
import sys


# Simple timestamped logger so the emission loop is visible in stdout
def log(msg):
    sys.stdout.write('{}: {}\n'.format(time.strftime("%H:%M:%S"), msg))
    sys.stdout.flush()

# Point the dogstatsd client at the Agent's dogstatsd host port
options = {
    'statsd_host': '172.17.0.3',
    'statsd_port': 8125
}

log("initializing")

initialize(**options)

log("initialized")

# Emit a batch of 99 distribution metrics with random values every second
while True:
    log("distribution")

    for n in range(0, 99):
        name = "foo_metric.distribution_" + str(n)
        value = random.randint(0, 999)
        statsd.distribution(name, value, tags=["environment:baz"])

    time.sleep(1)

Agent helm values

datadog:
  apiKey: <redacted>
  containerExclude: "name:vector"
  logs:
    enabled: true
    containerCollectAll: true
  apm:
    enabled: false
clusterAgent:
  enabled: false
agents:
  useConfigMap: true
  image:
    tag: 7.39.1
  customAgentConfig:
    kubelet_tls_verify: false
    vector:
      logs:
        enabled: true
        url: "http://vector-haproxy.default:8282"
      metrics:
        enabled: true
        url: "http://vector-haproxy.default:8282"
    dogstatsd_non_local_traffic: true

ports:
  - containerPort: 8125
    hostPort: 8125
    name: dogstatsdport
    protocol: UDP

Vector helm values

Effectively the same as your provided values, with the exception of tagging specific images and hard-coding the DD API key.
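
Concretely, the deviations were roughly along these lines (placeholder values only, assuming the chart's standard image and env keys):

image:
  repository: timberio/vector
  tag: 0.24.0-distroless-libc # or 0.23.3 / a specific nightly, depending on the run
env:
  - name: DATADOG_API_KEY
    value: "<hard-coded API key>"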

neuronull avatar Dec 07 '22 00:12 neuronull

Hey! Thanks for all the details! I'll try and answer everything here and will come back with deeper answers for some things I need to retrieve/update log levels for.


Do you have any additional log snippets you could provide? For example, is Vector logging any Errors (not debug logs). And, perhaps the error logs from HAProxy?

Let me go get some of those a little later. Pre-planned DR game day going on so a little distracted 😬

Since we're having trouble reproducing it, would you be open to trying some nightly builds between v0.23.3 and 0.24.0, and observing the first build which exhibits the issue? This could give us hints to which commits might have caused the issue and additionally raise confidence in the issue.

🤦 I'm sad I didn't think about doing this already. I will definitely do this.

Have you attempted this upgrade from 0.23.3 to 0.24.0 multiple times or just once?

Multiple times!

Have you performed similar upgrades of Vector in this environment before?

Yeah, we jumped onto vector in the 0.1x versions and slowly bumped up each version until 0.24.0


One other piece of context that might be helpful: this issue first appeared in our largest staging environment, so I'm wondering if it's related to volume. Are y'all running load testing benchmarks in CI? Screenshot to show the load where we're first encountering the error.

Screenshot 2022-12-07 at 8 24 48 AM

jonwinton avatar Dec 07 '22 16:12 jonwinton

I will definitely do this.

Thanks a bunch!

One other piece of context that might be helpful: this issue first appeared in our largest staging environment, so I'm wondering if it's related to volume. Are y'all running load testing benchmarks in CI? Screenshot to show the load where we're first encountering the error.

We do some load testing within our soak regression tests: generally those soak tests try to hit Vector as hard as the allocated resources allow. But we don't have explicit testing to evaluate usage under different loads.

It's good to know that it might be volume-related; I can take that into consideration for repro efforts going forward.

neuronull avatar Dec 08 '22 20:12 neuronull

So sorry for the delay @neuronull, but I'm working through the nightlies now. Will get you a report this week so it's ready for the new year 🙇

jonwinton avatar Dec 27 '22 20:12 jonwinton

@neuronull ok here are the results:

  • I found that the regression occurs with the nightly-2022-08-17-distroless-libc tag
  • The changes for that build can be found here
  • I know we've documented it before, but it's the same behavior as before once that nightly tag lands

Screenshot 2022-12-27 at 2 27 48 PM

jonwinton avatar Dec 27 '22 22:12 jonwinton

Thanks for bisecting that down @jonwinton ! Just to confirm, do I take the above to mean that nightly-2022-08-16-distroless-libc did not exhibit the issue?

jszwedko avatar Dec 27 '22 22:12 jszwedko

@jszwedko correct! Here's a screenshot showing the version was nightly-2022-08-16-distroless-libc right before deploying the 2022-08-17 tag and everything was working as expected.

I'm working with Datadog Support to give you access to a notebook with the entire breakdown, but screenshots for now.

Screenshot 2022-12-27 at 3 06 38 PM

jonwinton avatar Dec 27 '22 23:12 jonwinton

Thanks @jonwinton ! That would seem to narrow down the suspicious commits to https://github.com/vectordotdev/vector/compare/ba1e1c4...01c7a36 then. Nothing immediately jumps out, but it should give us a much better place to start investigating from. The person investigating this is OOO this week, but we'll pick it back up in the new year. Appreciate all of the added context!

jszwedko avatar Dec 27 '22 23:12 jszwedko

Sounds good! Yeah I was beginning to look at https://github.com/vectordotdev/vector/pull/13973 since we only see the errors come up for distribution data, but I need a Rust pro to help out. That said, if any tagged releases can be pushed out that have any changes to test in our env, I can easily deploy them.

See y'all next year!

jonwinton avatar Dec 27 '22 23:12 jonwinton

Thanks for going through that effort and providing all that data @jonwinton , really helps!

I believe the hdrhistogram crate updated in #13973 can be ruled out as it only seems to be used in some example code.

After looking at that commit range, the one that stands out most suspiciously is #13960.

At first glance it looks like it's unrelated, because the code only operates over Counter metric types, and here the affected type is distribution.

But then I noticed this in the helm values ...

"collect_counters_with_distributions": true

I'm currently working on finding out what that actually translates to in terms of the Agent's behavior.

But so far that commit is really the only one that jumps out to me.

We could also try a test of a private build containing the commits from 8-17 but minus this one.

neuronull avatar Jan 03 '23 19:01 neuronull

@neuronull dang, nice digging! 🙇

A private image works, or if you can give me the build commands for generating the libc image, I can push an image into our private ECR repo. I tried digging into the build pipeline, but my lack of Rust familiarity is holding me back a bit.

jonwinton avatar Jan 04 '23 00:01 jonwinton

Thanks @jonwinton !

Here are the commands to get to the correct state of the repo:

  1. git checkout 01c7a36
  2. git revert ef22eb9
     a. To resolve the conflict, open lib/vector-core/Cargo.toml and change line 89 to test = ["vector-common/test"]

To build the docker image to your local cache:

  1. make package-deb-x86_64-unknown-musl
     a. (you can select different architectures too; see the available ones in the Makefile)
  2. PUSH=false ./scripts/build-docker.sh

Let us know how that goes or if you have any problems! Very curious to hear the result.

neuronull avatar Jan 04 '23 21:01 neuronull

@jonwinton , sorry for the spam- Ignore that last comment's instructions (there are some incorrect bits). We're working on improving the procedure to follow. In the meantime, I'll create the image and push it to the vector repo for you.

neuronull avatar Jan 04 '23 22:01 neuronull

Alright, this image should do the trick @jonwinton. It contains that nightly build minus ef22eb9.

timberio/vector:39c98e3-distroless-libc

neuronull avatar Jan 05 '23 15:01 neuronull

Perfect! I'll test this now

jonwinton avatar Jan 05 '23 16:01 jonwinton

@neuronull confirmed that we don't see the same issue with this version! Thank you for pushing that version out 🙇

I guess next steps would be entirely on y'all's end?

jonwinton avatar Jan 05 '23 16:01 jonwinton

Oh dang, looking at that PR more, we're interested in maintaining the fix for this bug: https://github.com/vectordotdev/vector/issues/13870

We're also dealing with interval issues and this would be helpful once we can safely upgrade 😬

jonwinton avatar Jan 05 '23 16:01 jonwinton