Datadog Sink: Multiple values for tags

Open joycse06 opened this issue 1 year ago • 14 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Context

We are trying to migrate our metrics pipeline from Vector Aggregator (running version 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc) to Vector Daemonset (running version 0.40.0) and have bumped into an issue where some Datadog metrics have multiple values for some tags (e.g. service).

The configurations of the two pipelines are similar (no material difference in the metric transformations that could have caused this).

Where double tagging is happening

We have pods with multiple containers, and we override the service name for the envoy sidecar with DD_SERVICE (adding a -envoy suffix) so we can track metrics like resource utilisation separately for them.
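For context, the override amounts to setting DD_SERVICE in the sidecar container's environment, roughly like this hypothetical pod spec fragment (names are illustrative):

    containers:
      - name: envoy
        env:
          - name: DD_SERVICE     # overrides the service tag for this container
            value: "serviceA-envoy"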

The metrics are scraped by the Datadog Agent, and this works correctly with the Aggregator pipeline (running version 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc). However, since moving metrics to the Vector Daemonset pipeline (running Vector version 0.40.0), we are seeing the service tag take values like serviceA,serviceA-envoy (it includes the service tag of both the envoy container and the actual service container).

Here's an example (screenshot: Screen Shot 2024-09-24 at 10 53 28 AM).

As you can see, there are multiple values for the service tag on the kubernetes.cpu.usage.total metric for the envoy sidecar.

I have captured output from the last stage of the transformation and piped it into a console sink, and the events look similar (here are the outputs from both pipelines: vector-daemonset-blip-manager-cpu-metric.json, vector-deployment-blip-manager-cpu-metric.json).
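(For reference, a minimal console sink of the kind used for this capture might look like the following; the sink name is illustrative, and the input is the last transform in the pipeline, tag_metrics in the config below:)

    sinks:
      debug_console:
        type: console
        inputs:
          - tag_metrics
        encoding:
          codec: json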

I have been working with Datadog Support and they suggested creating an issue here to see if you can provide any further insight.

Configuration

# Relevant portion; we are doing some more transformations, but none of them affect the metric we are having problems with
data_dir: /vector-data-dir
expire_metrics_secs: 60

api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false

sources:
  # Datadog Agent telemetry
  datadog_agent:
    type: datadog_agent
    address: "0.0.0.0:6000"
    multiple_outputs: true # To automatically separate metrics and logs
transforms:
  tag_metrics:
    type: remap
    inputs:
      - datadog_agent.metrics
    source: |
      if .tags.service == null {
        .tags.service = "unknown"
      }

      if .tags.version == null {
        .tags.version = "unknown"
      }
sinks:
  datadog_metrics:
    type: datadog_metrics
    inputs:
      - tag_metrics
    default_api_key: ${DATADOG_API_KEY}
    buffer:
      when_full: drop_newest
      max_events: 100000

Version

0.40.0

Debug Output

No response

Example Data

No response

Additional Context

We attempted to upgrade the vector version in our Aggregator pipeline earlier but bumped into performance issues (https://github.com/vectordotdev/vector/issues/15292). We later decided to move to a Daemonset based pipeline.

References

No response

joycse06 avatar Sep 24 '24 01:09 joycse06

We are wondering if something has changed between vector version 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc and 0.40.0 that could have caused this.

As they were released ~1 yr apart, there must have been a lot of changes, too many to go through them all. We use the daemonset pipeline for log processing as well, so it's not easy for us to step through the versions in between to find when this double tagging issue was introduced.

I initially thought it might have something to do with the metric_tag_values option of remap, but the captured output confirms the DD agent is only sending a single value for the service tag, so that's not the case.

joycse06 avatar Sep 24 '24 01:09 joycse06

Thanks for the detailed report! Have you tried either of the following combinations?

  • 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc deployed as a daemonset
  • 0.40.0 deployed as an aggregator

That might help isolate whether the issue is with the different image or whether it only happens when you switch from aggregator to daemonset.

jszwedko avatar Sep 26 '24 23:09 jszwedko

Have you tried either of the following combinations?

  • 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc deployed as a daemonset
  • 0.40.0 deployed as an aggregator

We haven't tried those yet. I can try deploying 0.40.0 as an aggregator and see if the problem occurs there as well. Deploying 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc as a daemonset will require a bit more work, as we are processing logs in that pipeline as well.

I will try it out and report back.

joycse06 avatar Sep 30 '24 00:09 joycse06

@jszwedko Deployed the 0.40.0 binary into the Aggregator deployment and the problem persisted. So it's not a problem arising from the switch from aggregator to daemonset.

I then went on a journey to bisect the versions to find out which version introduced this bug: it was introduced in 0.26.0. It works as expected up to 0.25.0.

~I don't know enough Rust to figure out which code change the bug is coming from but this PR: https://github.com/vectordotdev/vector/pull/12436 looks like a good candidate~. (https://github.com/vectordotdev/vector/pull/12436 was introduced in 0.27.0 so it's not in scope).

Please let me know if you need anything else from me.

joycse06 avatar Oct 01 '24 00:10 joycse06

Here's a snapshot of 0.26.0 introducing double tagging into the metric. You can see the metrics were correctly tagged, and then suddenly a new series with the doubled tag appears:

Screen Shot 2024-10-01 at 8 58 52 AM

joycse06 avatar Oct 01 '24 00:10 joycse06

@jszwedko Deployed the 0.40.0 binary into the Aggregator deployment and the problem persisted. So it's not a problem arising from the switch from aggregator to daemonset.

Could you try deploying 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc as a daemonset too? I'd just like to rule that out as a factor, though it seems like the issue was a code change based on the bisecting you did.

I then went on a journey to bisect the versions to find out which version introduced this bug: it was introduced in 0.26.0. It works as expected up to 0.25.0.

~I don't know enough Rust to figure out which code change the bug is coming from but this PR: #12436 looks like a good candidate~. (#12436 was introduced in 0.27.0 so it's not in scope).

Please let me know if you need anything else from me.

Thanks for narrowing this down! That should help significantly. I'll take a look through the commits and see if anything jumps out. If not, we may need a bit more info to help reproduce.

One thing that might be useful is the Datadog Agent version. Could you provide that?

jszwedko avatar Oct 09 '24 21:10 jszwedko

I think I see something. v0.26.0 was the first version where we started to add support for tags that have multiple values. Prior to v0.26.0, tags could only be a mapping of a key to a single value, but with v0.26.0 it started to become possible to have tags that are a mapping from a key to multiple values. This was officially announced with v0.27.0: https://vector.dev/highlights/2022-12-22-enhanced-metric-tags/
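To illustrate (a sketch based on the description above, using the tag values from earlier in this thread), the same incoming pair of service tags would decode roughly as:

    Before v0.26.0 (single-value tags, last value seen wins):
      "tags": {"service": "serviceA-envoy"}

    From v0.26.0 (multi-value tags, all values kept):
      "tags": {"service": ["serviceA", "serviceA-envoy"]}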

It is possible that a bug was introduced as part of the feature, but one guess here is that the Datadog Agent was actually always sending metrics with both tags, and that prior to v0.26.0, Vector was just discarding all but one of the values (the last one it sees).

To validate this, you could use tcpdump to grab a capture of the requests going into Vector. From there, we could see whether the incoming requests have metrics where the tags have multiple values. You could also use vector tap --outputs-of datadog_agent to show whether there are multiple tag values on the metric as it is decoded by the datadog_agent source (granted, there could be a bug in the datadog_agent source, but this would help narrow it down at least).
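For example (a sketch, assuming the datadog_agent source listens on port 6000 as in the config above):

    # capture the Datadog Agent -> Vector traffic for offline inspection
    tcpdump -i any -w dd-agent-to-vector.pcap 'tcp port 6000'

    # or inspect events as decoded by the source (requires the Vector API,
    # which is enabled on 127.0.0.1:8686 in the config above)
    vector tap --outputs-of datadog_agent.metrics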

jszwedko avatar Oct 09 '24 23:10 jszwedko

This is the commit, specifically, that changed this behavior: https://github.com/vectordotdev/vector/commit/b26dede3e399d639b952a3da557785e6e2618a0f#diff-70985dbf22cadd829a20e225862502f5eac5890abf851428498603a614df1ef4

jszwedko avatar Oct 09 '24 23:10 jszwedko

I will try vector tap --outputs-of datadog_agent and tcpdump on the DD agent.

One thing that might be useful is the Datadog Agent version. Could you provide that?

We were using Datadog Agent version 7.49.0 but are now upgrading to the latest version (7.57.2). It happens with both of these versions.

but one guess here is that the Datadog Agent was actually always sending metrics with both tags, and that prior to v0.26.0, Vector was just discarding all but one of the values (the last one it sees).

I was thinking of that as well at some point; that's why I added a console sink to print out the affected metric, so I could see the actual tags before they hit the Datadog sink. But taking a tcpdump or vector tap would be better.

Could you try deploying 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc as a daemonset too?

This is not easy, as we are probably using features of Vector that were not available in that version, and we also process logs through the daemonsets, so that pipeline would be affected as well. But I will see what I can do. I will try tcpdump with the Datadog Agent and vector tap first, then consider this.

Thanks. I will report the results back as soon as I can.

joycse06 avatar Oct 15 '24 00:10 joycse06

Tried vector tap --outputs-of datadog_agent.metrics | grep --line-buffered "cpu\.usage\.total.*istio-proxy.*" and found some metric data points with multiple values for the service tag; the datadog-agent is indeed sending multiple values for the service tag on those metrics. Example:

{"name":"cpu.usage.total","namespace":"kubernetes","tags":{"service":["blip-manager","blip-manager-envoy"]},"timestamp":"2024-10-15T02:02:56Z","kind":"absolute","gauge":{"value":7529716.172717634}}

I will reach out to Datadog Support to figure out why the agent is doing that. In the meantime, I also tried setting metric_tag_values: single in one of our remap transformations, which, according to the docs, should reduce the tag values to a single string (the last assigned value) if I am not mistaken, but I am not seeing any change in the output of that transformation. Am I doing anything wrong?

Here's the config:

tag_metrics:
  type: remap
  metric_tag_values: single
  inputs:
    - datadog_agent.metrics

If metric_tag_values is not meant to be used like this, can we use VRL somehow to convert the value of the service tag into a single string (the last value)?

joycse06 avatar Oct 15 '24 03:10 joycse06

I tried putting in the following VRL check in one of our transformations:

    if is_array(.tags.service) && .tags.container_name == "istio-proxy" {
        .tags.service = to_string!(.tags.service[-1])
    }

And a test:

- name: 'metrics: datadog agent metrics tagging : service tag is an array'
  inputs:
    - insert_at: tag_metrics
      type: metric
      metric:
        name: "website_hits"
        kind: "absolute"
        counter:
          value: 1
        tags:
          service: ["larry", "larry-envoy"]
          container_name: "istio-proxy"
  outputs:
    - extract_from: tag_metrics
      conditions:
        - type: vrl
          source: |-
            assert_eq!(.tags.service, "larry-envoy")
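(Tests like this run with Vector's built-in test runner; the config path below is illustrative:)

    vector test /etc/vector/vector.yaml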

The test is passing, but when I inspect the output of that transformation, I still see multiple values for the service tag.

joycse06 avatar Oct 15 '24 04:10 joycse06

Glad to hear it seems like we've narrowed this issue down. Hopefully Datadog Support can help clear up why you are seeing multiple service tags.

I will reach out to Datadog Support to figure out why the agent is doing that. In the meantime, I also tried setting metric_tag_values: single in one of our remap transformations, which, according to the docs, should reduce the tag values to a single string (the last assigned value) if I am not mistaken, but I am not seeing any change in the output of that transformation. Am I doing anything wrong?

That option actually just configures how tags are exposed in a remap transform, but it doesn't modify the tag values unless you explicitly modify the tag. I can see why you would have expected that to work though.

I tried putting in the following VRL check in one of our transformations:

    if is_array(.tags.service) && .tags.container_name == "istio-proxy" {
        .tags.service = to_string!(.tags.service[-1])
    }

And a test:

- name: 'metrics: datadog agent metrics tagging : service tag is an array'
  inputs:
    - insert_at: tag_metrics
      type: metric
      metric:
        name: "website_hits"
        kind: "absolute"
        counter:
          value: 1
        tags:
          service: ["larry", "larry-envoy"]
          container_name: "istio-proxy"
  outputs:
    - extract_from: tag_metrics
      conditions:
        - type: vrl
          source: |-
            assert_eq!(.tags.service, "larry-envoy")

The test is passing, but when I inspect the output of that transformation, I still see multiple values for the service tag.

I am able to reproduce this. It seems like a bug 🤔 I'll take a closer look. As a workaround this seems to work:

  transform0:
    inputs:
    - source0
    type: remap
    metric_tag_values: single
    source: |
      .tags = .tags

That overwrites .tags with its "single-value" form.
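Applied to the tag_metrics transform from the original config, that would look roughly like this (a sketch, untested against this exact pipeline):

    tag_metrics:
      type: remap
      metric_tag_values: single
      inputs:
        - datadog_agent.metrics
      source: |
        # collapse any multi-value tags down to their last value
        .tags = .tags

        if .tags.service == null {
          .tags.service = "unknown"
        }

        if .tags.version == null {
          .tags.version = "unknown"
        }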

jszwedko avatar Oct 15 '24 22:10 jszwedko

I filed https://github.com/vectordotdev/vector/issues/21512 to track the bug you seem to have run into.

jszwedko avatar Oct 15 '24 22:10 jszwedko

Thanks for taking a look. I will go with .tags = .tags and metric_tag_values: single to unblock us for now while we work with Datadog Support to fix the root cause.

joycse06 avatar Oct 16 '24 00:10 joycse06