Datadog Sink: Multiple values for tags
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Context
We are trying to migrate our metrics pipeline from Vector Aggregator (running version 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc) to Vector Daemonset (running version 0.40.0) and have bumped into an issue where some Datadog metrics have multiple values for some tags (e.g. service).
The configurations for both pipelines are similar (there is no material difference in the metric transformations that could have caused this).
Where double tagging is happening
We have pods with multiple containers, and we override the service name for the envoy sidecar with DD_SERVICE (adding a -envoy suffix) so we can track metrics like resource utilisation separately for them.
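For illustration, a rough sketch of what that override looks like in the pod spec (only the relevant env vars are shown; the container names mirror the blip-manager example discussed later):

containers:
  - name: blip-manager
    env:
      - name: DD_SERVICE
        value: blip-manager
  - name: istio-proxy
    env:
      - name: DD_SERVICE
        value: blip-manager-envoy # "-envoy" suffix so the sidecar's metrics are tracked separately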
The metrics are scraped by the Datadog Agent, and this works correctly with the Aggregator pipeline (running version 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc). However, since moving metrics to the Vector Daemonset pipeline (running Vector 0.40.0), we are seeing service tag values like serviceA,serviceA-envoy (i.e. the service tags of both the envoy container and the actual service container).
Here's an example:
As you can see, there are multiple values for the service tag of the kubernetes.cpu.usage.total metric for the envoy sidecar.
I have captured the output of the last transformation stage and piped it into a console sink, and the events look similar in both pipelines (here is the output from each: vector-daemonset-blip-manager-cpu-metric.json, vector-deployment-blip-manager-cpu-metric.json).
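(For reference, the console sink used for this capture was nothing special; a minimal sketch, with an illustrative sink name:)

debug_metrics:
  type: console
  inputs:
    - tag_metrics
  encoding:
    codec: json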
I have been working with Datadog Support and they have suggested creating an issue here to see if you can provide any further insight.
Configuration
# Relevant portion; we do some additional transformations, but none of them affect the metric we are having problems with
data_dir: /vector-data-dir
expire_metrics_secs: 60
api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false
sources:
  # Datadog Agent telemetry
  datadog_agent:
    type: datadog_agent
    address: "0.0.0.0:6000"
    multiple_outputs: true # To automatically separate metrics and logs
transforms:
  tag_metrics:
    type: remap
    inputs:
      - datadog_agent.metrics
    source: |
      if .tags.service == null {
        .tags.service = "unknown"
      }
      if .tags.version == null {
        .tags.version = "unknown"
      }
sinks:
  datadog_metrics:
    type: datadog_metrics
    inputs:
      - tag_metrics
    default_api_key: ${DATADOG_API_KEY}
    buffer:
      when_full: drop_newest
      max_events: 100000
Version
0.40.0
Debug Output
No response
Example Data
No response
Additional Context
We attempted to upgrade the Vector version in our Aggregator pipeline earlier but bumped into performance issues (https://github.com/vectordotdev/vector/issues/15292). We later decided to move to a Daemonset-based pipeline.
References
No response
We are wondering if something has changed between Vector versions 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc and 0.40.0 that could have caused this.
As they were released roughly a year apart, there have been a lot of changes, too many to go through them all. We use the daemonset pipeline for log processing as well, so it's not easy for us to try other versions in between to find when this double-tagging issue was introduced.
I initially thought it might have something to do with the metric_tag_values option of remap, but the captured output confirms the DD agent is only sending a single value for the service tag, so that's not the case.
Thanks for the detailed report! Had you tried either of the following combinations:
- 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc deployed as a daemonset
- 0.40.0 deployed as an aggregator
That might help isolate whether the issue is with the different image or whether it only happens when you switch from aggregator to daemonset.
Had you tried either of the following combinations:
- 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc deployed as a daemonset
- 0.40.0 deployed as an aggregator
We haven't tried those yet; I can try deploying 0.40.0 as an aggregator and see if the problem occurs there as well. Deploying 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc as a daemonset will require a bit more work as we are processing logs in that pipeline as well.
I will try it out and report back.
@jszwedko Deployed the 0.40.0 binary into the Aggregator deployment and the problem persisted. So it's not a problem arising from the switch from aggregator to daemonset.
I then went on a journey to bisect the versions to find out which version introduced this bug and it was introduced in 0.26.0. It works as expected up to 0.25.0.
~I don't know enough Rust to figure out which code change the bug is coming from but this PR: https://github.com/vectordotdev/vector/pull/12436 looks like a good candidate~. (https://github.com/vectordotdev/vector/pull/12436 was introduced in 0.27.0 so it's not in scope).
Please let me know if you need anything else from me.
Here's a snapshot of 0.26.0 introducing double tagging into the metric. You can see the metrics were correctly tagged and then suddenly a new series with the doubled tag appears:
@jszwedko Deployed the 0.40.0 binary into the Aggregator deployment and the problem persisted. So it's not a problem arising from the switch from aggregator to daemonset.
Could you try deploying 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc as a daemonset too? I'd just like to rule that out as a factor, though it seems like the issue was a code change based on the bisecting you did.
I then went on a journey to bisect the versions to find out which version introduced this bug and it was introduced in 0.26.0. It works as expected up to 0.25.0. ~I don't know enough Rust to figure out which code change the bug is coming from but this PR: #12436 looks like a good candidate~. (#12436 was introduced in 0.27.0 so it's not in scope). Please let me know if you need anything else from me.
Thanks for narrowing this down! That should help significantly. I'll take a look through the commits and see if anything jumps out. If not, we may need a bit more info to help reproduce.
One thing that might be useful is the Datadog Agent version. Could you provide that?
I think I see something. v0.26.0 was the first version where we started to add support for tags that have multiple values. Prior to v0.26.0, tags could only be a mapping of a key to a single value, but starting with v0.26.0 it became possible for a tag key to map to multiple values. This was officially announced with v0.27.0: https://vector.dev/highlights/2022-12-22-enhanced-metric-tags/
It is possible that a bug was introduced as part of that feature, but one guess here is that the Datadog Agent was actually always sending metrics with both tags, and that prior to v0.26.0 Vector was simply discarding all but one of the values (the last one it saw).
To validate this, you could use tcpdump to grab a capture of the requests going into Vector. From there, we could see if the incoming requests have metrics where the tags have multiple values. You could also use vector tap --outputs-of datadog_agent to show whether there are multiple tags on the metric as it is decoded by the Datadog Agent source (granted, there could be a bug in the Datadog Agent source, but this would at least help narrow it down).
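For example (assuming the datadog_agent source address 0.0.0.0:6000 from the configuration above; the interface and capture file name are illustrative), something along these lines on the node running Vector:

# Capture the Datadog Agent -> Vector traffic for later inspection
tcpdump -i any -w dd-agent-to-vector.pcap port 6000

# Inspect metrics as they are decoded by the datadog_agent source
vector tap --outputs-of datadog_agent.metrics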
This is the commit, specifically, that changed this behavior: https://github.com/vectordotdev/vector/commit/b26dede3e399d639b952a3da557785e6e2618a0f#diff-70985dbf22cadd829a20e225862502f5eac5890abf851428498603a614df1ef4
I will try vector tap --outputs-of datadog_agent and tcpdump on the DD agent.
One thing that might be useful is the Datadog Agent version. Could you provide that?
We were using Datadog Agent version 7.49.0 but are now upgrading to the latest version (7.57.2). It happens with both of these versions.
but one guess here is that the Datadog Agent was actually always sending metrics with both tags, and that prior to v0.26.0 Vector was simply discarding all but one of the values (the last one it saw).
I was thinking of that as well at some point; that's why I added a console sink to print out the affected metric before the Datadog sink, so I could see the actual tags before they hit it. But taking a tcpdump or using vector tap would be better.
Could you try deploying 2023-01-27_backport_dd_metrics_interval_fix-distroless-libc as a daemonset too?
This is not easy, as we are probably using features of Vector that were not available in that version, and we also process logs through the daemonsets, so that pipeline would be affected as well. But I will see what I can do. I will try tcpdump with the Datadog Agent and vector tap first, then will consider this.
Thanks. I will report the results back as soon as I can.
Tried vector tap --outputs-of datadog_agent.metrics | grep --line-buffered "cpu\.usage\.total.*istio-proxy.*" and found some metric data points that have multiple values for the service tag; the datadog-agent is indeed sending multiple values for the service tag on those metrics. Example:
{"name":"cpu.usage.total","namespace":"kubernetes","tags":{"service":["blip-manager","blip-manager-envoy"]},"timestamp":"2024-10-15T02:02:56Z","kind":"absolute","gauge":{"value":7529716.172717634}}
I will reach out to Datadog support to figure out why the agent is doing that, but in the meantime I also tried setting metric_tag_values: single in one of our remap transformations, which according to the docs should reduce the tag values to a single string (the last assigned value) if I am not mistaken, but I am not seeing any change in the output of that transformation. Am I doing anything wrong?
Here's the config:
tag_metrics:
  type: remap
  metric_tag_values: single
  inputs:
    - datadog_agent.metrics
If metric_tag_values is not meant to be used like this, can we use VRL somehow to convert the value of the service tag into a single string (the last value)?
I tried putting in the following VRL check in one of our transformations:
if is_array(.tags.service) && .tags.container_name == "istio-proxy" {
  .tags.service = to_string!(.tags.service[-1])
}
And a test:
- name: 'metrics: datadog agent metrics tagging : service tag is an array'
  inputs:
    - insert_at: tag_metrics
      type: metric
      metric:
        name: "website_hits"
        kind: "absolute"
        counter:
          value: 1
        tags:
          service: ["larry", "larry-envoy"]
          container_name: "istio-proxy"
  outputs:
    - extract_from: tag_metrics
      conditions:
        - type: vrl
          source: |-
            assert_eq!(.tags.service, "larry-envoy")
The test is passing, but when I inspect the output of that transformation, I still see multiple values for the service tag.
Glad to hear it seems like we've narrowed this issue down. Hopefully Datadog Support can help clear up why you are seeing multiple service tags.
I will reach out to Datadog support to figure out why the agent is doing that, but in the meantime I also tried setting metric_tag_values: single in one of our remap transformations, which according to the docs should reduce the tag values to a single string (the last assigned value) if I am not mistaken, but I am not seeing any change in the output of that transformation. Am I doing anything wrong?
That option actually just configures how tags are exposed in a remap transform, but it doesn't modify the tag values unless you explicitly modify the tag. I can see why you would have expected that to work though.
I tried putting in the following VRL check in one of our transformations:

if is_array(.tags.service) && .tags.container_name == "istio-proxy" {
  .tags.service = to_string!(.tags.service[-1])
}

And a test: [...]

The test is passing, but when I inspect the output of that transformation, I still see multiple values for the service tag.
I am able to reproduce this. It seems like a bug 🤔 I'll take a closer look. As a workaround this seems to work:
transform0:
  inputs:
    - source0
  type: remap
  metric_tag_values: single
  source: |
    .tags = .tags
That overwrites .tags with its "single-value" form.
I filed https://github.com/vectordotdev/vector/issues/21512 to track the bug you seem to have run into.
Thanks for taking a look. I will go with .tags = .tags and metric_tag_values: single to unblock us for now while we work with Datadog Support to fix the root cause.
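For completeness, here is roughly what the tag_metrics transform looks like with that workaround folded in (a sketch combining the snippets above; illustrative rather than a verbatim copy of the config):

tag_metrics:
  type: remap
  metric_tag_values: single
  inputs:
    - datadog_agent.metrics
  source: |
    # Re-assigning .tags writes back the "single-value" form of each tag
    .tags = .tags
    if .tags.service == null {
      .tags.service = "unknown"
    }
    if .tags.version == null {
      .tags.version = "unknown"
    }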