Consistent aggregation for rate metrics
Use Cases
Right now our metrics pipeline consists of the various DogStatsD client libraries, the Datadog Agent, and then goes straight to Datadog's backend.
We would like to introduce Vector as an aggregation layer because metric costs are growing from services with a large number of host tags. With Vector we could aggregate those tags away (replacing them with some kind of Vector instance ID tag) without risking data loss in Datadog.
The problem is that when using DogStatsD, all metrics submitted in-app as a count are converted to a rate inside the agent's process when it flushes the data. Vector interprets these rate metrics as count metrics with the interval_ms field set to the agent's flush interval, and when they reach the Datadog sink, it submits them as rates.
All of that works fine until you try to aggregate the rate metrics. Vector appears to handle aggregation of multiple points by summing all the count values and setting interval_ms to the total interval of all points included in the window. This results in an inconsistent interval_ms value within a given timeseries. Datadog cannot handle that inconsistency: it is designed to assume all datapoints for a given metric name always have the same interval (though you can change what that interval is in the backend).
The inconsistency arises because hosts' submissions are offset from each other, and because the Datadog agent has an unconfigurable 10s flush and 15s report interval, each host alternates between sending one or two datapoints per timeseries per window. We also need to ensure no two points in a timeseries are sent with the same timestamp, or Datadog will drop them.
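To make the failure mode concrete, here is an illustrative sketch (the counts and intervals below are made up, not captured from a real pipeline). Once the host tag is aggregated away, points from different hosts land in the same series, and a window that catches one flush reports a different interval than a window that catches two:

# One aggregate window happens to catch a single flush:
- counter: { value: 5.0 }
  interval_ms: 10000
# output: counter.value 5.0, interval_ms 10000

# The next window catches two flushes (hosts are offset from each other):
- counter: { value: 5.0 }
  interval_ms: 10000
- counter: { value: 3.0 }
  interval_ms: 10000
# output: counter.value 8.0, interval_ms 20000
#
# interval_ms now alternates between 10000 and 20000 within the same
# timeseries, which Datadog's fixed-interval assumption rejects.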
Attempted Solutions
We have tried a number of things that somewhat work for other metric types, but not for rates.
Changing the timestamp to Vector's clock time does help with misattribution and data duplication for non-rate types, but for rates you still have inconsistency in the reported interval.
We have also tried modifying the interval_ms field with VRL, but it seems this field cannot be edited because it is not exposed in the VRL event model.
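For reference, this is roughly the remap we tried (a sketch, not our exact config; the source name is made up for illustration):

transforms:
  normalize:
    type: remap
    inputs:
      - dogstatsd_in  # hypothetical source name
    source: |
      # Overwriting the timestamp works and helps non-rate types:
      .timestamp = now()
      # Overwriting the interval is silently ignored:
      .interval_ms = 10000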
Proposal
There are a few things that would allow us to get the outcome we want:
- A setting on the Aggregate transform to always assign interval_ms the same value as the transform's own interval parameter (see the sketch after this list)
- The ability to modify this field via VRL
- A timestamp-based bucketing transform for aggregation which also allows a window to stay open for some set duration (for example, aggregate events based on their timestamp into 30s buckets but hold the window open for 60s)
- The same as the above as a standalone transform, similar to window but purely time based, might help too
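For the first option, a minimal sketch of what the setting could look like. The normalize_interval_ms option name is hypothetical, made up purely for illustration; it does not exist today:

transforms:
  agg:
    type: aggregate
    inputs:
      - dogstatsd_in  # hypothetical source name
    interval_ms: 30000
    # Hypothetical option: force every emitted point to report the
    # aggregation window as its interval, regardless of the intervals
    # on the input points.
    normalize_interval_ms: true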
I am curious what workarounds might already exist, whether people have dealt with aggregating rate metrics from DogStatsD before, or whether there is some other recommended pattern to accomplish the same goal.
This draft PR gives a rough idea of what I am talking about: https://github.com/vectordotdev/vector/pull/23190. Sorry if it is rough reading; I have never written Rust, I am just trying to give a direct example of the kind of functionality I mean in code.
Essentially, an option to treat a rate metric as though its interval is the Vector aggregation interval rather than the original one.
When you aggregate other metrics there is always some risk of misattribution, since the aggregator is based on the Vector window rather than the metrics' timestamps, so that would still be the case here.
Hi @rcassetta-figma, thank you for creating this issue.
Edit: See my response below. Please let me know if I misunderstood something.
Ok, I think I see what you mean now, and I am leaning towards allowing users to overwrite interval_ms in VRL. There's precedent for overwriting timestamp (per https://vector.dev/docs/reference/configuration/transforms/remap/#event-data-model), so it is surprising that interval_ms silently fails.
Config
schema:
  log_namespace: false
api:
  enabled: true
sources:
  s0:
    type: static_metrics
    interval_secs: 1
    metrics:
      - name: response_time
        kind: incremental
        value:
          counter:
            value: 1
        tags:
          a: "b"
transforms:
  t0:
    type: remap
    inputs:
      - s0
    metric_tag_values: full
    source: |
      # silently fails
      .interval_ms = 100
  t1:
    type: aggregate
    inputs:
      - t0
    interval_ms: 5000
sinks:
  console:
    type: console
    inputs: [ "t1" ]
    encoding:
      codec: json
      json:
        pretty: true
Sample aggregated metric
{
  "name": "response_time",
  "namespace": "static",
  "tags": {
    "a": "b"
  },
  "timestamp": "2025-06-12T17:42:19.559497Z",
  "interval_ms": 4999,
  "kind": "incremental",
  "counter": {
    "value": 5.0
  }
}
Being able to overwrite this would be very workable, since it would allow us to enforce the consistent interval that Datadog requires. I wasn't sure how to do this myself straight away, or whether it would be okay (since it requires a change to that data model).
Good news, @thomasqueirozb prepared a PR for this: https://github.com/vectordotdev/vector/issues/23183! After this is released, you will be able to add a remap after the aggregate transform and edit that field.
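Assuming the linked change ships as described, a minimal sketch of the workaround (adapted from the config above) would be a remap after the aggregate that pins the interval to the window size:

transforms:
  t1:
    type: aggregate
    inputs:
      - t0
    interval_ms: 5000
  t2:
    type: remap
    inputs:
      - t1
    source: |
      # Pin every aggregated point to the window size so interval_ms
      # is consistent across the timeseries.
      .interval_ms = 5000
sinks:
  console:
    type: console
    inputs: [ "t2" ]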
Amazing! Thanks!