Consistent aggregation for rate metrics
Use Cases
Right now our metrics pipeline consists of the various DogStatsD client libraries, the Datadog Agent, and then goes straight to Datadog's backend.
We would like to introduce Vector as an aggregation layer because metric costs are growing from services with a large number of host tags. With Vector we could aggregate those tags away (replacing them with some kind of Vector instance ID tag) without risking data loss in Datadog.
The problem is that when using DogStatsD, all metrics submitted in-app as a count are converted to a rate inside the agent's process when it flushes the data. Vector interprets these rate metrics as count metrics with the interval_ms field set to the agent's flush interval, and when they reach the Datadog sink, it submits them as rates.
All of that works fine until you try to aggregate the rate metrics. Vector appears to handle aggregation of multiple points by summing all the count values and setting interval_ms to the total interval of all points included in the window. This results in an inconsistent interval_ms value within a given timeseries. Datadog cannot handle that inconsistency: it is designed to assume all datapoints for a given metric name always have the same interval (though you can change what that interval is in the backend).
The inconsistency arises because hosts' submissions are offset from each other, and because the Datadog agent has an unconfigurable 10s flush and 15s report interval, each host alternates between sending one or two datapoints per timeseries per window. We also need to ensure no two points in a timeseries are sent with the same timestamp, or Datadog will drop them.
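To make the failure mode concrete, here is an illustrative sketch (the counts and intervals below are made up, not captured from a real pipeline). Once the host tag is aggregated away, points from different hosts land in the same series, and a window that catches one flush reports a different interval than a window that catches two:

# One aggregate window happens to catch a single flush:
- counter: { value: 5.0 }
  interval_ms: 10000
# output: counter.value 5.0, interval_ms 10000

# The next window catches two flushes (hosts are offset from each other):
- counter: { value: 5.0 }
  interval_ms: 10000
- counter: { value: 3.0 }
  interval_ms: 10000
# output: counter.value 8.0, interval_ms 20000
#
# interval_ms now alternates between 10000 and 20000 within the same
# timeseries, which Datadog's fixed-interval assumption rejects.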
Attempted Solutions
We have tried a number of things that somewhat work for other metric types, but not for rates.
Changing the timestamp to Vector's clock time does help with misattribution and data duplication for non-rate types, but for rates you still have inconsistency in the reported interval.
We have also tried modifying the interval_ms field with VRL, but it seems this field cannot be edited because it is not exposed in the VRL event model.
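For reference, this is roughly the remap we tried (a sketch, not our exact config; the source name is made up for illustration):

transforms:
  normalize:
    type: remap
    inputs:
      - dogstatsd_in  # hypothetical source name
    source: |
      # Overwriting the timestamp works and helps non-rate types:
      .timestamp = now()
      # Overwriting the interval is silently ignored:
      .interval_ms = 10000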
Proposal
There are a few things that would allow us to get the outcome we want:
- A setting on the Aggregate transform to always assign interval_ms the same value as the transform's own interval parameter (see the sketch after this list)
- The ability to modify this field via VRL
- A timestamp-based bucketing transform for aggregation which also allows a window to stay open for some set duration (for example, aggregate events based on their timestamp into 30s buckets but hold the window open for 60s)
- The same as the above as a standalone transform, similar to window but purely time based, might help too
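For the first option, a minimal sketch of what the setting could look like. The normalize_interval_ms option name is hypothetical, made up purely for illustration; it does not exist today:

transforms:
  agg:
    type: aggregate
    inputs:
      - dogstatsd_in  # hypothetical source name
    interval_ms: 30000
    # Hypothetical option: force every emitted point to report the
    # aggregation window as its interval, regardless of the intervals
    # on the input points.
    normalize_interval_ms: true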
I am curious what workarounds might already exist, whether people have dealt with aggregating rate metrics from DogStatsD before, or whether there is some other recommended pattern to accomplish the same goal.
This draft PR gives a rough idea of what I am talking about: https://github.com/vectordotdev/vector/pull/23190. Sorry if it is rough reading; I have never written Rust, I am just trying to give a direct example of the kind of functionality I mean in code.
Essentially, an option to treat a rate metric as though its interval is the Vector aggregation interval rather than the original one.
When you aggregate other metrics there is always some risk of misattribution, since the aggregator is based on the Vector window rather than the metrics' timestamps, so that would still be the case here.
Hi @rcassetta-figma, thank you for creating this issue.
Edit: See my response below. Please let me know if I misunderstood something.
Ok, I think I see what you mean now, and I am leaning towards allowing users to overwrite interval_ms in VRL. There's precedent for overwriting timestamp (per https://vector.dev/docs/reference/configuration/transforms/remap/#event-data-model), so it is surprising that interval_ms silently fails.
Config
schema:
  log_namespace: false
api:
  enabled: true
sources:
  s0:
    type: static_metrics
    interval_secs: 1
    metrics:
      - name: response_time
        kind: incremental
        value:
          counter:
            value: 1
        tags:
          a: "b"
transforms:
  t0:
    type: remap
    inputs:
      - s0
    metric_tag_values: full
    source: |
      # silently fails
      .interval_ms = 100
  t1:
    type: aggregate
    inputs:
      - t0
    interval_ms: 5000
sinks:
  console:
    type: console
    inputs: [ "t1" ]
    encoding:
      codec: json
      json:
        pretty: true
Sample aggregated metric
{
  "name": "response_time",
  "namespace": "static",
  "tags": {
    "a": "b"
  },
  "timestamp": "2025-06-12T17:42:19.559497Z",
  "interval_ms": 4999,
  "kind": "incremental",
  "counter": {
    "value": 5.0
  }
}
Being able to overwrite this would be very workable, since it would allow us to enforce the consistent interval that Datadog requires. I wasn't sure how to do this myself straight away, or whether it would be okay (since it requires a change to that data model).
Good news, @thomasqueirozb prepared a PR for this: https://github.com/vectordotdev/vector/issues/23183! After this is released, you will be able to add a remap after the aggregate transform and edit that field.
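Assuming the linked change ships as described, a minimal sketch of the workaround (adapted from the config above) would be a remap after the aggregate that pins the interval to the window size:

transforms:
  t1:
    type: aggregate
    inputs:
      - t0
    interval_ms: 5000
  t2:
    type: remap
    inputs:
      - t1
    source: |
      # Pin every aggregated point to the window size so interval_ms
      # is consistent across the timeseries.
      .interval_ms = 5000
sinks:
  console:
    type: console
    inputs: [ "t2" ]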
Amazing! Thanks!