Histograms and counters do not report the same summarized data
Here is the configuration visualization:
The idea is that status_metric_h reports a histogram-type metric from the same source as status_metric_c, which in turn reports a counter-type metric.
The histogram metric is reported as vector.cdn.status_h.ok and the counter as vector.cdn.status_c.ok (besides ok, we also use redirect, notfound, other2xx, other3xx, other4xx, other5xx).
I expect both metrics to have the same sum values in Datadog. However, I see a big difference in Datadog:
I would expect the two charts to be equivalent to each other, but no luck.
Could you please help us understand why we see such a difference?
Relevant transform configs:
[transforms.status_metric_c]
inputs = ["logs_with_ts"]
type = "log_to_metric"
[[transforms.status_metric_c.metrics]]
field = "_const_one"
increment_by_value = true
name = "{{status}}"
namespace = "vector.cdn.status_c"
type = "counter"
# Status metrics with histogram type
[transforms.status_metric_h]
inputs = ["logs_with_ts"]
type = "log_to_metric"
[[transforms.status_metric_h.metrics]]
field = "_const_one"
increment_by_value = true
name = "{{status}}"
namespace = "vector.cdn.status_h"
type = "histogram"
Hi folks. I would like to provide a bit more information. I was able to reproduce the same issue with random data on my personal laptop. Here is what I did. The VRL file looks like this:
# generate fake timestamp
.timestamp = to_int(now()) - parse_int!(slice!(md5(random_bytes(64)), start: 0, end: 3), base:16)
# generate fake status code
.status_code = parse_int!(slice!(md5(random_bytes(64)), start: 0, end: 1), base:16) % 4
##
## Define some constants for the log-to-metric
##
._const_one = 1.0
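# map the fake status code to a status label (interpolated as {{status}} by the log_to_metric transforms)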
if .status_code == 0 {
.status = "ok"
} else if .status_code == 1 {
.status = "redirect"
} else if .status_code == 2 {
.status = "notfound"
} else {
.status = "other"
}
The TOML file is:
acknowledgements.enabled = true
data_dir = "/tmp"
[sources.file_source]
ignore_checkpoints = true
include = ["./ingest/*"]
read_from = "beginning"
remove_after_secs = 0
type = "file"
# parse the input source
[transforms.logparser]
drop_on_error = true
file = "./vrls/cdn_fake.vrl"
inputs = ["file_source"]
reroute_dropped = true
type = "remap"
# Parse the timestamp from the logs
[transforms.logs_with_ts]
inputs = ["logparser"]
type = "coercer"
types.timestamp = "timestamp|%s"
# Status metrics with counter type
[transforms.status_metric]
inputs = ["logs_with_ts"]
type = "log_to_metric"
[[transforms.status_metric.metrics]]
field = "_const_one"
increment_by_value = true
name = "{{status}}"
namespace = "vector.cdn.status_c"
type = "counter"
# Status metrics with histogram type
[transforms.status_metric_h]
inputs = ["logs_with_ts"]
type = "log_to_metric"
[[transforms.status_metric_h.metrics]]
field = "_const_one"
increment_by_value = true
name = "{{status}}"
namespace = "vector.cdn.status_h"
type = "histogram"
## Sink to the DataDog
[sinks.datadog_sink]
api_key = "${DD_API_KEY}"
default_namespace = "vector"
inputs = [
"status_metric",
"status_metric_h"
]
type = "datadog_metrics"
[sinks.datadog_sink.buffer]
type = "memory"
when_full = "block"
I basically created a file with ~29k lines and ran a loop like:
while true; do cp file-4.txt ./ingest/; done
The result I'm seeing in the DD is this:
Thanks for the additional data @yury-sannikov! That should help with reproduction.
I've started looking at this, and I can confirm that I've reproduced the scenario you describe.
I've identified some changes that seem to bring the counter version up to a correct value when graphed in the DD UI, although I need to sort some things out there before working on a fix.
As far as the value from the histogram version, that is still a mystery to me, and I'm continuing to work on tracking down why it appears to be such an inflated value.
After some further investigation, I've identified the issues at play.
First off, the value for the histogram metric is correct. The sneaky aspect of the UI is that histograms are rolled up to a 10s time window on the Datadog side, and so in that graph, depending on the time window, you may see a discrepancy as high as 10:1, because the counter is reported every second vs. the 10s rollup for histograms.
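As a rough, illustrative calculation (the 100 events/s rate below is made up, not taken from this report):
at 100 events/s, a counter point covering 1 s is ≈ 100
the same traffic rolled into a 10 s histogram bucket is ≈ 100 × 10 = 1,000
so the apparent ratio between the two graphs can approach 1,000 / 100 = 10:1
Summing the counter over the same 10 s window should bring the two series back in line.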
Secondly, and more importantly, we do have some issues on the Vector side, namely that metrics generated via log_to_metric do not have a timestamp correctly generated, and the datadog_metrics sink doesn't correctly aggregate datapoints that occur in the same second, leading to the suppressed count value you're seeing in the graphs. I've filed #13860 and #13867, respectively, to track fixing these issues.
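For readers following along, here is a minimal, hypothetical sketch of the per-second aggregation described above; it is not Vector's actual implementation, and aggregate_by_second is a made-up name. The point is that counter increments landing in the same second need to be summed into one datapoint rather than suppressing one another.

use std::collections::BTreeMap;

// Illustrative only: sum counter increments that share the same second-resolution
// timestamp into a single datapoint before submitting to the metrics API.
fn aggregate_by_second(points: &[(i64, f64)]) -> Vec<(i64, f64)> {
    let mut buckets: BTreeMap<i64, f64> = BTreeMap::new();
    for &(ts, value) in points {
        *buckets.entry(ts).or_insert(0.0) += value;
    }
    buckets.into_iter().collect()
}

fn main() {
    // Three increments in the same second plus one in the next second.
    let raw = [(1_660_000_000, 1.0), (1_660_000_000, 1.0), (1_660_000_000, 1.0), (1_660_000_001, 1.0)];
    // Prints [(1660000000, 3.0), (1660000001, 1.0)] instead of letting the last
    // point for a given second overwrite the earlier ones.
    println!("{:?}", aggregate_by_second(&raw));
}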
Ultimately, the above graph is caused entirely by the two issues I mentioned, but you may still see what appear to be discrepancies if the time window is small enough, even after we fix those two issues.
Closing since the two issues mentioned above were closed. @silversupreme curious if you still see this issue in the latest nightly. Feel free to reopen if so.
Yup, I see metrics improved, though it broke some other things. Thank you!
Thanks for verifying! What else did it break? Do you mind opening a new issue for it?
-> https://github.com/vectordotdev/vector/issues/14232 Agree that it's worth a separate issue.