Histograms and counters do not report the same summarized data
Here is the configuration visualization:
The idea is that status_metric_h reports a histogram-type metric from the same source as status_metric_c, which in turn reports a counter-type metric.
The histogram metric is reported as vector.cdn.status_h.ok and the counter as vector.cdn.status_c.ok (besides ok, we also use redirect, notfound, other2xx, other3xx, other4xx, other5xx).
I expect both metrics to have the same sum values in Datadog. However, I see a big difference in Datadog:
I would expect the two charts to be equivalent to each other, but no luck.
Could you please help us understand why we see such a difference?
Relevant transform configs:
[transforms.status_metric_c]
inputs = ["logs_with_ts"]
type = "log_to_metric"
[[transforms.status_metric_c.metrics]]
field = "_const_one"
increment_by_value = true
name = "{{status}}"
namespace = "vector.cdn.status_c"
type = "counter"
# Status metrics with histogram type
[transforms.status_metric_h]
inputs = ["logs_with_ts"]
type = "log_to_metric"
[[transforms.status_metric_h.metrics]]
field = "_const_one"
increment_by_value = true
name = "{{status}}"
namespace = "vector.cdn.status_h"
type = "histogram"
Hi folks. I would like to provide a bit more information. I was able to reproduce the same issue with random data on my personal laptop. Here is what I did. The VRL file looks like this:
# generate fake timestamp
.timestamp = to_int(now()) - parse_int!(slice!(md5(random_bytes(64)), start: 0, end: 3), base:16)
# generate fake status code
.status_code = parse_int!(slice!(md5(random_bytes(64)), start: 0, end: 1), base:16) % 4
##
## Define some constants for the log-to-metric
##
._const_one = 1.0
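# map the fake status code to a status label (interpolated as {{status}} by the log_to_metric transforms)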
if .status_code == 0 {
.status = "ok"
} else if .status_code == 1 {
.status = "redirect"
} else if .status_code == 2 {
.status = "notfound"
} else {
.status = "other"
}
The TOML file is:
acknowledgements.enabled = true
data_dir = "/tmp"
[sources.file_source]
ignore_checkpoints = true
include = ["./ingest/*"]
read_from = "beginning"
remove_after_secs = 0
type = "file"
# parse the input source
[transforms.logparser]
drop_on_error = true
file = "./vrls/cdn_fake.vrl"
inputs = ["file_source"]
reroute_dropped = true
type = "remap"
# Parse the timestamp from the logs
[transforms.logs_with_ts]
inputs = ["logparser"]
type = "coercer"
types.timestamp = "timestamp|%s"
# Status metrics with counter type
[transforms.status_metric]
inputs = ["logs_with_ts"]
type = "log_to_metric"
[[transforms.status_metric.metrics]]
field = "_const_one"
increment_by_value = true
name = "{{status}}"
namespace = "vector.cdn.status_c"
type = "counter"
# Status metrics with histogram type
[transforms.status_metric_h]
inputs = ["logs_with_ts"]
type = "log_to_metric"
[[transforms.status_metric_h.metrics]]
field = "_const_one"
increment_by_value = true
name = "{{status}}"
namespace = "vector.cdn.status_h"
type = "histogram"
## Sink to the DataDog
[sinks.datadog_sink]
api_key = "${DD_API_KEY}"
default_namespace = "vector"
inputs = [
"status_metric",
"status_metric_h"
]
type = "datadog_metrics"
[sinks.datadog_sink.buffer]
type = "memory"
when_full = "block"
I basically created a file with ~29k lines and ran a loop like:
while true; do cp file-4.txt ./ingest/; done
The result I'm seeing in the DD is this:
Thanks for the additional data @yury-sannikov! That should help with reproduction.
I've started looking at this, and I can confirm that I've reproduced the scenario you describe.
I've identified some changes that seem to bring the counter version up to a correct value when graphed in the DD UI, although I need to sort some things out there before working on a fix.
As far as the value from the histogram version, that is still a mystery to me, and I'm continuing to work on tracking down why it appears to be such an inflated value.
After some further investigation, I've identified the issues at play.
First off, the value for the histogram metric is correct. The sneaky aspect of the UI is that histograms are rolled up to a 10s time window on the Datadog side, and so in that graph, depending on the time window, you may see a discrepancy as high as 10:1, because the counter is reported every second vs. the 10s rollup for histograms.
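As a rough, illustrative calculation (the 100 events/s rate below is made up, not taken from this report):
at 100 events/s, a counter point covering 1 s is ≈ 100
the same traffic rolled into a 10 s histogram bucket is ≈ 100 × 10 = 1,000
so the apparent ratio between the two graphs can approach 1,000 / 100 = 10:1
Summing the counter over the same 10 s window should bring the two series back in line.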
Secondly, and more importantly, we do have some issues on the Vector side, namely that metrics generated via log_to_metric do not have a timestamp correctly generated, and the datadog_metrics sink doesn't correctly aggregate datapoints that occur in the same second, leading to the suppressed count value you're seeing in the graphs. I've filed #13860 and #13867, respectively, to track fixing these issues.
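For readers following along, here is a minimal, hypothetical sketch of the per-second aggregation described above; it is not Vector's actual implementation, and aggregate_by_second is a made-up name. The point is that counter increments landing in the same second need to be summed into one datapoint rather than suppressing one another.

use std::collections::BTreeMap;

// Illustrative only: sum counter increments that share the same second-resolution
// timestamp into a single datapoint before submitting to the metrics API.
fn aggregate_by_second(points: &[(i64, f64)]) -> Vec<(i64, f64)> {
    let mut buckets: BTreeMap<i64, f64> = BTreeMap::new();
    for &(ts, value) in points {
        *buckets.entry(ts).or_insert(0.0) += value;
    }
    buckets.into_iter().collect()
}

fn main() {
    // Three increments in the same second plus one in the next second.
    let raw = [(1_660_000_000, 1.0), (1_660_000_000, 1.0), (1_660_000_000, 1.0), (1_660_000_001, 1.0)];
    // Prints [(1660000000, 3.0), (1660000001, 1.0)] instead of letting the last
    // point for a given second overwrite the earlier ones.
    println!("{:?}", aggregate_by_second(&raw));
}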
Ultimately, the above graph is caused entirely by the two issues I mentioned, but you may still see what appear to be discrepancies if the time window is small enough, even after we fix those two issues.
Closing since the two issues mentioned above were closed. @silversupreme curious if you still see this issue in the latest nightly. Feel free to reopen if so.
Yup, I see metrics improved, though it broke some other things. Thank you!
Thanks for verifying! What else did it break? Do you mind opening a new issue for it?
-> https://github.com/vectordotdev/vector/issues/14232 Agree that it's worth a separate issue.