veneur
veneur copied to clipboard
Rewrite metrics ingestion path and integrate with just the central veneur.
To make this a bit easier to review, check out the commits tab. Things are roughly broken out step-by-step.
Review guide
Sorry this is like 2000 lines. Most of it is testing, altho I get that that doesn't make it easier to review. FWIW, I asked @asf-stripe to look at this halfway through and got the thumbs up to move forward.
If you're not sure how to get started, I suggest going in this order:
- Review the main business logic: Start with aggingestor.go, go to aggworker.go, go to sinkflusher.go.
- Review the mixed histogram implementation if you're familiar with this: go to mixedhisto.go
- Review the test framework and maybe skim the test cases (this is most of the lines).
Summary
This PR does the following:
- Point the gRPC import server at the new code.
- Build out a metricsingester that works for the way central veneurs currently work: metrics aggregation is sharded across several workers and a flush function runs every so often to collect metrics and send them to sinks.
In other words, this is a non-breaking change. But it will allow local veneurs to forward all metrics to the central veneur.
Architecture
The new metrics ingester does everything the old one does but a bit cleaned up.
-
AggregatingIngestor is the entry point (aggingestor.go). Through the Ingest and Merge methods, the AggregatingIngestor shards metrics across a series of workers. At a set interval, it flushes the aggregated metrics out of the workers and feeds them into a flusher (see below).
-
AggregatingWorkers (aggworker.go) take in metrics and feed them into samplers, doing all the work of metrics aggregation. They work as separate goroutines to facilitate parallelism.
-
sinkflusher.go contains the code to flush metrics to sinks. Given a set of samplers, the sinkflusher generates raw metrics and sinks them to sinks.
Testing
I wrote a testing framework that allows us to express tests as table tests. So while there's only a few top level tests, we actually capture about 20 different test cases in TestE2EFlushingIngester, TestMixedHistoMerge, and TestMixedHistoSample.
We could use additional coverage at the ingestor level so that we're not just testing import-path cases or depend on the import path for testing. This will come in a later PR (unless y'all think we should have it now).
We also have the benefit of the old e2e test coverage because those tests should be using the new path as well.
Other things to note
Mixed histogram stuff
-
I wrote a new sampler, MixedHistogramSampler, to capture the mixed histo behavior of "percentiles are aggregated globally, min max count are emitted with host tags". There's a bunch of testing around this.
-
In order for things not to break as we roll out, it's necessary for us to not change the behavior of MixedScope histograms. Therefore, the behavior for the central veneur is to not emit mixed min/max/counts unless a metric is imported with a new scope, MixedGlobal.
Design doc
https://docs.google.com/document/d/1JXHLj0VI1nIiRcNLu3MXvy_qt2V38yvjj6ejW5oa4dA/edit#
r? @aditya-stripe @ChimeraCoder
Motivation
(stripe internal) https://docs.google.com/document/d/1C1IBh7AbWrHRkAJBbUHM0jx0XKg2lo3ISmffUrzaWT4/edit#
@asf-stripe feedback make sure to preserve e2e local -> central forwarding test.
Gerald Rule: Copy Observability on Veneur, Unilog, Falconer pull requests
cc @stripe/observability cc @stripe/observability-stripe
@clin-stripe did you intend to assign me to this PR? What can I do to help here?
@aubrey-stripe You mentioned at the falalfel that you'd be interested in reviewing so I thought I'd give you a heads up!
Missed this comment when we went over this together.
I see your work is pulling the Hostname property out from the metrics' tags; that's cool, but I am wondering what we're going to do with the other tags that we currently use to identify metrics from particular subsets of hosts (e.g. our host_type and host_az tag); or let's say I'm confused how the Hostname becomes more important than the others, and how that will relate to what we're going to do when we switch the default on Mixed scope Histograms (I would love for us to not report host_az and host_type on the global histograms, since those are always going to be lies, and at least the AZ tag adds cardinality that we really don't need).
Hostname is special only because it's used to make grouping decisions. In the special case of MixedHistogram, we need to make sure that hostname is not part of the grouping decision. That is, two mixedhistograms with the same name and tags need to be merged together regardless of their hostname. host_set and host_type, by contrast, are tags that we currently group histograms by.
Just to clarify, we're not planning on dropping mixedhistogram behavior in this change. It would be super nice to drop it of course ... but I think we need to do some work to migrate use cases for this first. Same with dropping host_az.
y no merge y'all
:'(