veneur
veneur copied to clipboard
Move runtime metrics reporting to a separate timer routine.
Summary
Move runtime metric reporting to a separate ticker and goroutine.
Motivation
At present our emission of runtime metrics — gc, heap etc — is tied directly to the flusher. This seems like a bad mix of concerns. It also means that our runtime metrics are tied directly to our flush "tick". In a future where this is adjustable or non-existent we need a plan!
Notes
I also added the current goroutine count, which seems like good information in case we leak goroutines.
Test plan
Existing tests. We don't test these now and this doesn't change that.
r? @stripe/observability
Gerald Rule: Copy Observability on Veneur and Unilog pull requests
cc @stripe/observability cc @stripe/observability-stripe
r? @stripe/observability
I don't think I have the state + focus to do this during /dev/start, sorry!
r? @stripe/observability
Hey @aditya-stripe, this one is ready. Note that the intervals for this and the flush are ~synchronized and it should do what you suggested in a separate PR with regards to being at a high water mark. I would argue that it's important it be separated so as not to just give us the skewed, flush-time numbers. Having a separate ticker without the "bucketing" behavior should give us something that has a bit of periodicity to it which I think is an improvement.
At the moment, we're collecting these metrics immediately after each flush is initiated. This may or may not actually be the exact peak (or nadir) for metrics like memory usage, because of timing jitter as goroutines are scheduled. Though it is likely to be close, and at the very least, highly-correlated with the higher end of the distribution.
In reality, what we want is a measurement of the full range of the distribution, because these will change over the span of the 10-second flush cycle. We'll want to choose something that's less than 5 seconds, like 3s or 4s.
And if we're emitting these metrics multiple times per flush cycle, we'll also want to change them to be histograms, not gauges.
So, lgtm for now, but let's file a ticket to update this as well.