Observe how long we block writing our own logs
What this PR does / why we need it:
In our Grafana Cloud service, we're seeing some pods logging thousands of log lines a second. Each call to Log() incurs a write syscall, which is quite expensive at that scale.
Under benchmark conditions, each of these syscalls takes ~1000ns to run - which is not much on its own, but 30k of those per second translates to 30ms of unnecessary latency, not to mention the horror that that many context switches incur in the kernel & Go scheduler. 30k logs per second is not hyperbole - that is what we're seeing in our production environments on occasion!
These intervals are extremely small, so normal CPU profiling doesn't fully capture them; the pprof CPU profiler samples at 100Hz by default, i.e. 100 times per second, or once every ~10ms. These operations happen on a ~microsecond scale, so at best the CPU profile is incomplete (10ms = 10,000μs).
This PR adds a timing metric around all calls to Log(), which measures the total amount of time we've spent blocking while writing log lines to stderr.
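To make the mechanism concrete, here is a minimal sketch of the approach (the metric name and wrapper type are illustrative, not the actual code in this PR): a logger wrapper that adds the wall-clock duration of every `Log()` call to a Prometheus counter.

```go
// Hypothetical sketch: time each Log() call and accumulate the total
// blocked time in a Prometheus counter.
package logtiming

import (
	"time"

	"github.com/go-kit/log"
	"github.com/prometheus/client_golang/prometheus"
)

// logWriteSeconds is an illustrative metric name, not the one added in this PR.
var logWriteSeconds = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "log_write_seconds_total",
	Help: "Total time spent blocked while writing log lines.",
})

func init() {
	prometheus.MustRegister(logWriteSeconds)
}

// timedLogger wraps a go-kit logger and records how long each Log() call blocks.
type timedLogger struct {
	next log.Logger
}

// WithTiming returns a Logger that measures the duration of the wrapped Log() calls.
func WithTiming(next log.Logger) log.Logger {
	return &timedLogger{next: next}
}

func (t *timedLogger) Log(keyvals ...interface{}) error {
	start := time.Now()
	err := t.next.Log(keyvals...)
	logWriteSeconds.Add(time.Since(start).Seconds())
	return err
}
```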
We're using go-kit/log under the hood, and it doesn't perform any buffering of the lines it writes. I've forked this lib and implemented a line-buffered logger which we will test once this PR rolls out so we can measure the impact.
Once we roll out the line-buffered logger, we should see a significant drop in time spent blocking, and a concomitant decrease in context switches.
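For reference, a rough sketch of the line-buffering idea (the real implementation lives in the go-kit/log fork; all names below are illustrative): writes accumulate in an in-memory buffer and are flushed on a timer or when the buffer fills, so most `Log()` calls never reach the write syscall.

```go
// Hypothetical sketch: an io.Writer that buffers log lines and flushes
// them periodically, instead of issuing one write syscall per line.
package buflog

import (
	"bufio"
	"io"
	"sync"
	"time"
)

type lineBufferedWriter struct {
	mu  sync.Mutex
	buf *bufio.Writer
}

// NewLineBufferedWriter wraps w with a size-byte buffer that is flushed
// every period (and whenever the buffer fills up).
func NewLineBufferedWriter(w io.Writer, size int, period time.Duration) io.Writer {
	lw := &lineBufferedWriter{buf: bufio.NewWriterSize(w, size)}
	go func() {
		// The ticker runs for the lifetime of the process, which is fine
		// for a global logger.
		for range time.Tick(period) {
			lw.mu.Lock()
			lw.buf.Flush()
			lw.mu.Unlock()
		}
	}()
	return lw
}

func (lw *lineBufferedWriter) Write(p []byte) (int, error) {
	lw.mu.Lock()
	defer lw.mu.Unlock()
	// bufio only touches the underlying writer when its buffer is full.
	return lw.buf.Write(p)
}
```

A go-kit logger would then sit on top of it, e.g. `log.NewLogfmtLogger(NewLineBufferedWriter(os.Stderr, 256<<10, 100*time.Millisecond))`.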
Which issue(s) this PR fixes: N/A
Special notes for your reviewer:
Checklist
- [ ] Documentation added
- [ ] Tests updated
- [ ] Is this an important fix or new feature? Add an entry in the `CHANGELOG.md`.
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
```
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
```

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

```diff
+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
```
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly search for closed issues that have a stale label, sorted by thumbs-up.
We may also:
- Mark issues as `revivable` if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a `keepalive` label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond to, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.
Closing this, since https://github.com/grafana/loki/pull/6954 is merged.
It's incredibly difficult to observe the performance impact of thousands of short events per second, but the line-buffered logger is at least more mechanically sympathetic to the way the Go scheduler works. We will test it out for a couple of weeks and see if there is any measurable impact.