Consider adopting tracing with OpenTelemetry

Open cartermp opened this issue 3 years ago • 1 comments

Pitch

Today it appears instrumentation for Mastodon uses statsd. That's not a bad thing, but aggregate metrics make it difficult to dig into the root cause behind a slowdown. For that, the best practice to adopt is tracing (which works great on monoliths too!). OpenTelemetry is the industry standard framework for tracing and correlating different signals. Especially as Mastodon grows and evolves with an influx of users from the other site, getting a good understanding of what leads to a slowdown is critical to resolving those problems quickly.

To see this in action, there's a branch by @robbkidd adding automatic instrumentation here: https://github.com/mastodon/mastodon/compare/main...robbkidd:mastodon:rub-some-otel-on-it

And a server owner trying it out for themselves here: https://zomglol.wtf/@jamie/109317703529088329

As you can see, automatic instrumentation lights up a bunch of things by default. No need for super duper fancy distributed computing blah blah blah -- just initialize a Ruby library and get good data out of it.
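For a sense of scale, here is a minimal sketch of what "initialize a Ruby library" can look like with the OpenTelemetry Ruby SDK. It assumes the `opentelemetry-sdk`, `opentelemetry-exporter-otlp`, and `opentelemetry-instrumentation-all` gems are in the Gemfile. The initializer path and the `'mastodon'` service name are illustrative choices, and this is not necessarily how the linked branch does it.

```ruby
# config/initializers/opentelemetry.rb  (hypothetical location)
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  # Name under which traces show up in the backend (assumed value).
  c.service_name = 'mastodon'

  # Enable every installed auto-instrumentation library
  # (Rails, database clients, HTTP clients, and so on).
  c.use_all
end
```

With the OTLP exporter, the destination is usually set through the standard `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable rather than in code.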

Motivation

A big reason to add tracing is being able to dig into slow behavior to understand why something is slow and what the full path of that slowness looked like. And as Mastodon grows, the number of strange reasons why something is slow will grow, making it much more difficult to divine the reason why just from some aggregate metrics or uncorrelated logs (assuming someone had the foresight to add them in the first place).

OpenTelemetry will generate traces for you that start at an HTTP request, follow the database calls it makes, and end when the response gets back to the caller. This means you don't need to know what to "log" in advance - it's just tracked for you. You can also create manual spans in a trace that add additional context from within the app, like routines where important data processing is done.
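To make the shape of a trace concrete, here is a toy model in plain Ruby. It is not the OpenTelemetry API. `ToyTracer`, `Span`, and `print_tree` are made up for illustration, and the request, query, and `build_home_feed` names are hypothetical. It only shows how a root span for a request contains child spans for a database call and a manually created span, each with its own timing and attributes.

```ruby
# Toy model of a trace: NOT the OpenTelemetry API, just the data shape.
Span = Struct.new(:name, :attributes, :duration_ms, :children)

class ToyTracer
  attr_reader :roots

  def initialize
    @roots = []     # top-level spans (one per request)
    @current = nil  # span that is currently open, if any
  end

  # Opens a span, runs the block inside it, then records the duration.
  # Spans opened inside the block become children of this span.
  def in_span(name, attributes = {})
    span = Span.new(name, attributes.dup, nil, [])
    (@current ? @current.children : @roots) << span
    previous = @current
    @current = span
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    begin
      yield span
    ensure
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      span.duration_ms = elapsed * 1000.0
      @current = previous
    end
  end
end

# Prints a span and its children as an indented tree.
def print_tree(span, depth = 0)
  puts format('%s%s (%.1f ms) %s', '  ' * depth, span.name, span.duration_ms, span.attributes)
  span.children.each { |child| print_tree(child, depth + 1) }
end

tracer = ToyTracer.new

# Root span: what automatic instrumentation would open for an HTTP request.
tracer.in_span('GET /api/v1/timelines/home', { 'http.method' => 'GET' }) do
  # Child span: what automatic instrumentation would open for a query.
  tracer.in_span('SELECT statuses', { 'db.system' => 'postgresql' }) { sleep 0.01 }

  # Manual span: extra context added from within the app.
  tracer.in_span('build_home_feed') do |span|
    span.attributes['feed.size'] = 20
  end
end

tracer.roots.each { |root| print_tree(root) }
```

In the real OpenTelemetry Ruby API the manual span is created in much the same way, with `tracer.in_span('name') { |span| span.set_attribute('key', value) }`. The library takes care of parent/child bookkeeping and export.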

When you do this with OpenTelemetry, you can correlate with other data, preprocess any data however you wish, and export anywhere (OSS tools like Jaeger/Grafana, and paid tools like Splunk/Honeycomb). If a particular tool doesn't work out for you, switching to another takes a few minutes and then that's it -- no need to swap SDKs or install proprietary agents.

cartermp avatar Nov 10 '22 15:11 cartermp

OpenTelemetry would be nice; as mentioned, it's quickly becoming a widely adopted standard.

ineffyble avatar Nov 10 '22 15:11 ineffyble