OpenTelemetry Investigation
@gravitystorm raised an interest in the osm-website team being able to get Application Performance Monitoring (APM) style observability of the production website.
OpenTelemetry tracing is likely the best path, with suitable instrumentation.
From an Ops perspective the requirements (infrastructure, storage, services) need investigation.
The real question is probably what to use as the backend to collect the data.
We've just started investigating Grafana Alloy as an OpenTelemetry collector at work: https://github.com/grafana/alloy
No opinions yet, but it might be worth trying since you have a Grafana install already.
I'm particularly interested in error monitoring (aka exception monitoring), in addition to performance monitoring. Users manage to break things in all kinds of unexpected ways, and it would be good to share this info with the developers. A key aspect here is bundling similar errors, and not alerting on every individual error.
I don't have enough knowledge of OpenTelemetry to know whether a) error monitoring should be handled by an entirely separate system or b) if it's done through OpenTelemetry, what component(s) would do the aggregation and alerting.
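To make the "bundling similar errors" idea concrete, here is a minimal Python sketch of one common approach: derive a fingerprint from the exception type and the function names on the stack, count occurrences per fingerprint, and only signal an alert the first time a group is seen. The names `fingerprint` and `ErrorGrouper` are hypothetical and not part of OpenTelemetry or any existing tool; real error trackers use more elaborate grouping rules.

```python
import hashlib
import traceback
from collections import Counter


def fingerprint(exc: BaseException) -> str:
    """Return a short, stable identifier for 'errors like this one'."""
    frames = traceback.extract_tb(exc.__traceback__)
    # Use the exception type plus the function names on the stack.
    # The message and line numbers are left out on purpose, since they
    # often differ between otherwise identical failures.
    parts = [type(exc).__name__] + [frame.name for frame in frames]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]


class ErrorGrouper:
    """Hypothetical helper: counts errors per group, alerts once per group."""

    def __init__(self):
        self.counts = Counter()

    def record(self, exc: BaseException) -> bool:
        """Count the error; return True only the first time its group is seen."""
        key = fingerprint(exc)
        self.counts[key] += 1
        return self.counts[key] == 1
```

With this shape, a user-triggered failure that happens a thousand times produces one alert and a count of 1000, rather than a thousand alerts.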
Yes, the volume of telemetry data is probably more than we want to store in Prometheus. We probably want something that can receive it and store only recent data, or at least data that passes an "interestingness" filter (such as exceptions), and that can also produce aggregate statistics, like query times, to feed into Prometheus for charting over time.
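As a rough illustration of that split, the Python sketch below keeps full detail only for spans that are errors or slow, while every span still contributes to cheap per-operation aggregates. The `Span` shape, the `process` function and the 500 ms threshold are assumptions made for the example; in a real deployment this decision would more likely live in the collector (for instance as a tail sampling step) than in hand-written code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """Simplified, hypothetical stand-in for a finished trace span."""
    name: str
    duration_ms: float
    error: bool = False


def is_interesting(span: Span, slow_ms: float = 500.0) -> bool:
    """Keep spans that recorded an error or took at least slow_ms."""
    return span.error or span.duration_ms >= slow_ms


def process(spans, slow_ms: float = 500.0):
    """Return (spans worth storing in full, aggregate stats for all spans)."""
    kept = []
    stats = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for span in spans:
        # Every span feeds the aggregates, whether or not it is stored.
        entry = stats[span.name]
        entry["count"] += 1
        entry["total_ms"] += span.duration_ms
        if is_interesting(span, slow_ms):
            kept.append(span)
    return kept, dict(stats)
```

The aggregates are small and bounded by the number of distinct operation names, which is the part that suits a metrics store; the kept spans are the part that needs a trace store.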
I've opened https://github.com/openstreetmap/openstreetmap-website/pull/6405 to enable support in the web site code.
As I understand it, you then need a collector, be that the basic OpenTelemetry one, Grafana Alloy or something else. You could run that centrally, but the general idea seems to be to run it locally on each machine so that it can gather data (including parsing logs if you want), batch it up and submit it.
Then you want a trace store of some sort (maybe Grafana Tempo?), which might also generate its own metrics; alternatively, the collector can generate metrics and send them to Prometheus.
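For anyone unfamiliar with why a local collector helps, the batching part can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular collector is implemented; `Batcher`, `max_items` and `max_age_s` are hypothetical names, and `send` stands in for whatever submits a batch to the backend.

```python
import time


class Batcher:
    """Buffer items locally and hand them to `send` in batches."""

    def __init__(self, send, max_items=100, max_age_s=5.0, clock=time.monotonic):
        self._send = send
        self._max_items = max_items
        self._max_age_s = max_age_s
        self._clock = clock
        self._items = []
        self._first_at = None  # time the oldest buffered item arrived

    def add(self, item):
        if not self._items:
            self._first_at = self._clock()
        self._items.append(item)
        # Flush when the batch is full or its oldest item is too old.
        # Age is only checked when a new item arrives; a real collector
        # would also flush from a background timer.
        too_old = self._clock() - self._first_at >= self._max_age_s
        if len(self._items) >= self._max_items or too_old:
            self.flush()

    def flush(self):
        """Send whatever is buffered, if anything."""
        if self._items:
            batch, self._items = self._items, []
            self._send(batch)
```

The application hands spans to something like this on the same machine and returns immediately, and the backend sees a handful of larger requests instead of one request per span.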
@iandees What backend did you use for storing the telemetry data? What frontend did you use for viewing the traces?
We haven't gotten that far yet, unfortunately.