Question: Monitoring & Telemetry
Question
(I suspect this should rather be a discussion; please mark it as such if so.) With the growth of Lemmy, monitoring & telemetry is becoming a higher priority.
- What are the current methods for monitoring?
- What do we plan to implement in the future?
From what I can see, a modified version of tracing-actix-web is currently used, which can expose opentelemetry traces. This does not, however, include any metrics (and opentelemetry metrics still seem to be in the experimental phase).
I'm opening this to start a discussion and to gather input on whether we indeed want to start including metrics.
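For context, here is a minimal sketch (not Lemmy's actual setup) of how tracing-actix-web is typically wired into an actix-web app so each request produces a tracing span; the route, address, and the omitted subscriber/exporter setup are assumptions, since the Rust OTEL exporter APIs are still moving:

```rust
// Minimal sketch, not Lemmy's actual code: wrap an actix-web App in
// tracing-actix-web's TracingLogger so every HTTP request gets a tracing span
// that an OpenTelemetry layer (e.g. tracing-opentelemetry) could export.
use actix_web::{web, App, HttpServer, Responder};
use tracing_actix_web::TracingLogger;

async fn healthz() -> impl Responder {
    "ok"
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // A tracing subscriber (tracing_subscriber + tracing_opentelemetry) would be
    // installed here; omitted because the Rust OTEL exporter APIs are still changing.
    HttpServer::new(|| {
        App::new()
            .wrap(TracingLogger::default()) // one span per request
            .route("/healthz", web::get().to(healthz))
    })
    .bind(("127.0.0.1", 8080))?
    .run()
    .await
}
```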
At the lowest level, you have PostgreSQL holding all the data, and I've been pleading with Lemmy server operators to share even the most basic metrics of what Lemmy is doing internally: https://lemmy.ml/post/1361757
The major Lemmy instances have all been throwing nginx 500 errors (and 404 errors) at a rapid rate for weeks, and I have not seen anyone sharing reports on their error rates, logs, etc.
#3280 is related. It appears that someone added opentelemetry to the code base long ago, but it is only usable when building with the feature enabled. I am not strong enough in Rust to give much more insight; I only looked at it this week, just enough to make that PR. The OTel libraries for Rust are still in beta for traces and alpha for metrics. Traces are pretty useful on their own out of the gate; metrics would be amazing, but being alpha they might be more of a pain with everything else going on.
There aren't many other open-source tools for this; they are being deprecated in favor of opentelemetry. https://opentelemetry.io/docs/instrumentation/rust/ https://github.com/open-telemetry/opentelemetry-rust/tree/main/examples
I think the main thing is buy-in from everyone involved. Since this is in Rust there aren't a huge number of options, so it'll have to go into the code base. I could be wrong, and I'm extremely opinionated, but in my experience so far opentelemetry has been a real game changer.
@EStork09 I was already leaning towards using prometheus for metrics, and obviously opentelemetry for tracing, since both are good and stable as far as I know, and most other metrics systems can consume from a prometheus exporter.
But I'd definitely want more input from others before I look at implementing anything.
Hey @Rulasmur, I agree. The opentelemetry components are already in the project and implemented; they are just currently disabled by default. But a prometheus metrics endpoint would be great while we wait for Rust OTEL metrics to become more stable.
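As a rough illustration (a hedged sketch assuming the `prometheus` crate and actix-web, not any existing Lemmy code), a bare /metrics handler would just render whatever is registered in a shared Registry in the text exposition format Prometheus scrapes:

```rust
// Hypothetical sketch: a /metrics endpoint backed by the `prometheus` crate.
// The Registry would be shared application state (e.g. stored in LemmyContext).
use actix_web::{get, web, HttpResponse};
use prometheus::{Encoder, Registry, TextEncoder};

#[get("/metrics")]
async fn metrics(registry: web::Data<Registry>) -> HttpResponse {
    let mut buf = Vec::new();
    // gather() snapshots every metric currently registered on this Registry.
    if TextEncoder::new().encode(&registry.gather(), &mut buf).is_err() {
        return HttpResponse::InternalServerError().finish();
    }
    HttpResponse::Ok()
        .content_type("text/plain; version=0.0.4")
        .body(buf)
}
```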
Prometheus would be great.
I'm currently load testing Lemmy locally, and I may have found a connection leak in the database connection pool (50 connections are configured, but I've seen over 70...). Without more insight into the internal state of the Lemmy server, there is no good way to tell.
In addition, prometheus could be used to expose other statistics too, like the number of users, number of posts, and so on. This can be very helpful for detecting things like bot activity.
I would have implemented prometheus metrics already, but my Rust is too bad for that ;)
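A hedged sketch of what such statistics could look like with the `prometheus` crate; the metric names, the gauge set, and how they get refreshed are made up for illustration, not existing Lemmy code:

```rust
// Hypothetical sketch: app-level gauges for user/post counts and the DB pool state.
use prometheus::{IntGauge, Registry};

pub struct AppGauges {
    pub users_total: IntGauge,
    pub posts_total: IntGauge,
    pub db_pool_connections: IntGauge,
}

impl AppGauges {
    pub fn register(registry: &Registry) -> prometheus::Result<Self> {
        let users_total = IntGauge::new("lemmy_users_total", "Number of local users")?;
        let posts_total = IntGauge::new("lemmy_posts_total", "Number of posts")?;
        let db_pool_connections =
            IntGauge::new("lemmy_db_pool_connections", "Open DB pool connections")?;
        registry.register(Box::new(users_total.clone()))?;
        registry.register(Box::new(posts_total.clone()))?;
        registry.register(Box::new(db_pool_connections.clone()))?;
        Ok(Self { users_total, posts_total, db_pool_connections })
    }
}

// These would be refreshed periodically (or on scrape) from the database and the
// connection pool, e.g.:
//   gauges.users_total.set(user_count_from_db);
//   gauges.db_pool_connections.set(current_pool_size as i64);
```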
I've hacked in a Prometheus endpoint locally and used actix-web-prom to get metrics on the API endpoints. I put the Prometheus registry (the thing that collects metrics) into the LemmyContext so that it would be available for adding metrics on the DB connections, etc. I also added a new optional prometheus block to the config. I'll clean it up and open an MR if there's interest.
Here's some sample output:
# TYPE lemmy_api_http_requests_duration_seconds histogram
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="0.005"} 11
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="0.01"} 12
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="0.025"} 12
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="0.05"} 12
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="0.1"} 12
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="0.25"} 12
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="0.5"} 12
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="1"} 12
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="2.5"} 12
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="5"} 12
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="10"} 12
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200",le="+Inf"} 12
lemmy_api_http_requests_duration_seconds_sum{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200"} 0.030748336999999997
lemmy_api_http_requests_duration_seconds_count{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200"} 12
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="0.005"} 0
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="0.01"} 0
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="0.025"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="0.05"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="0.1"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="0.25"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="0.5"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="1"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="2.5"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="5"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="10"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="GET",status="200",le="+Inf"} 1
lemmy_api_http_requests_duration_seconds_sum{endpoint="/api/v3/community",method="GET",status="200"} 0.013951478
lemmy_api_http_requests_duration_seconds_count{endpoint="/api/v3/community",method="GET",status="200"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="0.005"} 0
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="0.01"} 0
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="0.025"} 0
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="0.05"} 0
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="0.1"} 0
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="0.25"} 0
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="0.5"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="1"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="2.5"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="5"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="10"} 1
lemmy_api_http_requests_duration_seconds_bucket{endpoint="/api/v3/community",method="POST",status="200",le="+Inf"} 1
lemmy_api_http_requests_duration_seconds_sum{endpoint="/api/v3/community",method="POST",status="200"} 0.25018257
lemmy_api_http_requests_duration_seconds_count{endpoint="/api/v3/community",method="POST",status="200"} 1
...
# HELP lemmy_api_http_requests_total Total number of HTTP requests
# TYPE lemmy_api_http_requests_total counter
lemmy_api_http_requests_total{endpoint="/api/v3/admin/registration_application/count",method="GET",status="200"} 12
lemmy_api_http_requests_total{endpoint="/api/v3/community",method="GET",status="200"} 1
lemmy_api_http_requests_total{endpoint="/api/v3/community",method="POST",status="200"} 1
lemmy_api_http_requests_total{endpoint="/api/v3/community/list",method="GET",status="200"} 4
lemmy_api_http_requests_total{endpoint="/api/v3/post/list",method="GET",status="200"} 2
lemmy_api_http_requests_total{endpoint="/api/v3/site",method="GET",status="200"} 1
lemmy_api_http_requests_total{endpoint="/api/v3/user/report_count",method="GET",status="200"} 12
lemmy_api_http_requests_total{endpoint="/api/v3/user/unread_count",method="GET",status="200"} 12
> I'll clean it up and open an MR if there's interest.
Yes! Even if it isn't accepted into the main project, please share how we can hack it into the codebase for our own builds. Thank you. P.S. There is a hard-coded 10-second reqwest timeout in Lemmy; is there something to look for when these are hit?
I'm refactoring it in preparation for an MR, trying to make the prometheus stuff a build feature instead of always building/running it. I know enough Rust to get into trouble but not enough to be super productive, and I'm running into a bunch of issues, so this may take some time.
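For what it's worth, a hedged sketch of one way to gate it; the feature name `prometheus-metrics` and the `configure_metrics` function are made up for illustration, not necessarily what the MR will use:

```rust
// Hypothetical sketch: compile the metrics endpoint only when a Cargo feature
// (here called "prometheus-metrics") is enabled, e.g. in Cargo.toml:
//
//   [features]
//   prometheus-metrics = ["prometheus", "actix-web-prom"]
//
// where `prometheus` and `actix-web-prom` are optional dependencies.
use actix_web::web::ServiceConfig;

#[cfg(feature = "prometheus-metrics")]
pub fn configure_metrics(cfg: &mut ServiceConfig) {
    // register the /metrics route and middleware here
}

#[cfg(not(feature = "prometheus-metrics"))]
pub fn configure_metrics(_cfg: &mut ServiceConfig) {
    // no-op when the feature is disabled, so callers don't need cfg checks
}
```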