Epic: Add ability to export product and performance metrics when running Flagsmith self-hosted
Specifically, we want people to be able to export metrics into Prometheus to monitor the performance of their Flagsmith installation, but there may also be product metrics that would be useful to export.
Hey, is there an alternative implementation for this?
@shashank-sarvam Flagsmith currently has no support for any Prometheus metrics, if that's what you are asking.
As of 2.170.0 we now support Prometheus metrics. Documentation to follow.
Flagsmith currently offers HTTP and task processing performance metrics in Prometheus format.
For product metric use cases, we can't use Prometheus labels directly, as labels are limited to low-cardinality attributes. We're currently looking into supporting OpenMetrics as a way to provide high-cardinality attributes such as user/project/environment identifiers as exemplars (1, 2). The OpenMetrics format is supported by a variety of scrapers, including the OpenTelemetry Prometheus receiver and InfluxData's Telegraf.
Why not use OpenTelemetry metrics directly? We believe sticking to native Prometheus keeps our self-hosted support more flexible and lean. We are still compatible with OTel when we decide to implement it, as span/trace correlation can be implemented via exemplars.
While we have implemented a Prometheus integration in the product, this epic should be kept open to track the larger solution around how we get product metrics from our self-hosted customer base to help with our product backlog development.
Is it already possible to inspect the latency of the Flagsmith API, in terms of where the latency occurs? It seems that without OpenTelemetry support, the only way to find out whether it's getting slow responses from the DB, or getting slow somewhere on disk IO, instance networking, etc., is by correlating tcpdump packet ordering of data travelling back and forth to the DB instance (when on separate instances) with the node's Linux metrics. OpenTelemetry would really help to spot the problematic span immediately when networking or an external service like the DB is involved 🤔 +1 to this feature 😊
@dima-sh-papaya we use Sentry for our SaaS deployment of Flagsmith. While it's not documented, you can use Sentry for this level of detail if you want it by populating the settings you can see here.
Awesome, that might help. We see random latency on /health/readiness which coincides with /api/v1/identities latency (as if the whole Flagsmith API gets slow, so any request arriving at that time gets a long response), but we can't see any compute resource pressure on the k8s nodes or the RDS database, so we're a bit lost on how to profile or investigate this random latency. For some reason the liveness probe responds within 5ms all the time, but the readiness probe has the occasional issue. What does the readiness probe do, query something in the DB?
Hi @dima-sh-papaya , apologies for the delayed response. Yes - the readiness check will communicate with the database to verify connectivity. Did you manage to identify anything with Sentry?
Just for visibility, we believe that your comment in the other issue is related to this.
Hi, we figured out that the Django health check, and the connection handling overall, is incompatible with PgBouncer running in its default mode; something about session mode and persistent connections that it loses track of, so it queries the DB tables again, etc. Sorry, I don't remember many details because it was a long time ago, but turning off PgBouncer solved it for us :)
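For anyone who wants to keep PgBouncer in front of Django rather than turning it off, the usual mitigation is to let PgBouncer own the connection lifecycle: disable Django's persistent connections and server-side cursors. The settings fragment below is a hedged sketch of that pattern, with hypothetical host/database values, not Flagsmith's actual configuration.

```python
# Sketch of Django database settings that play more nicely with
# PgBouncer pooling; host/name/port values are illustrative.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "flagsmith",
        "HOST": "pgbouncer",  # hypothetical pooler host
        "PORT": 6432,
        # Django's persistent connections fight with PgBouncer's own
        # pooling; a max age of 0 closes the connection per request.
        "CONN_MAX_AGE": 0,
        # Server-side cursors can outlive the pooled server connection
        # under transaction pooling, so disable them.
        "DISABLE_SERVER_SIDE_CURSORS": True,
    }
}
```

`CONN_MAX_AGE` and `DISABLE_SERVER_SIDE_CURSORS` are standard Django settings; whether they fully resolve the health-check behaviour described above depends on which PgBouncer pooling mode is in use.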