Epic: Add ability to export product and performance metrics when running Flagsmith self-hosted

Open matthewelwell opened this issue 11 months ago • 5 comments

Specifically, we want people to be able to export metrics into Prometheus to monitor the performance of their Flagsmith installation, but there may also be product metrics that it would be useful to export as well.

matthewelwell avatar Jan 30 '25 16:01 matthewelwell

Hey, is there an alternative implementation for this?

shashank-sarvam avatar Feb 27 '25 06:02 shashank-sarvam

@shashank-sarvam Flagsmith currently has no support for any Prometheus metrics, if that's what you are asking.

rolodato avatar Feb 27 '25 11:02 rolodato

As of 2.170.0 we now support Prometheus metrics. Documentation to follow.

matthewelwell avatar Apr 10 '25 10:04 matthewelwell

Flagsmith currently offers HTTP and task processing performance metrics in Prometheus format.

For a product metrics use case, we can't use Prometheus labels directly, as we're limited to low-cardinality attributes. We're currently looking into supporting OpenMetrics as a way to provide high-cardinality attributes, such as user/project/environment identifiers, as exemplars (1, 2). The OpenMetrics format is supported by a variety of scrapers, including the OpenTelemetry Prometheus receiver and Influx's Telegraf.

Why not use OpenTelemetry metrics directly? We believe sticking to native Prometheus keeps our self-hosted support more flexible and lean. We are still compatible with OTel when we decide to implement it, as span/trace correlation can be implemented via exemplars.

khvn26 avatar Apr 25 '25 19:04 khvn26

While we have implemented a Prometheus integration in the product, this epic should be kept open to track the larger solution around how we get product metrics from our self-hosted customer base to help with our product backlog development.

matthewelwell avatar Jun 17 '25 14:06 matthewelwell

Is it already possible to inspect the latency of the Flagsmith API in terms of where the latency occurs? It seems that without OpenTelemetry support, the only way to find out whether slow responses come from the DB, disk I/O, instance networking, etc. is to correlate the order of tcpdump packets traveling back and forth to the DB instance (when on separate instances) with the node's Linux metrics. OpenTelemetry would really help to spot the problematic span immediately when networking or an external service like the DB is involved 🤔 +1 to this feature 😊

dima-sh-papaya avatar Jul 29 '25 09:07 dima-sh-papaya

@dima-sh-papaya we use Sentry for our SaaS deployment of Flagsmith. While it's not documented, you can use Sentry for this level of detail if you want it by populating the settings you can see here.

matthewelwell avatar Jul 29 '25 09:07 matthewelwell

Awesome, that might help. We see random latency on /health/readiness which coincides with /api/v1/identities latency (as if the whole Flagsmith API gets slow, so any request arriving at that time has a long response), but we cannot see any compute resource pressure in either the k8s nodes or the RDS database, so we're a bit lost as to how to profile or investigate this random latency. For some reason the "liveness" probe stays within 5ms all the time, but the "readiness" probe has the occasional issue. What does the readiness probe do? Does it query something in the DB?

dima-sh-papaya avatar Jul 30 '25 11:07 dima-sh-papaya

Hi @dima-sh-papaya , apologies for the delayed response. Yes - the readiness check will communicate with the database to verify connectivity. Did you manage to identify anything with Sentry?

matthewelwell avatar Sep 09 '25 10:09 matthewelwell
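As a rough illustration of why the two probes can behave so differently, the sketch below contrasts a liveness check that touches nothing external with a readiness check that needs a database round trip. This is not Flagsmith's code: it uses the standard-library `sqlite3` module as a stand-in for the real database, and the function names are hypothetical.

```python
import sqlite3
import time
from typing import Callable, Tuple


def liveness() -> bool:
    """Liveness: the process can answer at all; no external dependencies."""
    return True


def readiness(connect: Callable[[], sqlite3.Connection]) -> bool:
    """Readiness: additionally requires a successful database round trip."""
    try:
        conn = connect()
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        return False


def timed(check: Callable[[], bool]) -> Tuple[bool, float]:
    """Run a check and return its result with the elapsed time in seconds."""
    start = time.perf_counter()
    ok = check()
    return ok, time.perf_counter() - start


def in_memory_db() -> sqlite3.Connection:
    # Stand-in for the real database connection (assumption for this sketch).
    return sqlite3.connect(":memory:")


live_ok, live_seconds = timed(liveness)
ready_ok, ready_seconds = timed(lambda: readiness(in_memory_db))
print(f"liveness ok={live_ok} took {live_seconds:.6f}s")
print(f"readiness ok={ready_ok} took {ready_seconds:.6f}s")
```

Under this model, anything that slows down acquiring a connection or running the query (a connection pooler, network, or the database itself) shows up in the readiness timing but never in the liveness timing.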

Just for visibility, we believe that your comment in the other issue is related to this.

khvn26 avatar Sep 10 '25 11:09 khvn26

Hi, we figured out that the Django health check and overall connection handling are incompatible with PgBouncer running in its default mode: something about session mode and persistent connections that it loses track of, so it queries the DB tables again, etc. Sorry, I don't remember many details because it was a long time ago, but turning off PgBouncer solved it for us :)

dima-sh-papaya avatar Sep 11 '25 07:09 dima-sh-papaya