integrations-core icon indicating copy to clipboard operation
integrations-core copied to clipboard

[proxysql] Generate metric for all pairs of hostgroup/status (backend)

Open frivoire opened this issue 11 months ago • 7 comments

What does this PR do?

Ensure that the count metric is always generated, even when 0 backend exists in that status & hostgroup at the moment.

Motivation

Today, when a hostgroup has no online backend server for example, the metric with tag status=ONLINE+hostgroup=42 is not generated at all (no value for sometime, until a backend is online). But having no online backend is an important piece of information to monitor a proxysql. Example: a dashboard showing the nb of online backend servers per hostgroup is not giving a good observability if not showing the 0 value, because it has a different meaning than no value :(

NB: the list of hostgroups is dynamic (because it depends on config, thus different for each deployment), and the list of status is hardcoded (because fixed in proxysql, cf doc)

Tests

I've only tested the SQL query alone on a real proxysql:

Before:

ProxySQLadm> SELECT hostgroup, status, COUNT(*)
FROM stats_mysql_connection_pool GROUP BY hostgroup, status;
+-----------+--------------+----------+
| hostgroup | status       | COUNT(*) |
+-----------+--------------+----------+
| 10        | ONLINE       | 1        |
| 10        | SHUNNED      | 2        |
| 20        | ONLINE       | 3        |
| 30        | ONLINE       | 2        |
| 9999      | OFFLINE_HARD | 3        |
+-----------+--------------+----------+

Now:

ProxySQLadm> SELECT refpairs.hostgroup, refpairs.status, count(stats.srv_host) as count
             FROM (
                 SELECT DISTINCT(hostgroup), refstatuses.status
                 FROM stats_mysql_connection_pool
                 JOIN (
                     select 'ONLINE' as status union all select 'SHUNNED' as status union all
                     select 'OFFLINE_SOFT' as status union all select 'OFFLINE_HARD' as status
                 ) refstatuses
             ) refpairs
             LEFT JOIN stats_mysql_connection_pool stats ON stats.hostgroup=refpairs.hostgroup and stats.status=refpairs.status
             GROUP BY refpairs.hostgroup, refpairs.status;
+-----------+--------------+-------+
| hostgroup | status       | count |
+-----------+--------------+-------+
| 10        | OFFLINE_HARD | 0     |
| 10        | OFFLINE_SOFT | 0     |
| 10        | ONLINE       | 1     |
| 10        | SHUNNED      | 2     |
| 20        | OFFLINE_HARD | 0     |
| 20        | OFFLINE_SOFT | 0     |
| 20        | ONLINE       | 3     |
| 20        | SHUNNED      | 0     |
| 30        | OFFLINE_HARD | 0     |
| 30        | OFFLINE_SOFT | 0     |
| 30        | ONLINE       | 2     |
| 30        | SHUNNED      | 0     |
| 9999      | OFFLINE_HARD | 3     |
| 9999      | OFFLINE_SOFT | 0     |
| 9999      | ONLINE       | 0     |
| 9999      | SHUNNED      | 0     |
+-----------+--------------+-------+

=> same values than before (when not zero) ✅ => and all other pairs of hostgroup+status are now also generated as 0 🆕

Review checklist (to be filled by reviewers)

  • [ ] Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • [ ] Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • [ ] If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

frivoire avatar Jan 20 '25 17:01 frivoire

NB: I've also notified the Datadog support about this issue => request #1997802

frivoire avatar Jan 20 '25 17:01 frivoire

1 - can you add the changelog entry for this change: ddev release changelog new

2 - can you fix the linter: ddev test proxysql --fmt

HadhemiDD avatar Jan 23 '25 14:01 HadhemiDD

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 88.74%. Comparing base (6782bb7) to head (ac604ed). Report is 828 commits behind head on master.

Additional details and impacted files
Flag Coverage Δ
activemq ?
cassandra ?
hive ?
hivemq ?
hudi ?
ignite ?
jboss_wildfly ?
kafka ?
presto ?
solr ?

Flags with carried forward coverage won't be shown. Click here to find out more.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Jan 24 '25 11:01 codecov[bot]

Hello,

1 - can you add the changelog entry for this change: ddev release changelog new 2 - can you fix the linter: ddev test proxysql --fmt

I'm not a Datadog employee, I know nothing about your internal process here 😬 I tried (installing ddev: brew install ddev/ddev/ddev), and running it from inside this git repo's directory locally, but:

flo@macbook:~/git/_external/datadog-integrations-core$ ddev release changelog new
Error: unknown command "release" for "ddev"
Run 'ddev --help' for usage.

$ ddev --version
ddev version v1.24.2

So, can you handle this for us ? (we are just customers trying to fix a bug in Datadog 😆) Or at least, provide a doc explaining how to do what you're asking me 🙏

frivoire avatar Jan 29 '25 18:01 frivoire

The datadog agent is designed to only collect metrics when present, so a missing value do not translate into 0 but rather into nothing (no value, no tags) until it gets generated. So this is not a bug but more of of a feature request.

We can make the change on your behalf(might take some time though), however, after having a second look at it and checking for other similar requests (none found) , we believe it would be better to hide this behaviour behind a config option for this integration that is disabled by default but can be enabled for customers who want it. The reason behind is to avoid generating the extra metrics for other customers and break the current behaviour.

As for ddev, it is not restricted to datadog internal process, for future references you can check these docs: ddev dev guide

HadhemiDD avatar Feb 03 '25 13:02 HadhemiDD

The datadog agent is designed to only collect metrics when present, so a missing value do not translate into 0 but rather into nothing (no value, no tags) until it gets generated. So this is not a bug but more of of a feature request.

Here, the issue is not the behavior of the datadog agent that you described, but in the way that metrics are "exposed" by proxysql: And having no healthly "backend" servers to route the traffic to is really an important issue for a proxysql instance, because its main feature is being a reverse-proxy, a bit like haproxy. So, here specifically, yes, it is a bug of not having those metric points : it's not showing properly the state of the system (and a critical time), so a pretty partial observability tool ;)

To explain visually the issue, here is a screenshot of the metric during a short incident (~2min of no backend servers healthly, so no server with status=online in hostgroup=10): Metrics_Explorer___Datadog => can you spot the incident on this graph ?

No 😢 But it's there, between 9h38 and 9h40, we had 0 online backend servers for this hostgroup. And we cannot see it on Datadog because the interpolation mechanism has generated "fake" values (1) during the ~2 min gap of metric from the agent, so everything seems normal on graph.

So, you'll tell me, let's remove interpolation (using .fill(null)) => the graph still shows the same flat line (at 1), not showing the issue, because the graph is linking the "real" points together, even with a 2min gap between them

So the bug is really that having 0 backend is not the same as having no metric, it's really bringing information: there is no healthly backend, which explains that the sql traffic can't be routed. So it's really important !

Here is the expected graph: Metrics_Explorer___Datadog

We can make the change on your behalf(might take some time though), however, after having a second look at it and checking for other similar requests (none found) , we believe it would be better to hide this behaviour behind a config option for this integration that is disabled by default but can be enabled for customers who want it. The reason behind is to avoid generating the extra metrics for other customers and break the current behaviour.

A config option is good for me. But I have no idea how to implement it, so please do 🙏

As for ddev, it is not restricted to datadog internal process, for future references you can check these docs: ddev dev guide

OK thanks, I'll keep that in mind. But don't hesitate to do it, it will be much faster for you or your colleagues, you already have everything in place.

frivoire avatar Feb 17 '25 08:02 frivoire

Any news here ? The metric is still buggy here 😬

frivoire avatar Jun 03 '25 15:06 frivoire

Hey @frivoire sorry for the delay- I've implemented the fix in this PR: https://github.com/DataDog/integrations-core/pull/20582, let me know if this looks good to you!

sarah-witt avatar Jun 26 '25 13:06 sarah-witt

Hi @frivoire I'm going to merge this PR: https://github.com/DataDog/integrations-core/pull/20582/files and close this one, but please reach out if you have any questions or concerns with the approach!

sarah-witt avatar Jun 27 '25 15:06 sarah-witt