[proxysql] Generate metric for all pairs of hostgroup/status (backend)
What does this PR do?
Ensure that the count metric is always generated, even when 0 backend exists in that status & hostgroup at the moment.
Motivation
Today, when a hostgroup has no online backend server for example, the metric with tag status=ONLINE+hostgroup=42 is not generated at all (no value for sometime, until a backend is online).
But having no online backend is an important piece of information to monitor a proxysql.
Example: a dashboard showing the nb of online backend servers per hostgroup is not giving a good observability if not showing the 0 value, because it has a different meaning than no value :(
NB: the list of hostgroups is dynamic (because it depends on config, thus different for each deployment), and the list of status is hardcoded (because fixed in proxysql, cf doc)
Tests
I've only tested the SQL query alone on a real proxysql:
Before:
ProxySQLadm> SELECT hostgroup, status, COUNT(*)
FROM stats_mysql_connection_pool GROUP BY hostgroup, status;
+-----------+--------------+----------+
| hostgroup | status | COUNT(*) |
+-----------+--------------+----------+
| 10 | ONLINE | 1 |
| 10 | SHUNNED | 2 |
| 20 | ONLINE | 3 |
| 30 | ONLINE | 2 |
| 9999 | OFFLINE_HARD | 3 |
+-----------+--------------+----------+
Now:
ProxySQLadm> SELECT refpairs.hostgroup, refpairs.status, count(stats.srv_host) as count
FROM (
SELECT DISTINCT(hostgroup), refstatuses.status
FROM stats_mysql_connection_pool
JOIN (
select 'ONLINE' as status union all select 'SHUNNED' as status union all
select 'OFFLINE_SOFT' as status union all select 'OFFLINE_HARD' as status
) refstatuses
) refpairs
LEFT JOIN stats_mysql_connection_pool stats ON stats.hostgroup=refpairs.hostgroup and stats.status=refpairs.status
GROUP BY refpairs.hostgroup, refpairs.status;
+-----------+--------------+-------+
| hostgroup | status | count |
+-----------+--------------+-------+
| 10 | OFFLINE_HARD | 0 |
| 10 | OFFLINE_SOFT | 0 |
| 10 | ONLINE | 1 |
| 10 | SHUNNED | 2 |
| 20 | OFFLINE_HARD | 0 |
| 20 | OFFLINE_SOFT | 0 |
| 20 | ONLINE | 3 |
| 20 | SHUNNED | 0 |
| 30 | OFFLINE_HARD | 0 |
| 30 | OFFLINE_SOFT | 0 |
| 30 | ONLINE | 2 |
| 30 | SHUNNED | 0 |
| 9999 | OFFLINE_HARD | 3 |
| 9999 | OFFLINE_SOFT | 0 |
| 9999 | ONLINE | 0 |
| 9999 | SHUNNED | 0 |
+-----------+--------------+-------+
=> same values than before (when not zero) ✅ => and all other pairs of hostgroup+status are now also generated as 0 🆕
Review checklist (to be filled by reviewers)
- [ ] Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
- [ ] Add the
qa/skip-qalabel if the PR doesn't need to be tested during QA. - [ ] If you need to backport this PR to another branch, you can add the
backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged
NB: I've also notified the Datadog support about this issue => request #1997802
1 - can you add the changelog entry for this change: ddev release changelog new
2 - can you fix the linter: ddev test proxysql --fmt
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 88.74%. Comparing base (
6782bb7) to head (ac604ed). Report is 828 commits behind head on master.
Additional details and impacted files
| Flag | Coverage Δ | |
|---|---|---|
| activemq | ? |
|
| cassandra | ? |
|
| hive | ? |
|
| hivemq | ? |
|
| hudi | ? |
|
| ignite | ? |
|
| jboss_wildfly | ? |
|
| kafka | ? |
|
| presto | ? |
|
| solr | ? |
Flags with carried forward coverage won't be shown. Click here to find out more.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
- :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.
Hello,
1 - can you add the changelog entry for this change: ddev release changelog new 2 - can you fix the linter: ddev test proxysql --fmt
I'm not a Datadog employee, I know nothing about your internal process here 😬
I tried (installing ddev: brew install ddev/ddev/ddev), and running it from inside this git repo's directory locally, but:
flo@macbook:~/git/_external/datadog-integrations-core$ ddev release changelog new
Error: unknown command "release" for "ddev"
Run 'ddev --help' for usage.
$ ddev --version
ddev version v1.24.2
So, can you handle this for us ? (we are just customers trying to fix a bug in Datadog 😆) Or at least, provide a doc explaining how to do what you're asking me 🙏
The datadog agent is designed to only collect metrics when present, so a missing value do not translate into 0 but rather into nothing (no value, no tags) until it gets generated. So this is not a bug but more of of a feature request.
We can make the change on your behalf(might take some time though), however, after having a second look at it and checking for other similar requests (none found) , we believe it would be better to hide this behaviour behind a config option for this integration that is disabled by default but can be enabled for customers who want it. The reason behind is to avoid generating the extra metrics for other customers and break the current behaviour.
As for ddev, it is not restricted to datadog internal process, for future references you can check these docs: ddev dev guide
The datadog agent is designed to only collect metrics when present, so a missing value do not translate into 0 but rather into nothing (no value, no tags) until it gets generated. So this is not a bug but more of of a feature request.
Here, the issue is not the behavior of the datadog agent that you described, but in the way that metrics are "exposed" by proxysql: And having no healthly "backend" servers to route the traffic to is really an important issue for a proxysql instance, because its main feature is being a reverse-proxy, a bit like haproxy. So, here specifically, yes, it is a bug of not having those metric points : it's not showing properly the state of the system (and a critical time), so a pretty partial observability tool ;)
To explain visually the issue, here is a screenshot of the metric during a short incident (~2min of no backend servers healthly, so no server with status=online in hostgroup=10):
=> can you spot the incident on this graph ?
No 😢 But it's there, between 9h38 and 9h40, we had 0 online backend servers for this hostgroup.
And we cannot see it on Datadog because the interpolation mechanism has generated "fake" values (1) during the ~2 min gap of metric from the agent, so everything seems normal on graph.
So, you'll tell me, let's remove interpolation (using .fill(null))
=> the graph still shows the same flat line (at 1), not showing the issue, because the graph is linking the "real" points together, even with a 2min gap between them
So the bug is really that having 0 backend is not the same as having no metric, it's really bringing information: there is no healthly backend, which explains that the sql traffic can't be routed. So it's really important !
Here is the expected graph:
We can make the change on your behalf(might take some time though), however, after having a second look at it and checking for other similar requests (none found) , we believe it would be better to hide this behaviour behind a config option for this integration that is disabled by default but can be enabled for customers who want it. The reason behind is to avoid generating the extra metrics for other customers and break the current behaviour.
A config option is good for me. But I have no idea how to implement it, so please do 🙏
As for ddev, it is not restricted to datadog internal process, for future references you can check these docs: ddev dev guide
OK thanks, I'll keep that in mind. But don't hesitate to do it, it will be much faster for you or your colleagues, you already have everything in place.
Any news here ? The metric is still buggy here 😬
Hey @frivoire sorry for the delay- I've implemented the fix in this PR: https://github.com/DataDog/integrations-core/pull/20582, let me know if this looks good to you!
Hi @frivoire I'm going to merge this PR: https://github.com/DataDog/integrations-core/pull/20582/files and close this one, but please reach out if you have any questions or concerns with the approach!