flink
flink copied to clipboard
[FLINK-24941][datadog] Support Boolean gauges
What is the purpose of the change
Supporting Boolean values for Datadog metrics reporter. Closes FLINK-24941.
Motivation
At the moment, anyone who uses Datadog reporter cannot monitor backpressure. This is in contrast to other metric reporters such as Prometheus, which will happily report Booleans as either 0 or 1. This is detrimental since it forces you to view backpressure only inside Flink UI, which means you only can see the backpressure at the present moment (and cannot know if there was a backpressure 2 hours ago), and you also cannot create alarms for when the backpressure is high.
Supporting Boolean values will allow people to see backpressure in Datadog, and will support any additional gauges which are Boolean instead of Numbers.
Brief change log
-
DGauge
is now parameterized in typeT
and will map boolean values to either0
or1
appropriately
Verifying this change
(Please pick either of the following options)
This change added tests and can be verified as follows:
- Add a test case for DGauges that checks both booleans, and Numbers
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with
@Public(Evolving)
: (no) - The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
- The S3 file system connector: (no)
Documentation
- Does this pull request introduce a new feature? (yes)
- If yes, how is the feature documented? (not applicable) Other reporters (such as Prometheus) know how to deal with Boolean values but their documentation does not mention it, so no reason to mention it in the Datadog reporter. Seems to be trivial.
CI report:
- 189a1b5160dc1c5abad6a1b25f204b5cca9c66a5 Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:-
@flinkbot run azure
re-run the last Azure build
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.
Automated Checks
Last check on commit da938cbcc8379d3692e66b674db85a790663f8f8 (Fri Nov 19 07:48:28 UTC 2021)
Warnings:
- No documentation files were touched! Remember to keep the Flink docs up to date!
- This pull request references an unassigned Jira ticket. According to the code contribution guide, tickets need to be assigned before starting with the implementation work.
Mention the bot in a comment to re-run the automated checks.
Review Progress
- ❓ 1. The [description] looks good.
- ❓ 2. There is [consensus] that the contribution should go into to Flink.
- ❓ 3. Needs [attention] from.
- ❓ 4. The change fits into the overall [architecture].
- ❓ 5. Overall code [quality] is good.
Please see the Pull Request Review Guide for a full explanation of the review process.Bot commands
The @flinkbot bot supports the following commands:
-
@flinkbot approve description
to approve one or more aspects (aspects:description
,consensus
,architecture
andquality
) -
@flinkbot approve all
to approve all aspects -
@flinkbot approve-until architecture
to approve everything untilarchitecture
-
@flinkbot attention @username1 [@username2 ..]
to require somebody's attention -
@flinkbot disapprove architecture
to remove an approval you gave earlier
What should we do to get this merged?
@rmetzger Any news?
Sorry, the label was added automatically. I guess @zentol would be the right committer to take a look at this PR.
@zentol Sorry to interrupt but it's been two weeks
@rmetzger Hi, why is this taking so long? Datadog is becoming the most popular monitoring system nowadays (the two previous companies I worked for used it) and I think tracking backpressure on your monitoring system is not optional. We currently run with a workaround in all our jobs and would like to eliminate the workaround. Moreover I am sure other people would like to see Flink backpressure in Datadog.
@rmetzger @zentol Hi, checking in to see where this is headed. Thanks!
So gauges are a bit annoying. Ideally we would only have Number gauges because really nothing else is properly supported by a majority of systems.
Boolean gauges are usually a mistake from the get-go as well.
Lets take the isBackpressured
metric. This metric tells you whether the task is back-pressured right now at this very moment.
That's a terrible metric to make any decision, and you should rather use backPressuredTimeMsPerSecond
because it's not susceptible to bad luck. isBackpressured
is only accurate if you are either 100% or 0% back-pressured; for everything in-between it's quite inaccurate (especially since the sampling interval is the reporting interval, aka typically in the order of seconds) (edited)
That's why the PR didn't receive any attention. In a way it'd only enable users to rely on bad metrics. Sure, consistency across reporters isn't a bad argument, but this consistency should still provide some real value.
Mapping booleans to ints isn't necessarily sound as well, because aggregating them isn't obvious. If we really wanted to supported gauges we'd ideally map them to a distribution I guess.