prometheus.erl
prometheus.erl copied to clipboard
Crashes with quantile_summary metrics and multiple schedulers
We're getting persistent crashes when emitting quantile_summary
metrics collected from multiple schedulers:
** exception error: no match of right hand side value true
in function quantile_estimator:merge/2 (/tmp/prometheus.erl/_build/default/lib/quantile_estimator/src/quantile_estimator.erl, line 141)
in call from quantile_estimator:compress/4 (/tmp/prometheus.erl/_build/default/lib/quantile_estimator/src/quantile_estimator.erl, line 133)
in call from quantile_estimator:compress/4 (/tmp/prometheus.erl/_build/default/lib/quantile_estimator/src/quantile_estimator.erl, line 133)
in call from quantile_estimator:compress/1 (/tmp/prometheus.erl/_build/default/lib/quantile_estimator/src/quantile_estimator.erl, line 106)
in call from prometheus_quantile_summary:'-values/2-fun-0-'/3 (/tmp/prometheus.erl/src/metrics/prometheus_quantile_summary.erl, line 309)
in call from lists:foldl_1/3 (lists.erl, line 1355)
in call from prometheus_quantile_summary:values/2 (/tmp/prometheus.erl/src/metrics/prometheus_quantile_summary.erl, line 306)
There appears to be two contributing causes:
-
quantile_estimator:compress/1
and its helper functionmerge/2
want to assert that the input is as expected (i.e., asquantile_estimator
itself would have produced), but the assertion on line 141 fails. -
prometheus_quantile_summary
records observations from different schedulers under different keys in its ETS table (to reduce contention). Then it wants to combine observations withquantile_merge/2
, which just appends the observations from the different schedulers (the++
on line 497), before passing that toquantile_estimator:compress/1
.
The problem, as far as I understand this code, is that the append in quantile_merge/2
is not guaranteed to result in data that quantile_estimator:merge/2
considers to be well-formed.
I'm attaching a test case (zipped due to github limitations). It's based on the actual contents of the prometheus_quantile_summary_table
ETS from one of our crashes, reduced to the bare minimum that still reproduced the crash. (I've tried to come up with a test case that only used the public APIs, but failed, possibly due to the non-determinism described below.)
bug.erl.zip
Note: quantile_summary_metrics.erl
contains two non-deterministic constructs (retrieving multiple entries from a set
ETS table) and I found it necessary to eliminate that non-determinism in order to have a reliable test case:
diff --git a/src/metrics/prometheus_quantile_summary.erl b/src/metrics/prometheus_quantile_summary.erl
index 9ffdd3e..e860bb0 100644
--- a/src/metrics/prometheus_quantile_summary.erl
+++ b/src/metrics/prometheus_quantile_summary.erl
@@ -290,7 +290,7 @@ value(Registry, Name, LabelValues) ->
[],
['$$']}]) of
[] -> undefined;
- Values -> {Count, Sum, QE} = reduce_values(Values),
+ Values -> {Count, Sum, QE} = reduce_values(lists:sort(Values)),
{Count, prometheus_time:maybe_convert_to_du(DU, Sum), quantile_values(QE, QNs)}
end.
@@ -403,7 +403,7 @@ insert_metric(Registry, Name, LabelValues, Value, ConflictCB) ->
end.
load_all_values(Registry, Name) ->
- ets:match(?TABLE, {{Registry, Name, '$1', '_'}, '$2', '$3', '$4', '_'}).
+ lists:sort(ets:match(?TABLE, {{Registry, Name, '$1', '_'}, '$2', '$3', '$4', '_'})).
get_configuration(Registry, Name) ->
MF = prometheus_metric:check_mf_exists(?TABLE, Registry, Name),
0001-eliminate-non-deterministic-behaviour.patch.txt
This is with prometheus.erl
v4.10.0
, any recent OTP (24.3.4.14, 25.3.2.7, 26.1.2), and Linux/x86_64 (AL2, Fedora 38).
The crash reported in https://github.com/deadtrickster/prometheus.erl/issues/146 looks surprisingly similar to the one we get.
I can confirm the issue. The merging logic is flawed, because, just as @mikpe suggested, concatenating data from two quantile summaries does not result in a valid quantile summary because of repeated ranks, wrong deltas, etc. There is also an issue that the ETS operations are not atomic, and with more than ?LENGTH
(by default 16) schedulers there might be race conditions.
As a result prometheus_quantile_summary
is currently unusable if there is more than one scheduler. A way to fix this might be to:
- Either implement proper summary merging logic, or just insert everything row by row on merge (starting with the largest summary).
- Also make sure that
?LENGTH
is not smaller thanerlang:system_info(schedulers_online)
.
Please open a PR with a test and fix and I will see if I can get Ilya or someone else to review and merge.
If I had a fix I would have submitted it by now.
Our workaround was to ban any usage of prometheus.erl
's quantile_summary
metrics. For now we use histograms where we need that sort of thing, but they're not ideal. We have an internal ticket to reimplement quantile summaries from scratch, if and when we have to have them, but that hasn't happened yet.