Duplicate series error during join queries
Thanks @ybizeul for reporting.
From the screenshot, it seems that the poller port has changed. We should investigate whether the instance label can be ignored during join queries in Prometheus/Grafana.
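For reference, here is a minimal sketch of such a join, using the labels that appear in the examples below; because on() matches only the listed labels, both instance and the simulated port label are ignored during matching:
volume_avg_latency
* on(datacenter, cluster, svm, volume) group_right
volume_labels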
Case 1: Instant Query Failure
For the problem mentioned above, we encountered an instant query failure. This can only happen if the same poller is being monitored on different ports in Harvest. Here is an example:
Suppose we are publishing the following metrics to Prometheus. The port field is used to simulate the instance label with different values:
volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12990"} 2.45
volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12991"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12990"} 1
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12991"} 1
When we run the following Prometheus instant query:
volume_avg_latency
* on(aggr,volume) group_right
volume_labels
We receive the following error:
Error executing query: found duplicate series for the match group {aggr="EPICaggr", volume="DB1"} on the left hand-side of the operation: [...]; many-to-many matching not allowed: matching labels must be unique on one side.
Since this is not a valid use case in the field, there is no need to handle this situation.
Case 2: Range Query Failure
During operations such as a node move or volume move, joins may fail. These failures need to be handled on a per-query basis, which may involve ignoring certain labels, as sketched below.
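As a hedged sketch (the exact label set depends on the dashboard), one way to make such a join resilient is to aggregate away the labels that change during the move before matching, for example:
avg by (datacenter, cluster, svm, volume) (volume_avg_latency)
* on(datacenter, cluster, svm, volume)
max by (datacenter, cluster, svm, volume) (volume_labels)
Here node and aggr are intentionally left out of the by() and on() lists, so a node move or volume move does not break the match.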
Case 3: Poller Port Change
Another scenario occurs when the poller port changes over time due to poller addition or deletion, resulting in a change to the Prometheus instance label. For simulation purposes, we use the port label.
Initially, Prometheus scrapes the following data:
volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12990"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12990"} 1
After some time, if the poller port changes to 12991, it publishes the following data:
volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12991"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12991"} 1
Running the range query volume_labels in Prometheus will result in a color change for this metric and a duplicate listing of it in the Grafana panel because of the port change.
To address this, we have two options:
- Modify the query to exclude the port label. Here is the adjusted query:
  label_replace(volume_labels,"port", "", "port", ".*")
- Drop the port label using Prometheus relabel rules in the Prometheus configuration (see the sketch after this list).
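For option 2, here is a minimal sketch of a Prometheus scrape configuration; the job name and target are placeholders, and only the metric_relabel_configs block is the relevant part. It drops the simulated port label before samples are stored:
scrape_configs:
  - job_name: "harvest"                # placeholder job name
    static_configs:
      - targets: ["localhost:12990"]   # placeholder Harvest poller endpoint
    metric_relabel_configs:
      # Drop the simulated "port" label from every scraped series.
      - action: labeldrop
        regex: port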
If we apply solution 1 to topk queries, the modification would look like this:
Before:
volume_labels
and
topk(5, avg_over_time(volume_labels[3h] @ end()))
After:
label_replace(volume_labels,"port", "", "port", ".*")
and
topk(5, avg_over_time(label_replace(volume_labels,"port", "", "port", ".*")[3h:] @ end()))
Continuation of Case 3: Poller Port Change
To resolve color/duplicate instance issues in panels, we can use the following query pattern instead of label_replace, which keeps the queries easy to understand.
Original Query:
volume_read_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}
and
topk($TopResources, avg_over_time(volume_read_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}[3h] @ end()))
New Query:
sum by (datacenter, cluster, svm, volume) (
volume_read_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}
)
and
topk($TopResources, sum by (datacenter, cluster, svm, volume) (
avg_over_time(
volume_read_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}[$__rate_interval] @ end()
)
))
We decided that no changes are required.