
Duplicate series error during join queries

Open · rahulguptajss opened this issue 11 months ago · 2 comments

Thanks @ybizeul for reporting.

[screenshot]

From the screenshot, it seems that the poller port has changed. We should investigate whether the instance label can be ignored during join queries in Prometheus/Grafana.

rahulguptajss · Mar 27 '24

Case 1: Instant Query Failure

For the problem mentioned above, we encountered an instant query failure. This can only happen if the same poller is being monitored on different ports in Harvest. Here is an example:

Suppose we are publishing the following metrics to Prometheus. The port label is used to simulate the instance label taking different values:

volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12990"} 2.45
volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12991"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12990"} 1
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12991"} 1

When we run the following Prometheus instant query:

volume_avg_latency
* on(aggr,volume) group_right
volume_labels

We receive the following error:

Error executing query: found duplicate series for the match group {aggr="EPICaggr", volume="DB1"} on the left hand-side of the operation: [...]; many-to-many matching not allowed: matching labels must be unique on one side.

Since this is not a valid use case in the field, there is no need to handle this situation.
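
Purely for illustration (no change is planned, per the note above): because both duplicate series report the same value, the left-hand side could be collapsed so that each {aggr, volume} match group contains exactly one series. This is a hypothetical sketch, not part of Harvest or its dashboards:

max by (aggr, volume) (volume_avg_latency)
* on(aggr,volume) group_right
volume_labels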

Case 2: Range Query Failure

During events such as a node move or a volume move, joins may fail because some labels change. These need to be handled per query, which may involve ignoring certain labels; a hypothetical sketch follows.
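
For example, when a volume moves, labels such as node and aggr change, so the join can match only on labels that remain stable across the move. The label set below is an assumption and would need to be adapted per query; any transient duplicates while old and new series overlap would also need to be collapsed (for example with max by, as sketched under Case 1):

volume_avg_latency
* on(cluster,svm,volume) group_right
volume_labels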

Case 3: Poller Port Change

Another scenario occurs when the poller port changes over time due to poller addition or deletion, resulting in a change to the Prometheus instance label. For simulation purposes, we use the port label.

Initially, Prometheus scrapes the following data:

volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12990"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12990"} 1

After some time, if the poller port changes to 12991, it publishes the following data:

volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12991"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12991"} 1

Running the range query volume_labels in Prometheus will then change the color of this series and list the metric twice in the Grafana panel, because of the change in port.

[screenshot]

To address this, we have two options:

  1. Modify the query to exclude the port label; label_replace sets port to the empty string, which Prometheus treats the same as the label being absent. Here is the adjusted query:

    label_replace(volume_labels,"port", "", "port", ".*")
    
  2. Drop the port label using Prometheus metric relabel rules in the Prometheus configuration (a configuration sketch follows this list).
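
A minimal sketch of option 2, assuming the simulated port label is dropped at scrape time with a metric relabel rule; the job name and target below are placeholders and would differ in a real Harvest/Prometheus setup:

scrape_configs:
  - job_name: "harvest"                # hypothetical job name
    static_configs:
      - targets: ["localhost:12990"]   # hypothetical poller target
    metric_relabel_configs:
      - regex: "port"                  # drop the simulated port label from every scraped sample
        action: labeldrop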

If we apply solution 1 to topk queries, the modification would look like this:

Before:

volume_labels
  and 
topk(5, avg_over_time(volume_labels[3h] @ end()))

After (the range becomes the subquery [3h:] because label_replace() returns an instant vector, and avg_over_time needs a range vector):

label_replace(volume_labels,"port", "", "port", ".*")
and
topk(5, avg_over_time(label_replace(volume_labels,"port", "", "port", ".*")[3h:] @ end()))
[screenshot]

rahulguptajss · Apr 24 '24

Continuation of Case 3: Poller Port Change

To resolve the color-change and duplicate-instance issues in panels, we can use the following query pattern instead of label_replace; it keeps the queries easy to understand.

Original Query:

volume_read_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}
and
topk($TopResources, avg_over_time(volume_read_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}[3h] @ end()))

New Query:

sum by (datacenter, cluster, svm, volume) (
  volume_read_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}
)
and
topk($TopResources, sum by (datacenter, cluster, svm, volume) (
  avg_over_time(
    volume_read_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}[$__rate_interval] @ end()
  )
))

rahulguptajss · Jun 27 '24

We decided there are no changes required.

cgrinds · Jul 23 '24