cortex-tools
cortextool prepare produces invalid query
When running cortextool prepare on a query such as:
(sum by (node, resource) (kube_node_status_capacity{}))
* on(node) group_left(cluster, nodepool) nodepool:node:{} < %(threshold)0.2f
)
it'll produce:
(sum by (node, resource, cluster) (kube_node_status_capacity{}))
* on(node,cluster) group_left(cluster, nodepool) nodepool:node:{} < %(threshold)0.2f
)
But this is invalid, giving: could not parse expression: 1:301: parse error: label "cluster" must not occur in ON and GROUP clause at once
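For reference, the constraint behind that parse error can be sketched in a few lines of Go. This is only an illustration of the rule, not the actual Prometheus or cortextool code; the helper name `overlap` is made up for the example.

```go
package main

import "fmt"

// overlap returns the labels that appear in both the on() list and the
// group_left()/group_right() list of a vector match. PromQL rejects an
// expression whenever this set is non-empty.
// Hypothetical helper, not part of cortextool or Prometheus.
func overlap(on, group []string) []string {
	seen := make(map[string]bool, len(on))
	for _, l := range on {
		seen[l] = true
	}
	var out []string
	for _, l := range group {
		if seen[l] {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	// Matching clause as written in the original rule: no overlap.
	fmt.Println(overlap([]string{"node"}, []string{"cluster", "nodepool"}))
	// Matching clause after prepare added "cluster" to on(): "cluster" overlaps.
	fmt.Println(overlap([]string{"node", "cluster"}, []string{"cluster", "nodepool"}))
}
```

A check like this before rewriting on() would at least flag the rules where adding the label produces an unparseable expression.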
I had another instance of this. The following query intends to derive the cluster label
from the kube_node_annotations query. The grouping should not occur on the first part
of the query.
Intended query:
100 *
sum by (instance_id, nat_gateway_name, project_id) (
stackdriver_gce_instance_compute_googleapis_com_nat_port_usage
) /
sum by (instance_id, nat_gateway_name, project_id) (
stackdriver_gce_instance_compute_googleapis_com_nat_allocated_ports
)
* on(instance_id) group_left(node, cluster)
count by (instance_id, node, cluster) (
label_replace(
kube_node_annotations{annotation_container_googleapis_com_instance_id!=""},
'instance_id', '$1',
'annotation_container_googleapis_com_instance_id', '(.*)'
)
) > 90
This forced me to remove the cluster label from group_left(), eventually rendering
a wrong query:
100 *
sum by(instance_id, nat_gateway_name, project_id, cluster) (
stackdriver_gce_instance_compute_googleapis_com_nat_port_usage
) /
sum by(instance_id, nat_gateway_name, project_id, cluster) (
stackdriver_gce_instance_compute_googleapis_com_nat_allocated_ports
)
* on(instance_id, cluster) group_left(node)
count by(instance_id, node, cluster) (
label_replace(
kube_node_annotations{annotation_container_googleapis_com_instance_id!=""},
"instance_id", "$1",
"annotation_container_googleapis_com_instance_id", "(.*)"
)
) > 90
Perhaps we can have a HeadComment (for example: # cortextool: skip rule aggregation) to indicate the aggregation should not be applied.
wdyt?
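A rough sketch of how such a marker could be detected, assuming the rule's comment text is available as a plain string. The marker text is the one proposed above; it is not an existing cortextool feature, and `hasSkipMarker` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// skipMarker is the comment text proposed in this issue.
// Assumption: this is not something cortextool recognises today.
const skipMarker = "cortextool: skip rule aggregation"

// hasSkipMarker reports whether any line of a rule's head comment
// consists of the marker, with or without the leading "#".
func hasSkipMarker(headComment string) bool {
	for _, line := range strings.Split(headComment, "\n") {
		line = strings.TrimSpace(line)
		line = strings.TrimSpace(strings.TrimPrefix(line, "#"))
		if line == skipMarker {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasSkipMarker("# cortextool: skip rule aggregation")) // true
	fmt.Println(hasSkipMarker("# some unrelated note"))               // false
}
```

Rules for which this returns true would be left untouched by prepare, so the on() and by() clauses stay exactly as written.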
Elaborating on @Duologic's comments: the query is still wrong for us because the on(instance_id, **cluster**) part is causing issues. The metrics were generated in different clusters, so we don't want to join on that label.