Mimir mixin Disk space utilization panels broken for mimir helm chart

Open jmichalek132 opened this issue 1 year ago • 12 comments

Describe the bug

The disk space utilization panels in the mimir mixin don't work with mimir deployed using the mimir-distributed helm chart.

Query example:

max by(persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes{cluster=~"$cluster", namespace=~"$namespace"} /
  kubelet_volume_stats_capacity_bytes{cluster=~"$cluster", namespace=~"$namespace"}
)
and
count by(persistentvolumeclaim) (
  kube_persistentvolumeclaim_labels{
    cluster=~"$cluster", namespace=~"$namespace",
    label_name=~"(ingester).*"
  }
)

The problematic part is the label_name=~"(ingester).*" matcher on the kube_persistentvolumeclaim_labels metric, which comes from kube-state-metrics.

There are two issues with it:

  • If you use the kube-state-metrics service monitor from metamonitoring, it drops the kube_persistentvolumeclaim_labels metric because of this metric relabeling rule
  • The PVCs produced don't have this label:
  labels:
    app.kubernetes.io/component: ingester
    app.kubernetes.io/instance: metrics
    app.kubernetes.io/name: mimir
    rollout-group: ingester
    zone: zone-a

To Reproduce

Steps to reproduce the behavior:

  1. Deploy Mimir with the mimir-distributed helm chart
  2. Deploy the mimir mixin

Expected behavior

For Disk space utilization panels to work.

Environment

  • Infrastructure: Kubernetes, AKS
  • Deployment tool: Helm

Additional Context

Not sure what would be the best way to fix this. Things that come to mind

  • Update the service monitor so that it no longer drops metrics used by the mixin
  • Allow configuring which label is used instead of label_name in those panels
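
For illustration, the second option could look roughly like the query below. This is only a sketch: it assumes kube-state-metrics is configured to expose the PVC label app.kubernetes.io/component (for example through its --metric-labels-allowlist flag), in which case the label appears on the metric as label_app_kubernetes_io_component. Choosing that label is an assumption made for this example, not something the mixin does today.

```promql
# Same panel query as above, but matching on the component label
# that the helm chart already sets on its PVCs (assumed to be
# exposed by kube-state-metrics).
max by(persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes{cluster=~"$cluster", namespace=~"$namespace"} /
  kubelet_volume_stats_capacity_bytes{cluster=~"$cluster", namespace=~"$namespace"}
)
and
count by(persistentvolumeclaim) (
  kube_persistentvolumeclaim_labels{
    cluster=~"$cluster", namespace=~"$namespace",
    label_app_kubernetes_io_component=~"(ingester).*"
  }
)
```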

Willing to submit a PR to address this after feedback.

jmichalek132 avatar Mar 01 '24 10:03 jmichalek132

Keeping the kube_persistentvolumeclaim_labels metric just for the pods from mimir looks tricky. Can we use 'app.kubernetes.io/name: mimir' inside the relabelling config to pick this up? To avoid scraping other mimir clusters in the same k8s cluster, we can add another 'keep' relabelling rule that checks the namespace label (we need to make sure it's present on all currently collected metrics).

Regarding the second problem of using different labels: is it possible to use label_replace in the panel query so we take either 'label_rollout_group' or 'label_name', whichever is present? I.e. replace label_name with the other only if the other is non-empty. Maybe there's something we can do with a regex?
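
A rough sketch of the relabeling step described above, shown on its own and not as a complete panel query. It assumes kube-state-metrics exposes the rollout-group PVC label as label_rollout_group. label_replace returns a series unchanged when the regex does not match the source label, and "(.+)" does not match an empty value, so label_name is only overwritten when label_rollout_group is non-empty.

```promql
# Copy label_rollout_group into label_name when it is non-empty;
# series without label_rollout_group keep their existing label_name.
label_replace(
  kube_persistentvolumeclaim_labels{cluster=~"$cluster", namespace=~"$namespace"},
  "label_name", "$1",
  "label_rollout_group", "(.+)"
)
```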

dimitarvdimitrov avatar Mar 01 '24 15:03 dimitarvdimitrov

@dimitarvdimitrov I'm facing this issue right now. Would the app.kubernetes.io/component: ingester label not be a better choice than the rollout group for the query? I can work on that, as I really need this to be fixed

QuentinBisson avatar Apr 17 '24 21:04 QuentinBisson

Using app.kubernetes.io/name: mimir makes this slightly more reusable: we don't have to update the scraping rules every time a new component has a disk (or we rename the component, or add a disk to an existing component, etc.). If that's not possible, then app.kubernetes.io/component should also suffice

dimitarvdimitrov avatar Apr 22 '24 13:04 dimitarvdimitrov

I think the issue with using app.kubernetes.io/name is that it would not work on the specific write dashboard, because it would show the data of all Mimir components instead of just the ingester

QuentinBisson avatar Apr 23 '24 08:04 QuentinBisson

See PR here https://github.com/grafana/mimir/pull/7968

QuentinBisson avatar Apr 25 '24 12:04 QuentinBisson

Let me know how I can speed things up :)

QuentinBisson avatar Apr 29 '24 10:04 QuentinBisson

Using a common label on the kube_persistentvolumeclaim_labels selector

My suggestion in the issue was to add support for both labels in the panel via the label_replace or label_join PromQL function. But I realized this won't work, because we cannot filter on the series labels outside of the vector selector (kube_persistentvolumeclaim_labels{...})

Another option for solving the selector problem is to add the standard kubernetes labels (app.kubernetes.io/*) to the jsonnet mixin and use that in the dashboards like the ones here

https://github.com/grafana/mimir/blob/f573a03208fb076f8d65375c72572528e343912f/operations/helm/charts/mimir-distributed/templates/_helpers.tpl#L226-L237

dimitarvdimitrov avatar May 03 '24 17:05 dimitarvdimitrov

@dimitarvdimitrov I'm not sure what you mean by adding the label to the jsonnet mixin, I'm fine implementing a working solution though :D

The main issue with using kube_persistentvolumeclaim_labels is that newer versions of kube-state-metrics do not expose any labels by default (since 2.11 I think), and they need to be explicitly asked for

QuentinBisson avatar May 03 '24 17:05 QuentinBisson

adding the label to the jsonnet mixin

I meant adding the label to the resources created by the jsonnet library to deploy Mimir. This lets us use a single label selector in the PromQL query: it will match both jsonnet and helm deployments, since they will have a label in common

The main issue with using kube_persistentvolumeclaim_labels is that newer versions of kube-state-metrics do not expose any labels by default (since 2.11 I think), and they need to be explicitly asked for

That doesn't sound great. So the metric is just empty by default? I couldn't find this change in the changelog. Do you have a link to the PR or changelog entry?

dimitarvdimitrov avatar May 06 '24 10:05 dimitarvdimitrov

I was wrong, this happened in kube-state-metrics 2.10 https://github.com/kubernetes/kube-state-metrics/releases/tag/v2.10.0 (cf. the top message)

QuentinBisson avatar May 06 '24 10:05 QuentinBisson

Regarding the fix, I assume this would be needed under operations/mimir? In that case I'm not sure I would be the best person to fix it, because I'm definitely lost in this folder

QuentinBisson avatar May 06 '24 11:05 QuentinBisson

@dimitarvdimitrov what do you think about using something like

kubelet_volume_stats_used_bytes{cluster_id=~"$cluster", namespace=~"$namespace", persistentvolumeclaim=~".*(ingester)-.*"} /
kubelet_volume_stats_capacity_bytes{cluster_id=~"$cluster", namespace=~"$namespace", persistentvolumeclaim=~".*(ingester)-.*"}

instead of relying on the kube-state-metrics labels metric?
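
To check which PVCs such a matcher would select before changing a panel, a query along these lines could help. It is a sketch under the assumption that the ingester PVC names contain "ingester-". Prometheus regex matchers are fully anchored, which is why the pattern needs the leading and trailing ".*". The cluster_id label follows the query above; the mixin query in the issue description uses cluster.

```promql
# Lists the PVC names selected by the name-based matcher,
# one series per PVC.
count by(persistentvolumeclaim) (
  kubelet_volume_stats_capacity_bytes{
    cluster_id=~"$cluster", namespace=~"$namespace",
    persistentvolumeclaim=~".*(ingester)-.*"
  }
)
```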

QuentinBisson avatar May 14 '24 20:05 QuentinBisson