
Grafana Default Dashboards - No Data

Open stephenrob opened this issue 3 years ago • 4 comments

I've deployed a VictoriaMetrics Cluster using the operator and have it working fine.

I've just deployed the Grafana dashboards from the k8s-stack Helm chart, using the values file below to generate them and rendering the output to a file with a dry run and the following command:

helm install victoria-metrics-k8s-stack vm/victoria-metrics-k8s-stack -f values.yaml -n cluster-monitoring --dry-run > /tmp/vm-k8s.yaml
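(For the manifests alone, helm template should produce roughly the same output:

helm template victoria-metrics-k8s-stack vm/victoria-metrics-k8s-stack -f values.yaml -n cluster-monitoring > /tmp/vm-k8s.yaml)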

values.yaml

grafana:
  enabled: true
  sidecar:
    datasources:
      enabled: true
      createVMReplicasDatasources: false
    dashboards:
      enabled: true
      multicluster: true

  additionalDataSources: []

  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards/default

  dashboards:
    default:
      victoriametrics:
        url: https://raw.githubusercontent.com/VictoriaMetrics/VictoriaMetrics/master/dashboards/victoriametrics.json
      vmagent:
        url: https://raw.githubusercontent.com/VictoriaMetrics/VictoriaMetrics/master/dashboards/vmagent.json
      nodeexporter:
        gnetId: 1860
        revision: 22
        datasource: VictoriaMetrics

  defaultDashboardsEnabled: true

This generates the dashboards fine and they load into Grafana without issue. The problem is that some of the metrics don't show any data unless the query is modified.

An example is Cluster Memory Utilisation on the Kubernetes / Compute Resources / Cluster dashboard:

With the metrics query as generated:

1 - sum(:node_memory_MemAvailable_bytes:sum{cluster="$cluster"}) / sum(node_memory_MemTotal_bytes{cluster="$cluster"})

The result is:

[screenshot: CleanShot 2021-07-05 at 17 11 30]

When the metrics query is changed to:

1 - sum(node_memory_MemAvailable_bytes{cluster="$cluster"}) / sum(node_memory_MemTotal_bytes{cluster="$cluster"})

Then the result is:

[screenshot: CleanShot 2021-07-05 at 17 19 03]

This seems accurate for the current usage on our monitoring cluster.
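For context, a colon-prefixed metric name like :node_memory_MemAvailable_bytes:sum only exists if something is evaluating the corresponding recording rule and writing the result back to storage; if nothing is, the query returns no data, which matches the symptom above. In the kubernetes-mixin this rule is defined roughly as follows (a sketch; exact selectors may differ by mixin version):

- record: :node_memory_MemAvailable_bytes:sum
  expr: |-
    sum(
      node_memory_MemAvailable_bytes{job="node-exporter"} or
      (
        node_memory_Buffers_bytes{job="node-exporter"} +
        node_memory_Cached_bytes{job="node-exporter"} +
        node_memory_MemFree_bytes{job="node-exporter"} +
        node_memory_Slab_bytes{job="node-exporter"}
      )
    ) by (cluster)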

There are many dashboards that have the same problem, where the query syntax needs changing. Is this something specific to the VictoriaMetrics chart and datasource, or is it an upstream issue with the kube-prometheus dashboards and how they are synced using sync_grafana_dashboards.py?

stephenrob avatar Jul 05 '21 16:07 stephenrob

Hello, many of the default Kubernetes dashboards depend on recording rules, which are evaluated by VMAlert and ingested into the storage.

So, for the cluster case, you have to edit the VMAlert configuration. For the generated config it's located in the output of victoria-metrics-k8s-stack/templates/victoria-metrics-operator/vmalert.yaml; change the vmsingle URLs to the vmcluster ones - the vmselect and vminsert nodes.

It should fix this issue.
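A minimal sketch of the relevant VMAlert fields after that change, assuming a VMCluster named cluster in the cluster-monitoring namespace and the default tenant 0 (service names and ports follow the operator's defaults, so verify them against the generated manifest):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlert
metadata:
  name: victoria-metrics-k8s-stack
  namespace: cluster-monitoring
spec:
  datasource:
    # vmselect serves read queries in the cluster version
    url: http://vmselect-cluster.cluster-monitoring.svc:8481/select/0/prometheus
  remoteRead:
    url: http://vmselect-cluster.cluster-monitoring.svc:8481/select/0/prometheus
  remoteWrite:
    # vminsert receives the recording rule results
    url: http://vminsert-cluster.cluster-monitoring.svc:8480/insert/0/prometheus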

Btw, as far as I know, k8s-stack will support the cluster version soon.
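(For reference, in chart versions that have gained cluster support this is expected to be a values toggle along these lines; the key names here are an assumption, so check the chart's values.yaml:

vmsingle:
  enabled: false
vmcluster:
  enabled: true)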

f41gh7 avatar Jul 06 '21 07:07 f41gh7

Thanks @f41gh7, I'd overlooked the VMRule and VMAlert configs when copying them to our kustomize base, assuming they were linked to Alertmanager, which I wasn't ready to set up. Now that I've added VMAlert, VMAlertmanager and all the VMRules, it's all working a lot better.

Just one more VMRule to track down to fix the CPU/Memory Requests panels on the cluster dashboard, and we should be good to go.
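For reference, the CPU Requests Commitment panel on that dashboard queries a recording rule along these lines (taken from the kubernetes-mixin, so the exact form may vary by version):

sum(namespace_cpu:kube_pod_container_resource_requests:sum{cluster="$cluster"}) / sum(kube_node_status_allocatable{job="kube-state-metrics", resource="cpu", cluster="$cluster"})

The memory panel uses the matching namespace_memory:kube_pod_container_resource_requests:sum rule.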

[screenshot: CleanShot 2021-07-06 at 12 10 42]

stephenrob avatar Jul 06 '21 11:07 stephenrob

Looks like I have the correct rule loaded, just nothing recorded against it yet.

Rule CRD
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  namespace: victoria-metrics
  name: library-systems-monitoring-k8s
spec:
  groups:
  - name: k8s.rules
    rules:
    - expr: |-
        sum by (cluster, namespace, pod, container) (
          rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
        ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
          1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate
    - expr: |-
        container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
        * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
          max by(namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_memory_working_set_bytes
    - expr: |-
        container_memory_rss{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
        * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
          max by(namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_memory_rss
    - expr: |-
        container_memory_cache{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
        * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
          max by(namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_memory_cache
    - expr: |-
        container_memory_swap{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
        * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
          max by(namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_memory_swap
    - expr: |-
        sum by (namespace, cluster) (
            sum by (namespace, pod, cluster) (
                max by (namespace, pod, container, cluster) (
                  kube_pod_container_resource_requests{resource="memory",job="kube-state-metrics"}
                ) * on(namespace, pod, cluster) group_left() max by (namespace, pod) (
                  kube_pod_status_phase{phase=~"Pending|Running"} == 1
                )
            )
        )
      record: namespace_memory:kube_pod_container_resource_requests:sum
    - expr: |-
        sum by (namespace, cluster) (
            sum by (namespace, pod, cluster) (
                max by (namespace, pod, container, cluster) (
                  kube_pod_container_resource_requests{resource="cpu",job="kube-state-metrics"}
                ) * on(namespace, pod, cluster) group_left() max by (namespace, pod) (
                  kube_pod_status_phase{phase=~"Pending|Running"} == 1
                )
            )
        )
      record: namespace_cpu:kube_pod_container_resource_requests:sum
    - expr: |-
        max by (cluster, namespace, workload, pod) (
          label_replace(
            label_replace(
              kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"},
              "replicaset", "$1", "owner_name", "(.*)"
            ) * on(replicaset, namespace) group_left(owner_name) topk by(replicaset, namespace) (
              1, max by (replicaset, namespace, owner_name) (
                kube_replicaset_owner{job="kube-state-metrics"}
              )
            ),
            "workload", "$1", "owner_name", "(.*)"
          )
        )
      labels:
        workload_type: deployment
      record: namespace_workload_pod:kube_pod_owner:relabel
    - expr: |-
        max by (cluster, namespace, workload, pod) (
          label_replace(
            kube_pod_owner{job="kube-state-metrics", owner_kind="DaemonSet"},
            "workload", "$1", "owner_name", "(.*)"
          )
        )
      labels:
        workload_type: daemonset
      record: namespace_workload_pod:kube_pod_owner:relabel
    - expr: |-
        max by (cluster, namespace, workload, pod) (
          label_replace(
            kube_pod_owner{job="kube-state-metrics", owner_kind="StatefulSet"},
            "workload", "$1", "owner_name", "(.*)"
          )
        )
      labels:
        workload_type: statefulset
      record: namespace_workload_pod:kube_pod_owner:relabel
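One way to confirm the rule has started recording is to query the series directly against the read endpoint (the URL and tenant are illustrative, matching the cluster endpoints above):

curl 'http://vmselect-cluster.cluster-monitoring.svc:8481/select/0/prometheus/api/v1/query' \
  --data-urlencode 'query=namespace_cpu:kube_pod_container_resource_requests:sum'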

stephenrob avatar Jul 06 '21 11:07 stephenrob

This is an issue with the upstream recording rules. I've submitted a PR patch which should resolve it: https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/641

stephenrob avatar Jul 08 '21 10:07 stephenrob