
[Monitoring V2 UI] set default values for new options depending on the cluster type

Open · jiaqiluo opened this issue 3 years ago

To resolve this issue, we will introduce two new sub-charts, rkeIngressNginx and rke2IngressNginx, to scrape metrics from the ingress-nginx Deployment/DaemonSet in RKE and RKE2 clusters respectively.

We need to update the dashboard so that it sets the following default values when Monitoring V2 is installed or upgraded from the dashboard:

  • if the cluster is an RKE cluster, set rkeIngressNginx.enabled=true
  • if the cluster is an RKE2 cluster, set rke2IngressNginx.enabled=true

In addition, if the UI can detect the Kubernetes version of an RKE2 cluster, set rke2IngressNginx.clients.deployment.enabled=true when the version is <= 1.20.
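
As a rough illustration, here is a minimal TypeScript sketch of that logic; the ChartOverrides shape, the function name, and the provider/version inputs are illustrative assumptions, not the dashboard's actual API:

// Minimal sketch of the proposed defaults; names are illustrative.
interface ChartOverrides {
  rkeIngressNginx?: { enabled: boolean };
  rke2IngressNginx?: {
    enabled: boolean;
    clients?: { deployment?: { enabled: boolean } };
  };
}

function ingressNginxDefaults(
  provider: string,        // e.g. 'rke' or 'rke2'
  k8sMinor: number | null, // the cluster's minor k8s version, if detectable
): ChartOverrides {
  if (provider === 'rke') {
    return { rkeIngressNginx: { enabled: true } };
  }
  if (provider === 'rke2') {
    const rke2: NonNullable<ChartOverrides['rke2IngressNginx']> = { enabled: true };
    // RKE2 <= 1.20 runs the ingress-nginx controller as a Deployment,
    // so the pushprox client must run in deployment mode as well.
    if (k8sMinor !== null && k8sMinor <= 20) {
      rke2.clients = { deployment: { enabled: true } };
    }
    return { rke2IngressNginx: rke2 };
  }
  return {};
}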

Here is the full list of the new options in the values.yaml file:

rkeIngressNginx:
  enabled: false
  metricsPort: 10254
  component: ingress-nginx
  clients:
    port: 10015
    useLocalhost: true
    tolerations:
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
    nodeSelector:
      node-role.kubernetes.io/worker: "true"
rke2IngressNginx:
  enabled: false
  metricsPort: 10254
  component: ingress-nginx
  clients:
    port: 10015
    useLocalhost: true
    tolerations:
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
    # in the RKE2 cluster, the ingress-nginx-controller is deployed as
    # a Deployment with 1 pod when RKE2 version is <= 1.20,
    # a DaemonSet when RKE2 version is >= 1.21
    deployment:
      enabled: false
      replicas: 1
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app.kubernetes.io/component"
                    operator: "In"
                    values:
                      - "controller"
              topologyKey: "kubernetes.io/hostname"
              namespaces:
                - "kube-system"

jiaqiluo avatar Jun 08 '21 17:06 jiaqiluo

Hi, there is an alert on the cluster dashboard: [100% of the ingress-nginx/pushprox-ingress-nginx-client targets in cattle-monitoring-system namespace are down](https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-targetdown)

I checked the logs of the pushprox-ingress-nginx-client pod:

level=info ts=2022-10-09T21:37:54.303Z caller=main.go:232 msg="Got scrape request" scrape_id=7f394d24-98f3-4eda-a795-f64967426f8a url=http://<agent-host-ip>:10254/metrics
level=error ts=2022-10-09T21:37:54.303Z caller=main.go:101 err="failed to scrape http://127.0.0.1:10254/metrics: Get \"http://127.0.0.1:10254/metrics\": dial tcp 127.0.0.1:10254: connect: connection refused"
level=info ts=2022-10-09T21:37:54.304Z caller=main.go:113 msg="Pushed failed scrape response"

I also checked the Prometheus Targets output:

serviceMonitor/cattle-monitoring-system/rancher-monitoring-ingress-nginx/0 (0/1 up)

Endpoint:        http://:10254/metrics
State:           DOWN
Labels:          component="ingress-nginx" endpoint="metrics" instance=":10254" job="ingress-nginx" namespace="cattle-monitoring-system" pod="pushprox-ingress-nginx-client-67fdccf9d-qxg8w" service="pushprox-ingress-nginx-client"
Last Scrape:     25.457s ago
Scrape Duration: 3.319ms
Error:           server returned HTTP status 500 Internal Server Error


From the Helm chart's values.yaml:

rke2IngressNginx:
  clients:
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - controller
            namespaces:
              - kube-system
            topologyKey: kubernetes.io/hostname
    deployment:
      enabled: true
      replicas: 1
    port: 10015
    tolerations:
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
    useLocalhost: true
  component: ingress-nginx
  enabled: true
  kubeVersionOverrides:
    - constraint: <= 1.20
      values:
        clients:
          deployment:
            enabled: false
  metricsPort: 10254

<agent-node>:~# k get svc -n cattle-monitoring-system
NAME                                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                         ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   34h
prometheus-operated                           ClusterIP   None            <none>        9090/TCP                     34h
pushprox-ingress-nginx-client                 ClusterIP   10.43.150.200   <none>        10254/TCP                    34h
pushprox-ingress-nginx-proxy                  ClusterIP   10.43.179.36    <none>        8080/TCP                     34h
pushprox-kube-controller-manager-client       ClusterIP   10.43.60.93     <none>        10257/TCP                    34h
pushprox-kube-controller-manager-proxy        ClusterIP   10.43.1.192     <none>        8080/TCP                     34h
pushprox-kube-etcd-client                     ClusterIP   10.43.240.185   <none>        2381/TCP                     34h
pushprox-kube-etcd-proxy                      ClusterIP   10.43.21.180    <none>        8080/TCP                     34h
pushprox-kube-proxy-client                    ClusterIP   10.43.148.65    <none>        10249/TCP                    34h
pushprox-kube-proxy-proxy                     ClusterIP   10.43.226.62    <none>        8080/TCP                     34h
pushprox-kube-scheduler-client                ClusterIP   10.43.122.24    <none>        10259/TCP                    34h
pushprox-kube-scheduler-proxy                 ClusterIP   10.43.39.26     <none>        8080/TCP                     34h
rancher-monitoring-alertmanager               ClusterIP   10.43.20.177    <none>        9093/TCP                     34h
rancher-monitoring-grafana                    ClusterIP   10.43.84.131    <none>        80/TCP                       34h
rancher-monitoring-kube-state-metrics         ClusterIP   10.43.98.216    <none>        8080/TCP                     34h
rancher-monitoring-operator                   ClusterIP   10.43.22.230    <none>        443/TCP                      34h
rancher-monitoring-prometheus                 ClusterIP   10.43.97.193    <none>        9090/TCP                     34h
rancher-monitoring-prometheus-adapter         ClusterIP   10.43.171.251   <none>        443/TCP                      34h
rancher-monitoring-prometheus-node-exporter   ClusterIP   10.43.42.19     <none>        9796/TCP                     34h

<agent-node>:~# ss -tulpn
Netid     State      Recv-Q     Send-Q         Local Address:Port          Peer Address:Port     Process
udp       UNCONN     0          0              127.0.0.53%lo:53                 0.0.0.0:*         users:(("systemd-resolve",pid=917,fd=12))
udp       UNCONN     0          0                    0.0.0.0:111                0.0.0.0:*         users:(("rpcbind",pid=881,fd=5),("systemd",pid=1,fd=119))
udp       UNCONN     0          0                    0.0.0.0:8472               0.0.0.0:*
udp       UNCONN     0          0                  127.0.0.1:323                0.0.0.0:*         users:(("chronyd",pid=2297621,fd=5))
udp       UNCONN     0          0                       [::]:111                   [::]:*         users:(("rpcbind",pid=881,fd=7),("systemd",pid=1,fd=121))
udp       UNCONN     0          0                      [::1]:323                   [::]:*         users:(("chronyd",pid=2297621,fd=6))
tcp       LISTEN     0          4096               127.0.0.1:10248              0.0.0.0:*         users:(("kubelet",pid=2335393,fd=22))
tcp       LISTEN     0          4096               127.0.0.1:10249              0.0.0.0:*         users:(("kube-proxy",pid=2225,fd=13))
tcp       LISTEN     0          4096               127.0.0.1:9099               0.0.0.0:*         users:(("calico-node",pid=3607320,fd=9))
tcp       LISTEN     0          4096               127.0.0.1:6443               0.0.0.0:*         users:(("rke2",pid=2335348,fd=18))
tcp       LISTEN     0          4096               127.0.0.1:6444               0.0.0.0:*         users:(("rke2",pid=2335348,fd=8))
tcp       LISTEN     0          4096                 0.0.0.0:111                0.0.0.0:*         users:(("rpcbind",pid=881,fd=4),("systemd",pid=1,fd=118))
tcp       LISTEN     0          4096               127.0.0.1:10256              0.0.0.0:*         users:(("kube-proxy",pid=2225,fd=7))
tcp       LISTEN     0          4096           127.0.0.53%lo:53                 0.0.0.0:*         users:(("systemd-resolve",pid=917,fd=13))
tcp       LISTEN     0          128                  0.0.0.0:22                 0.0.0.0:*         users:(("sshd",pid=994,fd=3))
tcp       LISTEN     0          100                127.0.0.1:25                 0.0.0.0:*         users:(("master",pid=2290290,fd=13))
tcp       LISTEN     0          128                127.0.0.1:6010               0.0.0.0:*         users:(("sshd",pid=3610407,fd=10))
tcp       LISTEN     0          4096               127.0.0.1:10010              0.0.0.0:*         users:(("containerd",pid=2335363,fd=361))
tcp       LISTEN     0          4096                       *:10250                    *:*         users:(("kubelet",pid=2335393,fd=34))
tcp       LISTEN     0          4096                    [::]:111                   [::]:*         users:(("rpcbind",pid=881,fd=6),("systemd",pid=1,fd=120))
tcp       LISTEN     0          128                     [::]:22                    [::]:*         users:(("sshd",pid=994,fd=4))
tcp       LISTEN     0          100                    [::1]:25                    [::]:*         users:(("master",pid=2290290,fd=14))
tcp       LISTEN     0          4096                       *:9369                     *:*         users:(("pushprox-client",pid=2184913,fd=3))
tcp       LISTEN     0          128                    [::1]:6010                  [::]:*         users:(("sshd",pid=3610407,fd=9))
tcp       LISTEN     0          4096                       *:9091                     *:*         users:(("calico-node",pid=3607320,fd=14))
tcp       LISTEN     0          4096                       *:9796                     *:*         users:(("node_exporter",pid=2184852,fd=3))

<agent-node>:~# k get po -n cattle-monitoring-system
NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-rancher-monitoring-alertmanager-0            2/2     Running   0          34h
prometheus-rancher-monitoring-prometheus-0                3/3     Running   0          34h
pushprox-ingress-nginx-client-67fdccf9d-qxg8w             1/1     Running   0          54m
pushprox-ingress-nginx-proxy-5497b7dbd-p9mbt              1/1     Running   0          34h
pushprox-kube-controller-manager-client-6vcg6             1/1     Running   0          34h
pushprox-kube-controller-manager-proxy-64f6dc94c6-l5ml2   1/1     Running   0          34h
pushprox-kube-etcd-client-f6qx7                           1/1     Running   0          34h
pushprox-kube-etcd-proxy-55544d768d-pphxx                 1/1     Running   0          34h
pushprox-kube-proxy-client-bmxpx                          1/1     Running   0          34h
pushprox-kube-proxy-client-cx6r2                          1/1     Running   0          34h
pushprox-kube-proxy-client-m8knf                          1/1     Running   0          34h
pushprox-kube-proxy-proxy-85f89bcc4d-cp6cd                1/1     Running   0          34h
pushprox-kube-scheduler-client-wldxg                      1/1     Running   0          34h
pushprox-kube-scheduler-proxy-6cb664c86b-8c7mq            1/1     Running   0          34h
rancher-monitoring-grafana-586df56bff-nlvgz               3/3     Running   0          34h
rancher-monitoring-kube-state-metrics-77ddfd789b-tmjvn    1/1     Running   0          34h
rancher-monitoring-operator-79cdfbcf48-nh9ck              1/1     Running   0          34h
rancher-monitoring-prometheus-adapter-79d8db9697-nvsxv    1/1     Running   0          34h
rancher-monitoring-prometheus-node-exporter-jd8wx         1/1     Running   0          34h
rancher-monitoring-prometheus-node-exporter-s6b2z         1/1     Running   0          34h
rancher-monitoring-prometheus-node-exporter-vb52j         1/1     Running   0          34h

The Grafana Node Exporter / Nodes dashboard generally works, but sometimes throws errors: [screenshot]

What is wrong? I am still trying to understand Rancher, RKE2, and their monitoring integrations, so sorry if this is a simple question...

ugurserhattoy avatar Oct 09 '22 22:10 ugurserhattoy

@jiaqiluo I saw that the corresponding ticket was finished a while ago: https://github.com/rancher/charts/pull/1227. Is the frontend work still needed?

catherineluse avatar Nov 05 '22 22:11 catherineluse