[Monitoring V2 UI] Set default values for new options depending on the cluster type
To resolve this issue, we will introduce two new sub-charts, rkeIngressNginx and rke2IngressNginx, to scrape metrics from the ingress-nginx Deployment/DaemonSet in RKE and RKE2 clusters respectively.
We need to update the dashboard so that it sets the following default values when we install or upgrade Monitoring V2 from the dashboard (a rough Helm-flag equivalent is sketched after this list):
- if the cluster is an RKE cluster, set `rkeIngressNginx.enabled=true`
- if the cluster is an RKE2 cluster, set `rke2IngressNginx.enabled=true`
- if the UI can detect the Kubernetes version of an RKE2 cluster, also set `rke2IngressNginx.deployment.enabled=true` when the Kubernetes version is <= 1.20
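For reference, these defaults are roughly equivalent to passing the following Helm flags. This is only a sketch: the dashboard applies the values through its own install flow rather than the Helm CLI, and the release, repository, and namespace names below are assumptions.

```shell
# RKE cluster (release/repo/namespace names are assumptions):
helm upgrade --install rancher-monitoring rancher-charts/rancher-monitoring \
  -n cattle-monitoring-system \
  --set rkeIngressNginx.enabled=true

# RKE2 cluster; add the second flag only when the detected
# Kubernetes version is <= 1.20 (ingress-nginx runs as a Deployment there):
helm upgrade --install rancher-monitoring rancher-charts/rancher-monitoring \
  -n cattle-monitoring-system \
  --set rke2IngressNginx.enabled=true \
  --set rke2IngressNginx.deployment.enabled=true
```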
Here is the full list of the new options in the values.yaml file:
```yaml
rkeIngressNginx:
  enabled: false
  metricsPort: 10254
  component: ingress-nginx
  clients:
    port: 10015
    useLocalhost: true
    tolerations:
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
    nodeSelector:
      node-role.kubernetes.io/worker: "true"

rke2IngressNginx:
  enabled: false
  metricsPort: 10254
  component: ingress-nginx
  clients:
    port: 10015
    useLocalhost: true
    tolerations:
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
    # in the RKE2 cluster, the ingress-nginx-controller is deployed as
    # a Deployment with 1 pod when RKE2 version is <= 1.20,
    # a DaemonSet when RKE2 version is >= 1.21
    deployment:
      enabled: false
      replicas: 1
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: "app.kubernetes.io/component"
                  operator: "In"
                  values:
                    - "controller"
            topologyKey: "kubernetes.io/hostname"
            namespaces:
              - "kube-system"
```
Hi,
There is an alert on the cluster dashboard:
[100% of the ingress-nginx/pushprox-ingress-nginx-client targets in cattle-monitoring-system namespace are down](https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-targetdown)
I checked the logs of the pushprox-ingress-nginx-client pod:
```
level=info ts=2022-10-09T21:37:54.303Z caller=main.go:232 msg="Got scrape request" scrape_id=7f394d24-98f3-4eda-a795-f64967426f8a url=http://<agent-host-ip>:10254/metrics
level=error ts=2022-10-09T21:37:54.303Z caller=main.go:101 err="failed to scrape http://127.0.0.1:10254/metrics: Get \"http://127.0.0.1:10254/metrics\": dial tcp 127.0.0.1:10254: connect: connection refused"
level=info ts=2022-10-09T21:37:54.304Z caller=main.go:113 msg="Pushed failed scrape response"
```

And I checked the output of Prometheus Targets:
serviceMonitor/cattle-monitoring-system/rancher-monitoring-ingress-nginx/0 (0/1 up)
| Endpoint | State | Labels | Last Scrape | Scrape Duration | Error |
|---|---|---|---|---|---|
| http:// | DOWN | component="ingress-nginx" endpoint="metrics" instance=" | 25.457s ago | 3.319ms | server returned HTTP status 500 Internal Server Error |
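A check that could be run on the agent node to see whether anything actually serves the metrics endpoint pushprox is trying to reach (the label selector below is taken from the chart's affinity rule and is only an assumption about how the controller pods are labelled):

```shell
# Does anything answer on the metrics port the client scrapes?
curl -s http://127.0.0.1:10254/metrics | head

# Where are the ingress-nginx controller pods actually running?
kubectl -n kube-system get pods -l app.kubernetes.io/component=controller -o wide
```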
from helm chart values.yaml:
```yaml
rke2IngressNginx:
  clients:
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - controller
            namespaces:
              - kube-system
            topologyKey: kubernetes.io/hostname
    deployment:
      enabled: true
      replicas: 1
    port: 10015
    tolerations:
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
    useLocalhost: true
  component: ingress-nginx
  enabled: true
  kubeVersionOverrides:
    - constraint: <= 1.20
      values:
        clients:
          deployment:
            enabled: false
  metricsPort: 10254
```
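To see which values Helm actually applied after kubeVersionOverrides, and whether the pushprox ingress-nginx client ended up as a Deployment or a DaemonSet, something like the following should work (the release name rancher-monitoring is an assumption):

```shell
# Effective (computed) values of the rke2IngressNginx sub-chart:
helm -n cattle-monitoring-system get values rancher-monitoring --all | grep -A 20 'rke2IngressNginx:'

# Is the pushprox ingress-nginx client a Deployment or a DaemonSet?
kubectl -n cattle-monitoring-system get deployments,daemonsets | grep pushprox-ingress-nginx
```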
```
<agent-node>:~# k get svc -n cattle-monitoring-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 34h
prometheus-operated ClusterIP None <none> 9090/TCP 34h
pushprox-ingress-nginx-client ClusterIP 10.43.150.200 <none> 10254/TCP 34h
pushprox-ingress-nginx-proxy ClusterIP 10.43.179.36 <none> 8080/TCP 34h
pushprox-kube-controller-manager-client ClusterIP 10.43.60.93 <none> 10257/TCP 34h
pushprox-kube-controller-manager-proxy ClusterIP 10.43.1.192 <none> 8080/TCP 34h
pushprox-kube-etcd-client ClusterIP 10.43.240.185 <none> 2381/TCP 34h
pushprox-kube-etcd-proxy ClusterIP 10.43.21.180 <none> 8080/TCP 34h
pushprox-kube-proxy-client ClusterIP 10.43.148.65 <none> 10249/TCP 34h
pushprox-kube-proxy-proxy ClusterIP 10.43.226.62 <none> 8080/TCP 34h
pushprox-kube-scheduler-client ClusterIP 10.43.122.24 <none> 10259/TCP 34h
pushprox-kube-scheduler-proxy ClusterIP 10.43.39.26 <none> 8080/TCP 34h
rancher-monitoring-alertmanager ClusterIP 10.43.20.177 <none> 9093/TCP 34h
rancher-monitoring-grafana ClusterIP 10.43.84.131 <none> 80/TCP 34h
rancher-monitoring-kube-state-metrics ClusterIP 10.43.98.216 <none> 8080/TCP 34h
rancher-monitoring-operator ClusterIP 10.43.22.230 <none> 443/TCP 34h
rancher-monitoring-prometheus ClusterIP 10.43.97.193 <none> 9090/TCP 34h
rancher-monitoring-prometheus-adapter ClusterIP 10.43.171.251 <none> 443/TCP 34h
rancher-monitoring-prometheus-node-exporter ClusterIP 10.43.42.19 <none> 9796/TCP 34h
```

```
<agent-node>:~# ss -tulpn
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
udp UNCONN 0 0 127.0.0.53%lo:53 0.0.0.0:* users:(("systemd-resolve",pid=917,fd=12))
udp UNCONN 0 0 0.0.0.0:111 0.0.0.0:* users:(("rpcbind",pid=881,fd=5),("systemd",pid=1,fd=119))
udp UNCONN 0 0 0.0.0.0:8472 0.0.0.0:*
udp UNCONN 0 0 127.0.0.1:323 0.0.0.0:* users:(("chronyd",pid=2297621,fd=5))
udp UNCONN 0 0 [::]:111 [::]:* users:(("rpcbind",pid=881,fd=7),("systemd",pid=1,fd=121))
udp UNCONN 0 0 [::1]:323 [::]:* users:(("chronyd",pid=2297621,fd=6))
tcp LISTEN 0 4096 127.0.0.1:10248 0.0.0.0:* users:(("kubelet",pid=2335393,fd=22))
tcp LISTEN 0 4096 127.0.0.1:10249 0.0.0.0:* users:(("kube-proxy",pid=2225,fd=13))
tcp LISTEN 0 4096 127.0.0.1:9099 0.0.0.0:* users:(("calico-node",pid=3607320,fd=9))
tcp LISTEN 0 4096 127.0.0.1:6443 0.0.0.0:* users:(("rke2",pid=2335348,fd=18))
tcp LISTEN 0 4096 127.0.0.1:6444 0.0.0.0:* users:(("rke2",pid=2335348,fd=8))
tcp LISTEN 0 4096 0.0.0.0:111 0.0.0.0:* users:(("rpcbind",pid=881,fd=4),("systemd",pid=1,fd=118))
tcp LISTEN 0 4096 127.0.0.1:10256 0.0.0.0:* users:(("kube-proxy",pid=2225,fd=7))
tcp LISTEN 0 4096 127.0.0.53%lo:53 0.0.0.0:* users:(("systemd-resolve",pid=917,fd=13))
tcp LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=994,fd=3))
tcp LISTEN 0 100 127.0.0.1:25 0.0.0.0:* users:(("master",pid=2290290,fd=13))
tcp LISTEN 0 128 127.0.0.1:6010 0.0.0.0:* users:(("sshd",pid=3610407,fd=10))
tcp LISTEN 0 4096 127.0.0.1:10010 0.0.0.0:* users:(("containerd",pid=2335363,fd=361))
tcp LISTEN 0 4096 *:10250 *:* users:(("kubelet",pid=2335393,fd=34))
tcp LISTEN 0 4096 [::]:111 [::]:* users:(("rpcbind",pid=881,fd=6),("systemd",pid=1,fd=120))
tcp LISTEN 0 128 [::]:22 [::]:* users:(("sshd",pid=994,fd=4))
tcp LISTEN 0 100 [::1]:25 [::]:* users:(("master",pid=2290290,fd=14))
tcp LISTEN 0 4096 *:9369 *:* users:(("pushprox-client",pid=2184913,fd=3))
tcp LISTEN 0 128 [::1]:6010 [::]:* users:(("sshd",pid=3610407,fd=9))
tcp LISTEN 0 4096 *:9091 *:* users:(("calico-node",pid=3607320,fd=14))
tcp LISTEN 0 4096 *:9796 *:* users:(("node_exporter",pid=2184852,fd=3))
```

```
<agent-node>:~# k get po -n cattle-monitoring-system
NAME READY STATUS RESTARTS AGE
alertmanager-rancher-monitoring-alertmanager-0 2/2 Running 0 34h
prometheus-rancher-monitoring-prometheus-0 3/3 Running 0 34h
pushprox-ingress-nginx-client-67fdccf9d-qxg8w 1/1 Running 0 54m
pushprox-ingress-nginx-proxy-5497b7dbd-p9mbt 1/1 Running 0 34h
pushprox-kube-controller-manager-client-6vcg6 1/1 Running 0 34h
pushprox-kube-controller-manager-proxy-64f6dc94c6-l5ml2 1/1 Running 0 34h
pushprox-kube-etcd-client-f6qx7 1/1 Running 0 34h
pushprox-kube-etcd-proxy-55544d768d-pphxx 1/1 Running 0 34h
pushprox-kube-proxy-client-bmxpx 1/1 Running 0 34h
pushprox-kube-proxy-client-cx6r2 1/1 Running 0 34h
pushprox-kube-proxy-client-m8knf 1/1 Running 0 34h
pushprox-kube-proxy-proxy-85f89bcc4d-cp6cd 1/1 Running 0 34h
pushprox-kube-scheduler-client-wldxg 1/1 Running 0 34h
pushprox-kube-scheduler-proxy-6cb664c86b-8c7mq 1/1 Running 0 34h
rancher-monitoring-grafana-586df56bff-nlvgz 3/3 Running 0 34h
rancher-monitoring-kube-state-metrics-77ddfd789b-tmjvn 1/1 Running 0 34h
rancher-monitoring-operator-79cdfbcf48-nh9ck 1/1 Running 0 34h
rancher-monitoring-prometheus-adapter-79d8db9697-nvsxv 1/1 Running 0 34h
rancher-monitoring-prometheus-node-exporter-jd8wx 1/1 Running 0 34h
rancher-monitoring-prometheus-node-exporter-s6b2z 1/1 Running 0 34h
rancher-monitoring-prometheus-node-exporter-vb52j 1/1 Running 0 34h
```

The Grafana node exporter/nodes dashboard generally works, but it sometimes throws errors.
What is wrong? I am still trying to understand Rancher, RKE2, and their monitoring integrations, so sorry if this is a simple question...
@jiaqiluo I saw that the corresponding ticket was finished a while ago: https://github.com/rancher/charts/pull/1227. Is the frontend work still needed?