retina-operator shows up as unhealthy in Prometheus targets
**Describe the bug**
The retina-operator pod shows up as unhealthy in the Prometheus targets list.
**To Reproduce**
Using an AKS cluster with 2 nodes:
- Install Retina:

  ```bash
  helm upgrade --install retina oci://ghcr.io/microsoft/retina/charts/retina \
    --version v0.0.2 \
    --namespace kube-system \
    --set image.tag=v0.0.2 \
    --set operator.tag=v0.0.2 \
    --set image.pullPolicy=Always \
    --set logLevel=info \
    --set os.windows=true \
    --set operator.enabled=true \
    --set operator.enableRetinaEndpoint=true \
    --skip-crds \
    --set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\,packetparser\]" \
    --set enablePodLevel=true \
    --set enableAnnotations=true
  ```
- Install Prometheus (follow the instructions in the doc); the values file wires Retina's scrape job into the kube-prometheus-stack deployment (see the sketch after this list):

  ```bash
  helm install prometheus -n kube-system -f deploy/prometheus/values.yaml prometheus-community/kube-prometheus-stack
  ```

- Open Prometheus and go to `localhost:9090/targets`.
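For context, a rough sketch of how a scrape job like this is typically injected through kube-prometheus-stack's `additionalScrapeConfigs` value is shown below. This is not the actual contents of `deploy/prometheus/values.yaml`; the job name and discovery settings are illustrative only.

```yaml
# Illustrative only -- not the real deploy/prometheus/values.yaml.
# kube-prometheus-stack can append raw Prometheus scrape jobs via
# prometheus.prometheusSpec.additionalScrapeConfigs; the Retina values file
# is assumed to use this mechanism to scrape the Retina pods.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: retina-pods          # illustrative job name
        kubernetes_sd_configs:
          - role: pod                  # discover pods via the Kubernetes API
        # relabel_configs (omitted here) decide which discovered pods are
        # kept as scrape targets; see the discussion further down.
```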
**Expected behavior**
All retina pods should show as green. The retina-operator pod either shouldn't be included in the targets list at all, or its endpoint/port should be fixed (if the operator is meant to serve metrics as well). Note that the operator pod itself is up and running.
**Actual behavior**
The two retina agent pods are up and running; however, the retina-operator pod shows up in red (unhealthy).
**Screenshots**
**Platform (please complete the following information):**
- OS: Linux (AKS)
- Kubernetes Version: 1.27.9
- Host: AKS
- Retina Version: v0.0.2
+1, seeing the same thing
Hey @peterj and @CecileRobertMichon, this is a benign error and shouldn't affect anything (beyond Prometheus reporting the operator as down). The operator doesn't serve any metrics and shouldn't be included in this scrape config; we've probably caught it accidentally with the label selector and will adjust that.
I think the cause is https://github.com/microsoft/retina/blob/main/deploy/prometheus/retina/prometheus-config#L10: instead of matching only the Retina DaemonSet pods, we're also catching retina-operator because the `retina(.*)` regex is too broad. Good first issue for someone to fix 🙂
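For illustration, one possible way to tighten that keep rule is sketched below. The `retina-agent` pod-name prefix is an assumption about how the agent DaemonSet pods are named, so treat this as a starting point rather than the actual fix:

```yaml
# Hypothetical narrowing of the keep rule in
# deploy/prometheus/retina/prometheus-config: scope the regex to the agent
# pods so retina-operator is no longer selected. The "retina-agent" prefix
# is an assumption; matching on a pod label would be a more robust option.
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_name]
    action: keep
    regex: retina-agent(.*)
```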
The PodMonitor or ServiceMonitor deployed via #695 should provide a more targeted way to select the relevant pods for Prometheus to scrape, so this issue can be closed.
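For reference, a minimal PodMonitor along those lines could look like the sketch below; the label selector and port name are placeholders and would need to match the actual Retina agent pod labels and named metrics port:

```yaml
# Minimal PodMonitor sketch (prometheus-operator CRD); not the manifest
# shipped by #695. The label and port name below are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: retina-pods
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: retina        # placeholder: use the agent pods' real label
  podMetricsEndpoints:
    - port: metrics          # placeholder: must match the named container port
```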
@whatnick for now I pushed #770 with the `additionalScrapeConfigs` fix, since the doc instructs users to deploy Prometheus with the existing scrapeConfig under deploy/legacy/prometheus/values.yaml, which is the cause of the issue reported here.