retina-operator shows up as unhealthy in Prometheus targets
**Describe the bug**
The retina-operator pod shows up as unhealthy in the Prometheus targets list.
**To Reproduce**
Using an AKS cluster with 2 nodes:
- Install Retina:

  ```bash
  helm upgrade --install retina oci://ghcr.io/microsoft/retina/charts/retina \
    --version v0.0.2 \
    --namespace kube-system \
    --set image.tag=v0.0.2 \
    --set operator.tag=v0.0.2 \
    --set image.pullPolicy=Always \
    --set logLevel=info \
    --set os.windows=true \
    --set operator.enabled=true \
    --set operator.enableRetinaEndpoint=true \
    --skip-crds \
    --set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\,packetparser\]" \
    --set enablePodLevel=true \
    --set enableAnnotations=true
  ```
- Install Prometheus (follow the instructions in the doc); the values file wires Retina's scrape job into the kube-prometheus-stack deployment (see the sketch after this list):

  ```bash
  helm install prometheus -n kube-system -f deploy/prometheus/values.yaml prometheus-community/kube-prometheus-stack
  ```

- Open Prometheus and go to `localhost:9090/targets`.
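For context, a rough sketch of how a scrape job like this is typically injected through kube-prometheus-stack's `additionalScrapeConfigs` value is shown below. This is not the actual contents of `deploy/prometheus/values.yaml`; the job name and discovery settings are illustrative only.

```yaml
# Illustrative only -- not the real deploy/prometheus/values.yaml.
# kube-prometheus-stack can append raw Prometheus scrape jobs via
# prometheus.prometheusSpec.additionalScrapeConfigs; the Retina values file
# is assumed to use this mechanism to scrape the Retina pods.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: retina-pods          # illustrative job name
        kubernetes_sd_configs:
          - role: pod                  # discover pods via the Kubernetes API
        # relabel_configs (omitted here) decide which discovered pods are
        # kept as scrape targets; see the discussion further down.
```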
**Expected behavior**
All retina pods should show as green. The retina-operator pod either shouldn't be included in the targets list at all, or its endpoint/port should be fixed (if the operator is meant to serve metrics as well). Note that the operator pod itself is up and running.
**Actual behavior**
The two retina agent pods are up and running; however, the retina-operator pod shows up in red (unhealthy).
**Screenshots**
**Platform (please complete the following information):**
- OS: Linux (AKS)
- Kubernetes Version: 1.27.9
- Host: AKS
- Retina Version: v0.0.2
+1, seeing the same thing
Hey @peterj and @CecileRobertMichon, this is a benign error and shouldn't affect anything (beyond Prometheus reporting the operator as down). The operator doesn't serve any metrics and shouldn't be included in this scrape config; we've probably caught it accidentally with the label selector and will adjust that.
I think the cause is https://github.com/microsoft/retina/blob/main/deploy/prometheus/retina/prometheus-config#L10: instead of matching only the Retina DaemonSet pods, we're also catching retina-operator because the `retina(.*)` regex is too broad. Good first issue for someone to fix 🙂
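For illustration, one possible way to tighten that keep rule is sketched below. The `retina-agent` pod-name prefix is an assumption about how the agent DaemonSet pods are named, so treat this as a starting point rather than the actual fix:

```yaml
# Hypothetical narrowing of the keep rule in
# deploy/prometheus/retina/prometheus-config: scope the regex to the agent
# pods so retina-operator is no longer selected. The "retina-agent" prefix
# is an assumption; matching on a pod label would be a more robust option.
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_name]
    action: keep
    regex: retina-agent(.*)
```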
The PodMonitor or ServiceMonitor deployed via #695 should provide a more targeted way to select the relevant pods for Prometheus to scrape, so this issue can be closed.
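For reference, a minimal PodMonitor along those lines could look like the sketch below; the label selector and port name are placeholders and would need to match the actual Retina agent pod labels and named metrics port:

```yaml
# Minimal PodMonitor sketch (prometheus-operator CRD); not the manifest
# shipped by #695. The label and port name below are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: retina-pods
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: retina        # placeholder: use the agent pods' real label
  podMetricsEndpoints:
    - port: metrics          # placeholder: must match the named container port
```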
@whatnick for now I pushed #770 with the `additionalScrapeConfigs` fix, since the doc instructs users to deploy Prometheus with the existing scrapeConfig under deploy/legacy/prometheus/values.yaml, which is the cause of the issue reported here.