retina-operator shows up as unhealthy in Prometheus targets

Open peterj opened this issue 1 year ago • 3 comments

Describe the bug

The retina-operator pod shows up as unhealthy in the Prometheus targets list.

To Reproduce

Using an AKS cluster with 2 nodes:

  1. Install Retina:
helm upgrade --install retina oci://ghcr.io/microsoft/retina/charts/retina \
    --version v0.0.2 \
    --namespace kube-system \
    --set image.tag=v0.0.2 \
    --set operator.tag=v0.0.2 \
    --set image.pullPolicy=Always \
    --set logLevel=info \
    --set os.windows=true \
    --set operator.enabled=true \
    --set operator.enableRetinaEndpoint=true \
    --skip-crds \
    --set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\,packetparser\]" \
    --set enablePodLevel=true \
    --set enableAnnotations=true
  2. Install Prometheus (follow the instructions in the doc):
helm install prometheus -n kube-system -f deploy/prometheus/values.yaml prometheus-community/kube-prometheus-stack
  3. Open Prometheus and go to localhost:9090/targets

Expected behavior

All Retina targets should be green. The retina-operator pod should either be excluded from the targets list or, if the operator does serve metrics, have its endpoint/port fixed. (Note that the operator pod itself is up and running.)

Actual behavior

The two Retina agent pods are up and running; however, the retina-operator pod shows up in red (unhealthy).

Screenshots

[Screenshot 2024-03-28 at 2 04 16 PM]

Platform (please complete the following information):

  • OS: Linux (AKS)
  • Kubernetes Version: 1.27.9
  • Host: AKS
  • Retina Version: v0.0.2

peterj, Mar 28 '24 21:03

+1, seeing the same thing

CecileRobertMichon, Apr 08 '24 21:04

Hey @peterj and @CecileRobertMichon, this is a benign error and shouldn't affect anything (other than Prometheus reporting the operator as down). The operator doesn't serve any metrics and shouldn't be included in this scrape config; we have probably caught it accidentally with the label selector and will adjust that.

rbtr, Apr 08 '24 22:04

I think the cause is https://github.com/microsoft/retina/blob/main/deploy/prometheus/retina/prometheus-config#L10 where, instead of matching only the Retina daemonset pods, we're also catching retina-operator because the retina(.*) regex is too broad. Good first issue for someone to fix 🙂
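To illustrate why the pattern over-matches, here is a minimal sketch in shell. Prometheus relabel regexes are anchored at both ends, which `grep -x` emulates here. The two names tested are illustrative assumptions, as is the stricter `retina` pattern; neither is taken from the linked config, and the actual fix may look different.

```shell
#!/usr/bin/env bash
# Emulate a Prometheus "keep" relabel rule: the regex must match the
# whole value (Prometheus anchors relabel regexes at both ends).
matches() {
  # $1 = regex, $2 = candidate name; prints "keep" or "drop"
  if printf '%s\n' "$2" | grep -Eqx -- "$1"; then
    echo keep
  else
    echo drop
  fi
}

# Hypothetical names for the agent and the operator (assumptions).
for name in retina retina-operator; do
  echo "retina(.*) -> $name: $(matches 'retina(.*)' "$name")"
  echo "retina     -> $name: $(matches 'retina' "$name")"
done
```

With the broad pattern both names are kept; with the exact pattern only `retina` is kept and `retina-operator` is dropped, which is the behavior the scrape config is after.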

rbtr, Apr 08 '24 22:04

The PodMonitor or ServiceMonitor deployed via #695 should be a more specific way of selecting the relevant pods for Prometheus to scrape, so this issue can be closed.

whatnick, Sep 21 '24 09:09

@whatnick for now I pushed #770 with the additionalScrapeConfigs fix, since the doc still gives instructions to deploy Prometheus with the existing scrapeConfig under deploy/legacy/prometheus/values.yaml, which is the cause of the issue reported here.

SRodi, Sep 23 '24 13:09