Loki gateway metrics (Nginx)

DanielCastronovo opened this issue 2 years ago • 6 comments

Is your feature request related to a problem? Please describe. I'm not able to tell whether Loki Gateway (Nginx) is fully operational; I only have its logs.

Describe the solution you'd like Enable the nginx exporter + a ServiceMonitor, and create a dashboard + alerts.
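
For the alerting part, a minimal PrometheusRule sketch along these lines would cover the "gateway down" case once an nginx exporter is being scraped. The names, job matcher, and threshold below are illustrative, not something the chart ships:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-gateway-nginx
  namespace: monitoring
spec:
  groups:
    - name: loki-gateway-nginx
      rules:
        - alert: LokiGatewayNginxDown
          # nginx_up is exported by nginx-prometheus-exporter; adjust the job matcher to your labels
          expr: nginx_up{job=~".*loki-gateway.*"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Loki gateway nginx is down or its stub_status endpoint is unreachable.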

DanielCastronovo avatar May 25 '23 16:05 DanielCastronovo

Hey, I enabled monitoring in the Helm chart but I'm getting TargetDown for the loki-gateway scrape target:

monitoring:
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
  dashboards:
    enabled: true
  rules:
    enabled: true
  serviceMonitor:
    enabled: true
  lokiCanary:
    enabled: false

Alerts:

[FIRING:1] :warning: TargetDown • 100% of the monitoring/loki-gateway/loki-gateway targets in monitoring namespace are down.

This is using Alertmanager with Prometheus. Any idea which values I need to set to configure nginx-exporter for the loki-gateway pod in Kubernetes?

Cheers

paltaa avatar May 06 '24 17:05 paltaa

Took a look at the rendered CRDs:

Name:         loki
Namespace:    monitoring
Labels:       app.kubernetes.io/instance=loki
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=loki
              app.kubernetes.io/version=3.0.0
              argocd.argoproj.io/instance=loki
              helm.sh/chart=loki-6.5.0
Annotations:  <none>
API Version:  monitoring.coreos.com/v1
Kind:         ServiceMonitor
Metadata:
  Creation Timestamp:  2024-02-28T13:15:15Z
  Generation:          1
  Resource Version:    40402766
  UID:                 7d63382c-2cf4-45ab-9200-f3239a2dda76
Spec:
  Endpoints:
    Interval:  15s
    Path:      /metrics
    Port:      http-metrics
    Relabelings:
      Action:       replace
      Replacement:  monitoring/$1
      Source Labels:
        job
      Target Label:  job
      Action:        replace
      Replacement:   loki
      Target Label:  cluster
    Scheme:          http
  Selector:
    Match Expressions:
      Key:       prometheus.io/service-monitor
      Operator:  NotIn
      Values:
        false
    Match Labels:
      app.kubernetes.io/instance:  loki
      app.kubernetes.io/name:      loki
Events:                            <none>

It's just a ServiceMonitor pointing to a broken service endpoint, so we can safely disable it for the moment:

monitoring:
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
  dashboards:
    enabled: false
  rules:
    enabled: false
  serviceMonitor:
    enabled: false
  lokiCanary:
    enabled: false

paltaa avatar May 08 '24 17:05 paltaa

Seems like the /metrics path is not defined in nginx.conf for loki-gateway: https://github.com/grafana/loki/blob/main/production/helm/loki/templates/_helpers.tpl#L750-L1014

But this endpoint is defined in the loki-gateway deployment template: https://github.com/grafana/loki/blob/main/production/helm/loki/templates/gateway/deployment-gateway-nginx.yaml#L63-L66

The ServiceMonitor is created for Prometheus to scrape all http-metrics endpoints, so Prometheus gets a 404 when it tries to scrape /metrics:

10.244.4.42 - - [26/May/2024:10:01:37 +0000]  404 "GET /metrics HTTP/1.1" 153 "-" "Prometheus/2.51.1" "-"
10.244.4.42 - - [26/May/2024:10:01:52 +0000]  404 "GET /metrics HTTP/1.1" 153 "-" "Prometheus/2.51.1" "-"

IMO the dirty fix is to set serviceMonitor.enabled: false, as @paltaa suggested, but that disables monitoring for the whole Loki deployment.

Eyeless77 avatar May 26 '24 10:05 Eyeless77

Looks like in the 2.x Helm charts the port name was previously just http: https://github.com/grafana/loki/blob/v2.9.8/production/helm/loki/templates/gateway/deployment-gateway.yaml#L62

Now it has been renamed to http-metrics and is also used by the readinessProbe of the gateway deployment: https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml#L1019-L1022
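
For reference, the linked values block is roughly the following (paraphrased; exact defaults may differ between chart versions), which is why renaming the port back would also mean touching the probe:

gateway:
  readinessProbe:
    httpGet:
      path: /
      port: http-metrics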

Eyeless77 avatar May 26 '24 10:05 Eyeless77

Suffering from the same issue.

A bit nicer workaround: the ServiceMonitor's selector excludes any service carrying the label prometheus.io/service-monitor: "false". By adding that label to your gateway Service, it gets excluded from scraping until the above is fixed in the Helm chart itself.

values.yaml

gateway:
  service:
    labels:
      prometheus.io/service-monitor: "false"

Pionerd avatar May 26 '24 11:05 Pionerd

In our case, before the upgrade to v3 (chart v5.20.0), we didn't have Prometheus scraping the gateway pods, likely because the port names didn't match:

# ServiceMonitor (excerpt)
kind: ServiceMonitor
  endpoints:
    - port: http-metrics
      path: /metrics
---
# Deployment (excerpt)
kind: Deployment
metadata:
  name: loki-gateway
          ports:
            - name: http

After upgrading to v3 (chart v6.6.1) we did get monitoring of the gateway pods (they now expose the http-metrics port), but since we enabled auth on the gateway (basicAuth: enabled: true), Prometheus scraping now gets a 401 response:

server returned HTTP status 401 Unauthorized
http://10.1.5.228:8080/metrics

What is the best practice here? Is it possible to add a Helm chart option that disables authentication only for the metrics endpoint in the gateway nginx? Or is adding auth credentials to the Prometheus scrape the preferred option?
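
If you go the credentials route (keeping in mind the next comment: the gateway has to actually serve /metrics first), the Prometheus Operator's ServiceMonitor endpoints do support basicAuth backed by a Secret. A standalone sketch with illustrative names; the selector labels are assumptions and should be checked against your gateway Service:

apiVersion: v1
kind: Secret
metadata:
  name: loki-gateway-scrape-auth
  namespace: monitoring
stringData:
  username: prometheus          # must exist in the gateway's basic auth htpasswd
  password: <password>
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: loki-gateway-authed
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: loki
      app.kubernetes.io/component: gateway   # assumed label; verify on your release
  endpoints:
    - port: http-metrics
      path: /metrics
      basicAuth:
        username:
          name: loki-gateway-scrape-auth
          key: username
        password:
          name: loki-gateway-scrape-auth
          key: password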

akorp avatar May 29 '24 08:05 akorp

@akorp the issue is not auth; the issue is that /metrics is not handled at all. Having auth enabled just makes the request fail with a 401 instead of a 404.

This commit introduced the change, seemingly as a drive-by: https://github.com/grafana/loki/commit/79b876b65d55c54f4d532e98dc24743dea8bedec#diff-d79225d50b6c12d41bceaed705a35fd5b5fff56f829fbbe5744ce6be632a0038

I think the port rename should be reverted. Until then @Pionerd's workaround is probably the best.

pschichtel avatar Jun 03 '24 01:06 pschichtel

@DanielCastronovo How is this completed?

Pionerd avatar Jun 13 '24 12:06 Pionerd

Still seems to be an issue here as well.

Worked around it using:

gateway:
  service:
    labels:
      prometheus.io/service-monitor: "false"

ThePooN avatar Jun 21 '24 16:06 ThePooN

Not completed, still an issue. Please reopen.

They probably closed it because they moved their monitoring to the new, even less complete meta-monitoring chart...

ohdearaugustin avatar Jun 26 '24 19:06 ohdearaugustin

Same issue.

konglingning avatar Aug 12 '24 08:08 konglingning

Same. Please reopen.

KA-ROM avatar Aug 22 '24 10:08 KA-ROM

I recently upgraded to v6.10.0 of the Helm chart and ran into this same issue. I worked around it by deploying nginx-prometheus-exporter alongside nginx in the loki-gateway deployment. This is how I did it:

loki chart values snippet

gateway:
  nginxConfig:
    serverSnippet: |
      location = /stub_status {
        stub_status on;
        allow 127.0.0.1;
        deny all;
      }
      location = /metrics {
        proxy_pass       http://127.0.0.1:9113/metrics;
      }
  extraContainers:
    - name: nginx-exporter
      securityContext:
        allowPrivilegeEscalation: false
      image: nginx/nginx-prometheus-exporter:1.3.0
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 9113
          name: http-exporter
      resources:
        limits:
          memory: 128Mi
          cpu: 500m
        requests:
          memory: 64Mi
          cpu: 100m
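
One note on the snippet above (mine, not the original poster's): nginx-prometheus-exporter defaults its scrape URI to http://127.0.0.1:8080/stub_status, which happens to match the gateway's default listen port, so it works as pasted. If your gateway listens on a different port, the exporter container likely needs the URI spelled out, for example:

      args:
        # <gateway-listen-port> is a placeholder for whatever port your gateway nginx listens on
        - --nginx.scrape-uri=http://127.0.0.1:<gateway-listen-port>/stub_status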

vrivellino avatar Aug 28 '24 16:08 vrivellino

> I recently upgraded to v6.10.0 of the helm chart and experienced this same issue. I worked around it by deploying nginx-prometheus-exporter alongside nginx in the loki-gateway deployment. [...]

Thanks for this, I too just ran into this with the chart upgrade.

hollanbm avatar Aug 29 '24 02:08 hollanbm