
secrets: Improve debuggability & reliability of misconfigured *Monitoring CRs with secrets.

Open bwplotka opened this issue 10 months ago • 0 comments

(This relates to an unreleased feature from PR https://github.com/GoogleCloudPlatform/prometheus-engine/pull/776.)

When a secret is configured in e.g. PodMonitoring but Prometheus cannot find it, we get a nice Target Page error:

[image: screenshot of the Target Page error]

Hopefully this works with the Target Status feature too. I think it does not fail the Prometheus config apply, but I didn't check.

However, when the user forgets to add permissions for an existing, correctly referenced secret, Prometheus scrape config parsing (and reloading) fails: we get a cryptic "unknown" error, and the status page shows 401 Unauthorized.

Full log:

{"caller":"main.go:1326","err":"unable to watch secret default/go-synthetic-basic-auth: unknown (get secrets)","level":"error","msg":"Failed to apply configuration","ts":"2024-03-26T21:24:20.265Z"}
{"caller":"main.go:1043","err":"one or more errors occurred while applying the new configuration (--config.file=\"/prometheus/config_out/config.yaml\")","level":"error","msg":"Error reloading config","ts":"2024-03-26T21:24:20.266Z"}
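
For triage, the first log line above can be matched mechanically. Below is a minimal Python sketch, not part of prometheus-engine: the function name `find_secret_watch_failures` is hypothetical, and the error wording it matches is taken only from this log, so it may differ in other Prometheus versions.

```python
import json
import re

# Pattern copied from the "err" field in the log above. The exact wording is an
# assumption and may change between Prometheus versions.
SECRET_WATCH_ERR = re.compile(
    r"unable to watch secret (?P<namespace>[^/\s]+)/(?P<name>[^:\s]+): (?P<reason>.*)"
)


def find_secret_watch_failures(log_text):
    """Return (namespace, name, reason) for each secret watch failure found."""
    failures = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip truncated or non-JSON lines instead of aborting the scan.
            continue
        if not isinstance(entry, dict):
            continue
        match = SECRET_WATCH_ERR.search(str(entry.get("err", "")))
        if match:
            failures.append(
                (match["namespace"], match["name"], match["reason"])
            )
    return failures
```

Run against the first log line above, this returns one entry: `("default", "go-synthetic-basic-auth", "unknown (get secrets)")`. The reason "unknown" alone does not prove a permission problem, so treat a hit as a prompt to check RBAC, not as a diagnosis.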

The consequences of a failed config reload are not as bad as I initially thought: only per-reloader, per-job functionality stops in some state. Still, perhaps there is a way to show a consistent status page error instead of failing the apply.

I have a GKE cluster ready with your changes applied (I will keep it running for some time) if you want to check, e.g. @TheSpiritXIII

AC (acceptance criteria):

  • Ideally, a permission error does not fail the configuration apply but behaves like a not-found secret, a not-found port, etc.
  • Ideally, a permission error results in a more descriptive error log/status than "unknown".
  • Double-check the Target Status feature for not-found / no-permission errors related to secrets.

Nice to have:

  • Ideally, the operator logs (or provides in status or via webhook) the exact RBAC role + binding to apply when missing. This is a bit hard to do in the webhook, but easy to log on the collector. The latter, however, is a bit deep for customers to find. Putting two small-ish YAMLs through target status might be odd too (maybe fine?). For this case we might want to put it in the "analysis/troubleshooting" CLI/functionality we discussed one day.
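
To make the "exact RBAC role + binding" idea concrete, here is a rough Python sketch of rendering the two manifests for one secret. Everything in it is illustrative: `render_secret_rbac` and the `<secret>-read` naming are hypothetical, the verbs `get`, `list`, `watch` are an assumption based on the "unable to watch secret" error, and the service account name and namespace are parameters because the issue does not state them.

```python
# Standard Kubernetes RBAC manifests (rbac.authorization.k8s.io/v1).
# The verbs are an assumption; adjust them to what the collector really needs.
RBAC_TEMPLATE = """\
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: {role_name}
  namespace: {secret_namespace}
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["{secret_name}"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: {role_name}
  namespace: {secret_namespace}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: {role_name}
subjects:
- kind: ServiceAccount
  name: {sa_name}
  namespace: {sa_namespace}
"""


def render_secret_rbac(secret_namespace, secret_name, sa_namespace, sa_name):
    """Render a Role + RoleBinding granting one service account read access
    to a single named secret. The role name is a hypothetical convention."""
    return RBAC_TEMPLATE.format(
        role_name=f"{secret_name}-read",
        secret_namespace=secret_namespace,
        secret_name=secret_name,
        sa_namespace=sa_namespace,
        sa_name=sa_name,
    )


if __name__ == "__main__":
    # "gmp-system" / "collector" are placeholder values for illustration only.
    print(render_secret_rbac("default", "go-synthetic-basic-auth",
                             "gmp-system", "collector"))
```

Scoping the Role with `resourceNames` keeps the grant limited to the referenced secret, which is why the output stays at two small manifests per secret.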

bwplotka, Mar 27 '24 09:03