
Preview of Google Managed Prometheus "PodMonitoring" custom resource analyzer.


👋🏽 Hello! This PR is only meant to be an example of a custom analyzer that service providers could create for their Kubernetes objects. The main analyzer code is in the podmonitoring.go file.

It's also meant to be a practical example of some additional features/extensions to K8sGPT I needed (or played with) to make this work.

Usage

In principle, this analyzer allows analyzing the GMP PodMonitoring CR (very similar to the Prometheus Operator PodMonitor).

For example, if someone mistypes the selector labels that should match the pods (a very common misconfiguration), we notice that:

(screenshot: analyzer output for the mistyped selector)
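For illustration, here is a minimal sketch of what that selector check could look like. This is not the actual podmonitoring.go code: the package name, function name and failure text are made up; only the client-go calls are real.

```go
package podmonitoring

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// selectorFailure lists pods matching the PodMonitoring selector and returns a
// failure message when nothing matches, i.e. the labels are likely mistyped.
func selectorFailure(ctx context.Context, client kubernetes.Interface, ns string, matchLabels map[string]string) (string, error) {
	sel := metav1.FormatLabelSelector(&metav1.LabelSelector{MatchLabels: matchLabels})
	pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: sel})
	if err != nil {
		return "", err
	}
	if len(pods.Items) == 0 {
		// No pod carries these labels -- the most likely cause is a typo in the selector.
		return fmt.Sprintf("PodMonitoring selector %q matches no pods in namespace %q", sel, ns), nil
	}
	return "", nil
}
```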

Another example: if someone has correct selector labels, but the pods are not running:

(screenshot: analyzer output for non-running pods)

Notice that the GMP analyzer can schedule another analyzer (!); more on that in "Extensions".
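As a sketch only (the function name is made up, and the mechanism for scheduling the Pod analyzer is what this PR proposes, not something in upstream k8sgpt today), the "pods exist but are not running" case could be detected like this:

```go
package podmonitoring

import corev1 "k8s.io/api/core/v1"

// notRunning returns the names of matched pods that are not in the Running
// phase. For those, the analyzer could attach a failure and ask k8sgpt to
// also run the Pod analyzer on them (see "Extensions" below).
func notRunning(pods []corev1.Pod) []string {
	var names []string
	for _, p := range pods {
		if p.Status.Phase != corev1.PodRunning {
			names = append(names, p.Name)
		}
	}
	return names
}
```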

Finally, when you misconfigure the port field (another common mistake that is hard to validate):

(screenshot: analyzer output for the misconfigured port)
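The idea behind the port check can be sketched as comparing the scrape endpoint's port against the ports the matched pods' containers actually declare. The field names below (portName, portNumber) are hypothetical stand-ins, not the real GMP endpoint types:

```go
package podmonitoring

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// portDeclared reports whether any container in the pod declares the scrape
// port, either by name or by number.
func portDeclared(pod corev1.Pod, portName string, portNumber int32) bool {
	for _, c := range pod.Spec.Containers {
		for _, p := range c.Ports {
			if (portName != "" && p.Name == portName) || (portNumber != 0 && p.ContainerPort == portNumber) {
				return true
			}
		}
	}
	return false
}

// Example failure text the analyzer could emit when nothing matches.
func portFailure(pod corev1.Pod, portName string) string {
	return fmt.Sprintf("pod %s has no container port named %q; scraping will fail", pod.Name, portName)
}
```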

Extensions

To make this work with all the extras, I had to modify K8sGPT on top of the custom analyzer code (plus the CR types). Perhaps some of these changes would be useful to consider, e.g. for others to use in their custom/built-in analyzers:

  1. Ability to select an exact resource (e.g. a pod) and analyze only that. Cherry-picked plus fixed from https://github.com/k8sgpt-ai/k8sgpt/pull/483
  2. Verbose flag to print more "debug" logs and the prompt that will be used for the LLM.
  3. Ability for an analyzer to specify in common.Failure ANOTHER analyzer to run based on the error. For example, in our case we noticed PodMonitoring was not working because the Pod was not running. Scheduling the Pod analyzer is helpful in this case and requires only a minor code change 💪🏽
  4. The LLM explanation is now asked per Error/Failure, NOT only per Result. I was a bit confused why we run the LLM on multiple Failures within one Result. For my analyzer it made more sense to run the LLM on each Error/Failure. This allows the better prompt customization explained later on.
  5. Prompt templates now use Go templates.
  6. Full customization of the prompt per common.Failure. When creating a potential common.Failure, analyzers can adjust the full prompt using common.Failure.CustomPromptTemplate, as well as common.Failure.AdditionalContextText and common.Failure.NextStepsText. This allows much better LLM results for specific problems (see the sketch after this list).
  7. Not implemented here, but I started to explore common.Failure.UsefulQuestions logic, where an analyzer could schedule specific questions to the LLM and render their answers into the general LLM answer. This allows asking specifically for e.g. "I see Pod X and PodMonitoring Y. Here are their YAMLs. Propose what's wrong here in 200 characters." Otherwise it's impossible to get the LLM to give you that in one prompt. Not fully implemented, nor verified.
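To make points 5 and 6 concrete, here is a minimal sketch of how a per-Failure prompt could be assembled with Go's text/template. The Failure struct below only mirrors the shape proposed in this PR; the prompt-related fields are not part of upstream k8sgpt's common.Failure, and the default template text is invented for illustration:

```go
package podmonitoring

import (
	"bytes"
	"text/template"
)

// Failure mirrors the fields this PR proposes on common.Failure.
// A field such as "NextAnalyzers []string" could similarly carry the
// follow-up analyzers from point 3 (hypothetical).
type Failure struct {
	Text                  string
	CustomPromptTemplate  string
	AdditionalContextText string
	NextStepsText         string
}

// renderPrompt fills a Go text/template with the failure details, roughly how
// a fully customized per-Failure prompt could be produced.
func renderPrompt(f Failure) (string, error) {
	tmpl := f.CustomPromptTemplate
	if tmpl == "" {
		tmpl = "Explain the following Kubernetes problem: {{ .Text }}\n" +
			"Context: {{ .AdditionalContextText }}\n" +
			"Suggested next steps: {{ .NextStepsText }}"
	}
	t, err := template.New("prompt").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, f); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```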

Next Steps

  • Chat about each feature to see if it makes sense (if it's useful)
  • Proper sanitization of those fields, cleanup of TODOs, etc.
  • Custom plugins are essential here; we don't want to have/depend on some custom resources etc. Another idea, on top of those discussed on Slack, is to perhaps use integrations which, when deployed, could install another binary and use it for that analysis 🤔

Summary

Generally, writing a useful analyzer takes time and it's a bit fragile. However, it's super, super useful. In fact, from those specific findings you could even build a --fix flag to fix things in place.

bwplotka · Jan 11 '24