Preview of Google Managed Prometheus "PodMonitoring" custom resource analyzer.
👋🏽 Hello! This PR is only meant to be an example of a custom analyzer that service providers could create for their Kubernetes objects. The main analyzer code is in the `podmonitoring.go` file.
It's also meant to be a practical example of some additional K8sGPT features/extensions I needed (or played with) to make this work.
Usage
In principle, this analyzer allows analyzing the GMP PodMonitoring CR (very similar to the Prometheus Operator PodMonitor).
For example, if someone mistypes the `selector` labels that should match the pods (a very common misconfiguration), we notice that:
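For illustration, here is a minimal, hypothetical sketch of that kind of check (not the exact code in `podmonitoring.go`; it assumes a client-go clientset and that the PodMonitoring selector is a standard `metav1.LabelSelector`):

```go
// Hypothetical sketch (not the exact podmonitoring.go code): report when a
// PodMonitoring selector does not match any pod in its namespace.
package analyzer

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// selectorMatchesNoPods returns true when the label selector matches no pods,
// which usually means a typo in the PodMonitoring selector labels.
func selectorMatchesNoPods(ctx context.Context, client kubernetes.Interface, namespace string, selector *metav1.LabelSelector) (bool, error) {
	sel, err := metav1.LabelSelectorAsSelector(selector)
	if err != nil {
		return false, err
	}
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: sel.String()})
	if err != nil {
		return false, err
	}
	return len(pods.Items) == 0, nil
}
```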
Another example: if someone has the correct `selector` labels, but the pods are not running:
Notice that the GMP analyzer can schedule another analyzer (!), more on that in "Extensions".
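A rough, hypothetical sketch of that case (the helper name is made up; the scheduling mechanism itself is sketched under "Extensions"):

```go
// Hypothetical helper: the selector matches pods, but some of them are not
// running, so the finding can suggest running the Pod analyzer as well.
package analyzer

import corev1 "k8s.io/api/core/v1"

func findNotRunningPods(pods []corev1.Pod) []string {
	var notRunning []string
	for _, p := range pods {
		if p.Status.Phase != corev1.PodRunning {
			notRunning = append(notRunning, p.Name)
		}
	}
	return notRunning
}
```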
Finally, when you misconfigure the `port` field (another common mistake, hard to validate):
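Again as a hypothetical sketch, the port check boils down to verifying that the configured endpoint port (a name or a number) actually exists on the matched pods' containers:

```go
// Hypothetical sketch: does any container of the pod expose the port that the
// PodMonitoring endpoint refers to (by name or by number)?
package analyzer

import corev1 "k8s.io/api/core/v1"

func podExposesPort(pod corev1.Pod, portName string, portNumber int32) bool {
	for _, c := range pod.Spec.Containers {
		for _, p := range c.Ports {
			if (portName != "" && p.Name == portName) ||
				(portNumber != 0 && p.ContainerPort == portNumber) {
				return true
			}
		}
	}
	return false
}
```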
Extensions
To make this work, with all the extras, I had to modify K8sGPT on top of the custom analyzer code (plus CR types). Perhaps some of these changes would be useful to consider, e.g. for others to use in their custom/built-in analyzers:
- Ability to select an exact resource (e.g. a pod) and analyze only that. Cherry-picked and fixed from https://github.com/k8sgpt-ai/k8sgpt/pull/483
- Verbose flag to print more "debug" logs and the prompt that will be used for the LLM.
- Ability for an analyzer to specify in `common.Failure` ANOTHER analyzer to run based on the error. For example, in our case we noticed the PodMonitoring was not working because the `Pod` was not running. Scheduling the pod analyzer can be helpful in this case and requires only a minor code change 💪🏽 (see the sketch after this list).
- The LLM explanation is now asked per Error/Failure, NOT only per Result. I was a bit confused why we run the LLM on multiple Failures within one Result; for my analyzer it made more sense to run the LLM on each Error/Failure. This allows better prompt customization, explained later on.
- Prompt templates now use Go templates.
- Full customization of the prompt per `common.Failure`. When creating a potential `common.Failure`, analyzers can adjust the full prompt using `common.Failure.CustomPromptTemplate`, as well as `common.Failure.AdditionalContextText` and `common.Failure.NextStepsText`. This allows much better LLM results for specific problems (a rough sketch of these fields follows this list).
- Not implemented here, but I started to explore `common.Failure.UsefulQuestions` logic, where an analyzer could schedule specific questions to the LLM and render those answers into the general LLM answer. This would allow asking specifically for e.g. `I see Pod X and PodMonitoring Y. Here are their YAMLs. Propose what's wrong here in 200 characters.` Otherwise it's impossible to get the LLM to give you that in one prompt. Not fully implemented, nor verified.
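To make the `common.Failure` extensions above more concrete, here is a rough sketch of how the extended struct and the Go-template-based prompt rendering could look. This is a simplified, hypothetical version: the field names follow this PR description, `NextAnalyzers` is a made-up name for the "schedule another analyzer" idea, and existing Failure fields are omitted.

```go
// Hypothetical sketch of the extended common.Failure and the Go-template-based
// prompt rendering; the exact shapes in the real code may differ.
package common

import (
	"bytes"
	"text/template"
)

type Failure struct {
	Text string // existing error text

	// Extensions explored in this PR (names as described above):
	CustomPromptTemplate  string   // full Go template overriding the default prompt
	AdditionalContextText string   // extra context appended to the prompt
	NextStepsText         string   // suggested next steps rendered into the prompt
	NextAnalyzers         []string // other analyzers to schedule based on this failure, e.g. "Pod" (hypothetical name)
}

// renderPrompt builds the per-Failure LLM prompt from a Go template.
func renderPrompt(f Failure) (string, error) {
	tmpl := f.CustomPromptTemplate
	if tmpl == "" {
		tmpl = "Explain the following Kubernetes problem: {{ .Text }}\n" +
			"Additional context: {{ .AdditionalContextText }}\n" +
			"Suggest next steps: {{ .NextStepsText }}"
	}
	t, err := template.New("prompt").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, f); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```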
Next Steps
- Chat about each feature to see if it makes sense / is useful
- Proper sanitization of those fields, cleanup of TODOs, etc.
- Custom plugins are essential here; we don't want to have/depend on some custom resources etc. Another idea, on top of those discussed on Slack, is to perhaps use `integrations` which, when deployed, could install another binary and use it for that analysis 🤔
Summary
Generally, writing a useful analyzer takes time and it's a bit fragile. However, it's super useful. In fact, from those specific findings you could even build a `--fix` flag to fix things in place.