helm-charts
[kube-prometheus-stack] kube-apiserver.rules Error on ingesting out-of-order result from rule evaluation
Describe the bug
In a k8s cluster v1.20.5, we deploy applications using ArgoCD and Helm charts. When we upgraded the kube-prometheus-stack chart to 17.2.1, we started seeing messages in the log of the form:
2021-08-19T09:24:53.993798106Z stderr F level=warn ts=2021-08-19T09:24:53.993Z caller=manager.go:617 component="rule manager" group=kube-apiserver.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=239
After a bit of investigation, we found that the rules in question were the ones defined in kube-apiserver-histogram.rules.yaml. The exact same rules are also defined in kube-apiserver.rules.yaml. So we tried removing the rules from kube-apiserver-histogram.rules.yaml and re-deployed Prometheus, which made the error go away; no out-of-order messages appeared in the logs after that. I can see from the history that the file kube-apiserver-histogram.rules.yaml was created July 15th, and maybe the corresponding rules should have been removed from kube-apiserver.rules.yaml at that time to avoid duplicate rules?
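To confirm the duplication in a running cluster, something like the following can help (just a sketch; the "monitoring" namespace is an assumption and may differ in your install):

```console
# List the kube-apiserver rule groups generated by the chart; seeing both
# kube-apiserver.rules and kube-apiserver-histogram.rules loaded means the
# same histogram recording rules are defined twice at this chart version.
kubectl get prometheusrules -n monitoring -o yaml \
  | grep -E 'name: kube-apiserver(-histogram)?\.rules'
```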
What's your helm version?
version.BuildInfo{Version:"v3.5.1", GitCommit:"32c22239423b3b4ba6706d450bd044baffdcf9e6", GitTreeState:"clean", GoVersion:"go1.15.7"}
What's your kubectl version?
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.8", GitCommit:"fd5d41537aee486160ad9b5356a9d82363273721", GitTreeState:"clean", BuildDate:"2021-02-17T12:41:51Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"windows/amd64"} Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:02:01Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Which chart?
kube-prometheus-stack
What's the chart version?
17.2.1
What happened?
The logs contained a lot of messages about out of order metrics for kube-apiserver histogram and a record named cluster_quantile:apiserver_request_duration_seconds:histogram_quantile.
What you expected to happen?
No out of order metrics from kube-apiserver should appear in the logs.
How to reproduce it?
If the problem is caused by the duplicate rules, it should happen everywhere by just installing the chart.
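A way to check this without a cluster is to render the chart locally and see under which rule groups the record from the log is emitted (a sketch; the release name "kps" is arbitrary):

```console
# Render the chart at the affected version and list the kube-apiserver rule
# group names together with every occurrence of the duplicated record; the
# record showing up under more than one group is the duplication.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm template kps prometheus-community/kube-prometheus-stack --version 17.2.1 \
  | grep -E 'name: kube-apiserver.*\.rules|record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile'
```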
Enter the changed values of values.yaml?
No response
Enter the command that you execute that is failing/misfunctioning.
Anything else we need to know?
We performed the upgrade of the chart in two different k8s clusters, and the error only happened in one of them, even though both clusters run the same k8s version, v1.20.5.
As stated above, we deploy Helm charts using ArgoCD.
We're seeing this same error.
Seeing the same issue. Curious, @avamonitoring, did you find a way to just block out the offending rules without forking the whole repo?
We're seeing it as well, and it's probably a consequence of https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/632
The kube-apiserver rules have been split into multiple groups, but when the rule files were imported into this chart, the old "kube-apiserver.rules" file was not removed, so its rules are duplicated, causing the out-of-order error.
Setting .Values.defaultRules.rules.kubeApiserver to false fixes it for us, as it only disables "kube-apiserver.rules" (see the snippet below).
However, with the default values this duplication exists, so I'll open a PR to remove the file.
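For reference, the same workaround expressed as a values override (a sketch of just the relevant keys):

```yaml
# values.yaml: disable the old kube-apiserver.rules group that duplicates
# the new kube-apiserver-*.rules groups; the rest of the file is unchanged.
defaultRules:
  rules:
    kubeApiserver: false
```

or, equivalently, `--set defaultRules.rules.kubeApiserver=false` on the Helm command line.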
As you look to fix this, note that not all of the "replacement" kube-apiserver-xxxx.rules files have the same guard that would let them be enabled/disabled from the Helm values.yaml. For instance, kube-apiserver-availability has a test like the one below:
{{- if and (semverCompare ">=1.14.0-0" $kubeTargetVersion) (semverCompare "<9.9.9-9" $kubeTargetVersion) .Values.defaultRules.create .Values.kubeApiServer.enabled .Values.defaultRules.rules.kubeApiserverAvailability }}
But in kube-apiserver-histogram, no such conditional exists:
{{- if and (semverCompare ">=1.14.0-0" $kubeTargetVersion) (semverCompare "<9.9.9-9" $kubeTargetVersion) .Values.defaultRules.create }}
It doesn't exist in kube-apiserver-burnrate either.
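For illustration, a guard of the same shape for the histogram file might look like this; the value name kubeApiserverHistogram is purely hypothetical and does not exist in the chart's values.yaml at this version:
{{- if and (semverCompare ">=1.14.0-0" $kubeTargetVersion) (semverCompare "<9.9.9-9" $kubeTargetVersion) .Values.defaultRules.create .Values.defaultRules.rules.kubeApiserverHistogram }}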
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
We're also seeing our clusters affected by this. I can confirm that setting .Values.defaultRules.rules.kubeApiserver to false fixes it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
Is there any update?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
Hello there!
k8s: 1.20 kube-prometheus-stack: 19.3.0
level=warn ts=2022-01-24T13:52:22.378Z caller=manager.go:651 component="rule manager" group=kube-apiserver-burnrate.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=1
I opened another issue (#1799) to discuss the duplicate rules, before I saw this one.
Is there a reason why we need to keep the old kube-apiserver.rules?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
I think this issue still applies.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
Shouldn't go stale.
We are still seeing this with the Prometheus chart embedded in rancher-monitoring (rancher-monitoring-100.1.2+up19.0.3). After repeated errors in the logs the pod gets OOMKilled and restarted. Could this be related, or is it a different issue?
I will try to deactivate the default rules, but then we will not get any metrics from our kube API by default anymore and will have to define our own rules, correct?
@staedter I don't believe that the OOM is related. Also, if you disable the rules you won't have the alerts/recording rules, but you'll still have the metrics, so you would only have to create the rules you need yourself.
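If you disable the default group but still want a particular recording rule, the chart's additionalPrometheusRulesMap value can carry your own group. A minimal sketch, where the group name is a placeholder and the expression mirrors roughly what the old rule computed rather than the chart's exact definition:

```yaml
# values.yaml: ship a custom rule group alongside the chart's defaults.
# "custom-apiserver" and the single rule below are illustrative only.
additionalPrometheusRulesMap:
  custom-apiserver:
    groups:
      - name: custom-apiserver.rules
        rules:
          - record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
            expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without (instance, pod))
            labels:
              quantile: "0.99"
```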
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
Don't close it please.
I seem to have started running into this issue as well with rancher-monitoring-100.1.2+up19.0.3. The thing is, it seems to have happened over time, not right away. If I kill the Prometheus pod, it will stop for 12-24 hours but then return. Is there a fix?
For anyone still trying to resolve this issue, it is already fixed in #2076. Upgrade to a helm chart version greater than 35.4.2 to resolve this.
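For anyone applying that, the upgrade itself is the usual Helm flow (a sketch; the release name "kps" and the "monitoring" namespace are assumptions, add your usual values flags):

```console
# Refresh the chart index and move the release to a chart version newer
# than 35.4.2, which includes the fix from #2076.
helm repo update
helm upgrade kps prometheus-community/kube-prometheus-stack -n monitoring --reuse-values
```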
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.