
[kube-prometheus-stack] kube-apiserver.rules Error on ingesting out-of-order result from rule evaluation

Open landerss1 opened this issue 4 years ago • 21 comments

Describe the bug

In a k8s cluster v1.20.5, we deploy applications using ArgoCD and Helm charts. When we upgraded the kube-prometheus-stack chart to 17.2.1, we started seeing messages in the log of the form:

2021-08-19T09:24:53.993798106Z stderr F level=warn ts=2021-08-19T09:24:53.993Z caller=manager.go:617 component="rule manager" group=kube-apiserver.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=239

After a bit of investigation, we found that the rules in question were the ones defined in kube-apiserver-histogram.rules.yaml. The exact same rules are also defined in kube-apiserver.rules.yaml. So we tried removing the rules from kube-apiserver-histogram.rules.yaml and re-deployed Prometheus, which made the error go away; no out-of-order messages appeared in the logs after that. I can see from the history that the file kube-apiserver-histogram.rules.yaml was created on July 15th, and maybe the corresponding rules should have been removed from kube-apiserver.rules.yaml at that time to avoid duplicate rules?
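If anyone wants to verify the duplication on their own cluster, a rough check is to look for the duplicated record name under both rule groups in the generated PrometheusRule objects; a sketch, assuming the stack runs in a "monitoring" namespace (the namespace name is an assumption):

    # The "monitoring" namespace is an assumption; adjust to your install.
    # If the record shows up under both the kube-apiserver.rules and
    # kube-apiserver-histogram.rules groups, the chart is shipping it twice.
    kubectl get prometheusrules -n monitoring -o yaml \
      | grep -E 'name: kube-apiserver(-histogram)?\.rules|record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile'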

What's your helm version?

version.BuildInfo{Version:"v3.5.1", GitCommit:"32c22239423b3b4ba6706d450bd044baffdcf9e6", GitTreeState:"clean", GoVersion:"go1.15.7"}

What's your kubectl version?

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.8", GitCommit:"fd5d41537aee486160ad9b5356a9d82363273721", GitTreeState:"clean", BuildDate:"2021-02-17T12:41:51Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:02:01Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

Which chart?

kube-prometheus-stack

What's the chart version?

17.2.1

What happened?

The logs contained a lot of messages about out-of-order metrics for the kube-apiserver histogram rules and a record named cluster_quantile:apiserver_request_duration_seconds:histogram_quantile.

What you expected to happen?

No out-of-order metrics from kube-apiserver should appear in the logs.

How to reproduce it?

If the problem is caused by the duplicate rules, it should happen everywhere by just installing the chart.

Enter the changed values of values.yaml?

No response

Enter the command that you execute that is failing/misfunctioning.

Anything else we need to know?

We performed the upgrade of the chart in two different k8s clusters, and the error only happened in one of them, even though both clusters run the same k8s version, v1.20.5.

As stated above, we deploy Helm charts using ArgoCD.

landerss1 avatar Aug 24 '21 07:08 landerss1

We're seeing this same error.

stevehipwell avatar Sep 07 '21 12:09 stevehipwell

Seeing the same issue. Curious, @avamonitoring, did you find a way to just block out the offending rules without forking the whole repo?

davideshay avatar Oct 05 '21 14:10 davideshay

We're seeing it as well, and it's probably a consequence of https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/632

The kube-apiserver rules have been split into multiple groups, but when the rule files were imported into this chart, the old "kube-apiserver.rules" file was not removed, so its rules are duplicated, causing the out-of-order error.

Setting .Values.defaultRules.rules.kubeApiserver to false fixes it for us (as it only disables "kube-apiserver.rules").
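For anyone else applying that workaround, a minimal values.yaml override, assuming the standard kube-prometheus-stack values layout:

    # Disables only the old, duplicated "kube-apiserver.rules" group;
    # the split kube-apiserver-*.rules groups stay enabled.
    defaultRules:
      rules:
        kubeApiserver: false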

However, with the default values the duplication is still present, so I'll open a PR to remove the file.

thmslx avatar Oct 07 '21 16:10 thmslx

As you look to fix this, it looks like not all of the "replacement" kube-apiserver-xxxx.rules files have the same sort of guard that would let them be enabled/disabled from the Helm values.yaml. For instance, kube-apiserver-availability has a test like the one below:

    {{- if and (semverCompare ">=1.14.0-0" $kubeTargetVersion) (semverCompare "<9.9.9-9" $kubeTargetVersion) .Values.defaultRules.create .Values.kubeApiServer.enabled .Values.defaultRules.rules.kubeApiserverAvailability }}

But in kube-apiserver-histogram no such per-group value appears in the conditional:

    {{- if and (semverCompare ">=1.14.0-0" $kubeTargetVersion) (semverCompare "<9.9.9-9" $kubeTargetVersion) .Values.defaultRules.create }}

It doesn't exist in kube-apiserver-burnrate either.
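Following the pattern from the availability file, a per-group guard for the histogram file would presumably look like the line below; note that the kubeApiserverHistogram value name is an assumption on my part, not something the chart defines at this version:

    {{- /* hypothetical guard: "kubeApiserverHistogram" is an assumed value name */ -}}
    {{- if and (semverCompare ">=1.14.0-0" $kubeTargetVersion) (semverCompare "<9.9.9-9" $kubeTargetVersion) .Values.defaultRules.create .Values.kubeApiServer.enabled .Values.defaultRules.rules.kubeApiserverHistogram }}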

davideshay avatar Oct 07 '21 18:10 davideshay

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Nov 07 '21 01:11 stale[bot]

We're also seeing our clusters affected by this. I can confirm that setting .Values.defaultRules.rules.kubeApiserver to false fixes it.

tongpu avatar Nov 16 '21 10:11 tongpu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Dec 16 '21 13:12 stale[bot]

Is there any update?

jjmengze avatar Dec 23 '21 05:12 jjmengze

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Jan 22 '22 08:01 stale[bot]

Hello there!

k8s: 1.20
kube-prometheus-stack: 19.3.0

level=warn ts=2022-01-24T13:52:22.378Z caller=manager.go:651 component="rule manager" group=kube-apiserver-burnrate.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=1

serhiiromaniuk avatar Jan 24 '22 13:01 serhiiromaniuk

I opened another issue (#1799) to discuss the duplicate rules, before I saw this one.

Is there a reason why we need to keep the old kube-apiserver.rules?

apricote avatar Feb 14 '22 07:02 apricote

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Mar 17 '22 09:03 stale[bot]

I think this issue still applies.

alexppg avatar Mar 22 '22 10:03 alexppg

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Apr 24 '22 13:04 stale[bot]

Shouldn't go stale.

alexppg avatar Apr 25 '22 07:04 alexppg

We are still seeing this with the Prometheus chart embedded in rancher-monitoring (rancher-monitoring-100.1.2+up19.0.3). After repeated errors in the logs the pod gets OOMKilled and restarted. Could this be related, or is it a different issue?

I will try to deactivate the default rules, but then we won't get any metrics from our KubeApi by default anymore and will have to define our own rules, correct?

staedter avatar May 27 '22 07:05 staedter

@staedter I don't believe the OOM is related. Also, if you disable the rules you won't have the alerts, but you'll still have the metrics, so you only have to create the alerts you need yourself.
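If you do need to add rules back after disabling the default group, the chart accepts custom rules through values; a minimal sketch, assuming kube-prometheus-stack's additionalPrometheusRulesMap value, with a placeholder alert rather than the exact upstream rules:

    # The group, alert name, and expression below are placeholders for illustration,
    # not the upstream kube-apiserver rules.
    additionalPrometheusRulesMap:
      custom-apiserver-rules:
        groups:
          - name: custom-apiserver.rules
            rules:
              - alert: ApiserverHighErrorRate
                expr: sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m])) > 0.05
                for: 10m
                labels:
                  severity: warning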

alexppg avatar May 30 '22 16:05 alexppg

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Jul 07 '22 00:07 stale[bot]

Don't close it please.

alexppg avatar Jul 07 '22 14:07 alexppg

I seem to have started running into this issue as well with rancher-monitoring-100.1.2+up19.0.3. The thing is, it seems to have happened over time, not right away. If I kill the Prometheus pod, it stops for something like 12-24 hours but then returns. Is there a fix?

tsrats avatar Aug 04 '22 17:08 tsrats

For anyone still trying to resolve this issue, it has already been fixed in #2076. Upgrade to a helm chart version greater than 35.4.2 to resolve this.
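For example, an upgrade could look like the following; the release name and namespace here are assumptions for illustration, so adjust them to your install:

    # Release name "kube-prometheus-stack" and namespace "monitoring" are assumptions.
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    # The latest chart version is newer than 35.4.2 and includes the fix;
    # pin a specific version with --version if you need to.
    helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
      --namespace monitoring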

we10710aa avatar Sep 05 '22 09:09 we10710aa

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Oct 12 '22 05:10 stale[bot]

This issue is being automatically closed due to inactivity.

stale[bot] avatar Oct 30 '22 15:10 stale[bot]

This issue is being automatically closed due to inactivity.

stale[bot] avatar Nov 22 '22 23:11 stale[bot]