helm-charts icon indicating copy to clipboard operation
helm-charts copied to clipboard

[kube-state-metrics] Strategy RollingUpdate causes 'duplicate' metrics

Open wiardvanrij opened this issue 6 months ago • 7 comments

Is your feature request related to a problem ?

Metrics from KSM have labels like instance and pod that are filled by the KSM pod in question. Which means that when one uses RollingUpdate, it happens that you get duplicate metrics. For example (ommited some labels for clarity:

kube_deployment_status_replicas_available{instance="x:8080"} 1
kube_deployment_status_replicas_available{instance="y:8080"} 1

and thus any query that doesn't utilises max by (pod and/or instance) will provide a 'bad' result.

I personally think the default should be recreate. - Yes, it will provide a gap in data during the time you make a new release, however the trade off is IMHO better to have a gap which doesn't really hurt, than having actual visual 'bad' data.

The screenshot here shows the difference

image

The spike is a restart with RollingUpdate, and the small gap is a restart with Recreate

Describe the solution you'd like.

Make recreate rollout strategy the default and give guidance on how to prevent it otherwise.

Describe alternatives you've considered.

NONE

Additional context.

No response

wiardvanrij avatar Dec 18 '23 17:12 wiardvanrij

Hi @wiardvanrij,

Thank you for opening this issue. While I understand your point of view, changing the default behaviour should be avoided when it's not a requirement. Users can already configure the update strategy that works best for their needs. For this reason, and because RollingUpdate is the default for Kubernetes, I would keep the current strategy.

If you want, we can still use your PR to improve the documentation in charts/kube-state-metrics/values.yaml.

Please refer to my comments in https://github.com/prometheus-community/helm-charts/pull/4096

dotdc avatar Dec 18 '23 20:12 dotdc

@dotdc Yea I understand.

If I may argue tho; I personally think the out of the box experience for users is going to be much better with Recreate. I looked at all default dashboards + rules that are created against KSM, and quite frankly they are "wrong". I.e. the point that during a new rollout, one would just get bad results in the panels.

RollingUpdate is indeed the default, which is also really the best solution, yet it does assume the application in question handles it. Which is sort of my point; KSM doesn't deal with that. I.e.;

image

So that just means all dashboards/panels need to alter their query or.. KSM should be able to run in HA. Potentially by omitting some labels that make the series unique / some other solution?

wiardvanrij avatar Dec 19 '23 14:12 wiardvanrij

p.s. Updated the MR to only reflect the doc update

wiardvanrij avatar Dec 19 '23 15:12 wiardvanrij

I actually had a conversation on the CNCF slack: https://cloud-native.slack.com/archives/C167KFM6C/p1702998798043709?thread_ts=1702895323.133249&cid=C167KFM6C

Basically the gist there is:

https://github.com/prometheus-community/helm-charts/blob/ef3fe76dbb88ce712eaae22e8e14a868d41df061/charts/kube-state-metrics/values.yaml#L35-L40

# Change the deployment strategy when autosharding is disabled
# updateStrategy: Recreate

So.. perhaps we should do something automatically:

if autosharding.enabled == false -> updateStrategy: Recreate

?

wiardvanrij avatar Dec 19 '23 15:12 wiardvanrij

That was an interesting thread, thanks for sharing it!

I think this kind of information is valuable in the values descriptions, but I'm still not convinced that we should change the default or make this value dynamic. In some cases, having duplicated data is better than no data, because you can correlate the information using another metric/query, which is probably what you did when debugging this issue. I also agree that a small gap will not hurt in most cases, but it could hide some valuable information in others, which could potentially make it way harder to solve something else (devil's advocate).

I think the discussion now is: is it relevant to make this setting dynamic or not. mrueg is probably on holiday, but I'll tag him next week to see what he thinks about this.

In the meantime, I wish you a happy New Year's Eve!

dotdc avatar Dec 30 '23 16:12 dotdc

cc @mrueg What do you think about this?

dotdc avatar Jan 08 '24 13:01 dotdc

Heh, well regarding:

In some cases, having duplicated data is better than no data

That's actually a 'funny' one, because while in essence I would agree, the truth is that if you look at any publicly available Grafana dashboard or even the alerting rules defined in this repo, none take that in account (there is no sum/max by it's unique identifier labels). Which is exactly what happend for us, as people somewhat 'panicked' due to seeing duplicate numbers (not just the duplication itself). Which just happens once you have a 'stat' that is grouped by some label (like cluster).

My point is; It would be absolutely true, if the queries then would account for that behaviour, which just isn't the 'market standard' at this point. For which I'm also somewhat surprised I'm the only one bringing it up now :D

P.s. I've updated the MR to just improve the docs like you asked, I also think that's a fine improvement regardless of what we pick further.

Thanks!

wiardvanrij avatar Jan 08 '24 14:01 wiardvanrij