fix(discovery): configure sharding every time MetricsHandler.Run runs
What this PR does / why we need it:
I'm seeing the following issue in a ksm deployment with a custom resource (CRD) config enabled: whenever a new CRD event is added to the cache, ksm stops watching and the metrics stop reflecting new changes.
In the graph, ksm only recovers after the StatefulSet is updated with any change. The metrics shown are rate(kube_state_metrics_watch_total) and the counter kube_state_metrics_custom_resource_state_add_events_total (without rate).
Looking into the code, the problem seems to be the shardingUnchanged check in AddFunc (here).
Without a CRD config, MetricsHandler.Run runs only once, and m.curShard and m.curTotalShards still hold their zero values, which makes shardingUnchanged = false (here).
However, if a CRD config is present, discovery runs MetricsHandler.Run every time a CRD event is detected (here). If the StatefulSet's number of replicas/shards didn't change, the new CRD event cancels the old metrics handler but never starts a new one, because shardingUnchanged = true in AddFunc.
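For illustration, here is a minimal, self-contained sketch of that sequence. The type and method names (handler, onPodAdd, configureSharding) are hypothetical stand-ins for MetricsHandler, the pod informer's AddFunc, and ConfigureSharding; this is a sketch of the interaction described above, not the actual kube-state-metrics code.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// handler is a stand-in for MetricsHandler: it remembers the sharding it was
// last configured with and owns the context of the currently running stores.
type handler struct {
	mtx            sync.RWMutex
	curShard       int32
	curTotalShards int
	cancel         context.CancelFunc
}

// onPodAdd mirrors the AddFunc check quoted further down in this thread.
func (h *handler) onPodAdd(ctx context.Context, shard int32, totalShards int) {
	h.mtx.RLock()
	shardingUnchanged := h.curShard == shard && h.curTotalShards == totalShards
	h.mtx.RUnlock()
	if shardingUnchanged {
		fmt.Println("sharding unchanged, skipping reconfigure") // nothing is rebuilt
		return
	}
	h.configureSharding(ctx, shard, totalShards)
}

// configureSharding stands in for ConfigureSharding: cancel the previous stores
// and build new ones for the given shard layout.
func (h *handler) configureSharding(ctx context.Context, shard int32, totalShards int) {
	h.mtx.Lock()
	defer h.mtx.Unlock()
	if h.cancel != nil {
		h.cancel()
	}
	_, h.cancel = context.WithCancel(ctx)
	h.curShard, h.curTotalShards = shard, totalShards
	fmt.Printf("stores rebuilt for shard %d of %d\n", shard, totalShards)
}

func main() {
	h := &handler{}
	ctx := context.Background()

	// First Run: curShard/curTotalShards still hold their zero values, so the
	// check is false and the stores get built.
	h.onPodAdd(ctx, 0, 2)

	// A CRD event: discovery cancels the running handler and calls Run again
	// with an unchanged shard layout. The old stores are gone, but AddFunc
	// short-circuits, so nothing is rebuilt and ksm stops watching.
	h.cancel()
	h.onPodAdd(ctx, 0, 2)
}
```

The second call shows the stuck state: the previous stores have already been cancelled, but the early return keeps anything new from being built.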
~~This change removes the check shardingUnchanged in the AddFunc event handler. I don't think it's necessary because, in most cases, it's only called when the informer is synced at the end of MetricsHandler.Run.~~
This change updates CRDiscoverer.PollForCacheUpdates to reconfigure sharding in the already running metrics handler, instead of running a new one every time a CRD event occurs.
How does this change affect the cardinality of KSM:
No change in cardinality.
Which issue(s) this PR fixes:
Fixes https://github.com/kubernetes/kube-state-metrics/issues/2372
Welcome @wallee94!
It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.
You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.
You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.
You may want to refer to our testing guide if you run into trouble with your tests not passing.
If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!
Thank you, and welcome to Kubernetes. :smiley:
/assign @CatherineF-dev
/triage accepted
Hello, this code has been around for five years. Why is this issue only showing up now? Could there be another underlying problem? If we can figure out how to run it only once inside https://github.com/kubernetes/kube-state-metrics/pull/1851/files, that would be the fix.
```go
m.mtx.RLock()
shardingUnchanged := m.curShard == shard && m.curTotalShards == totalShards
m.mtx.RUnlock()
if shardingUnchanged {
	return
}
```
Sorry, I mentioned it in a thread in the kube-state-metrics Slack channel and forgot to put it here.
The bug isn't exactly in metrics_handler if Run runs only once, which is how it works without a CR config. The issue is that discovery started running it multiple times in this PR when there's a CR config.
Removing the validation in AddFunc made sense to me to reconfigure every time the index informer is initially synced. I don't think it should depend on the previous m.curShard and m.curTotalShards values of earlier runs.
> The issue is that discovery started running it multiple times in this PR when there's a CR config.
Thanks for spotting this! I'm wondering whether we should run it only once.
That's a good point; I can look into that. I think m.Run could probably be just m.ConfigureSharding, but discovery doesn't have access to the StatefulSet or the k8s client to calculate shard and totalShards.
On the other hand, if Run runs only once, the shardingUnchanged check touched in this change is usually false, because m.curShard and m.curTotalShards are initially 0.
I've made some changes to use m.ConfigureSharding in discovery and to run m.Run only once. I'm testing it on a test cluster, and so far it's working fine. I'll let it run for a few more hours and push it tomorrow if everything looks good.
@CatherineF-dev I added new changes to run m.Run only once when the pod starts, and to reconfigure sharding in discovery instead of recreating the IndexInformer. I also removed some of the contexts in PollForCacheUpdates because m.ConfigureSharding already cancels the previous context.
I deployed the change to a few clusters and it seems to be working. This is the watch rate after adding a new CRD to the cluster:
I see the event, then a brief drop, which is when ksm is populating the cache after the reconfigure, and then it comes back to normal. All the metrics look good as well.
A summary of changes per file to help with the review:
- internal/discovery/discovery.go: use m.BuildWriters(ctx) instead of m.Run(olderContext) to rebuild the writers instead of recreating the whole handler. BuildWriters cancels the old context, so we no longer need the cancellations in this file.
- pkg/metricshandler/metrics_handler.go: add a BuildWriters function that rebuilds the metrics writers. We use it in discovery.go when the resources in the StoreBuilder have changed.
- tests/e2e/discovery_test.go: update to test custom metrics after updating a CRD. I broke this into subfunctions because it failed the cyclomatic complexity linter.
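For reviewers, here is a minimal, self-contained sketch of the idea behind the new flow. The handler type and buildWriters are hypothetical stand-ins for MetricsHandler and the BuildWriters method added in this PR; the real signatures live in pkg/metricshandler/metrics_handler.go.

```go
package main

import (
	"context"
	"fmt"
)

// handler is a stand-in for MetricsHandler. It keeps running across CRD events;
// only its writers are rebuilt when the resource list changes.
type handler struct {
	cancel    context.CancelFunc
	resources []string // stands in for the resources configured in the StoreBuilder
}

// buildWriters sketches the idea behind BuildWriters: cancel the previous build's
// context and rebuild the metric writers for the current resource list. Because
// the cancellation happens here, discovery no longer needs to manage it itself.
func (h *handler) buildWriters(ctx context.Context) {
	if h.cancel != nil {
		h.cancel()
	}
	_, h.cancel = context.WithCancel(ctx)
	fmt.Println("writers rebuilt for:", h.resources)
}

func main() {
	ctx := context.Background()
	h := &handler{resources: []string{"pods", "deployments"}}

	// Initial build when MetricsHandler.Run starts (sharding is configured once).
	h.buildWriters(ctx)

	// A CRD event seen by PollForCacheUpdates: update the resource list and rebuild
	// the writers on the already-running handler instead of cancelling it and
	// starting a new Run.
	h.resources = append(h.resources, "mycustomresources")
	h.buildWriters(ctx)
}
```

Because the handler itself never restarts on CRD events, the shardingUnchanged check in AddFunc no longer sits in that path, which avoids the stuck state described in the issue.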
Is there an ETA for when this will get merged?
/lgtm
I'll still ping the other maintainers to review; everyone seems to be busy at the moment.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mrueg, wallee94
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
Needs approval from an approver in each of these files:
- ~~OWNERS~~ [mrueg]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/hold cancel
Continuing here, as no other reviews came in.
Thanks for the debugging and your contribution!