kube-state-metrics Current status of sharding functionality

Hi, we are planning to take a dependency on the sharding capabilities of kube-state-metrics and have a few questions before we do so. What is the current status of the sharding functionality? Is it still in preview, or is it ready for production use? I see a note that says "Sharding should be used carefully and additional monitoring should be set up in order to ensure that sharding is set up and functioning as expected (eg. instances for each shard out of the total shards are configured)." Is this only about network traffic and resource consumption? Or does it mean that sharding can have potential issues and should be carefully monitored?
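
To make the quoted note concrete, here is a minimal sketch of the kind of check it seems to describe: given the shard index and total that each running instance reports, find the shards that nobody is serving. The function name and the input shape are hypothetical, and how the pairs are collected (for example from pod arguments) is left open.

```python
def missing_shards(observed, total_shards):
    """Return the shard indices in [0, total_shards) that no instance serves.

    `observed` holds one (shard_index, reported_total) pair per running
    instance. This input shape is an assumption made for the sketch.
    """
    observed = list(observed)
    # Every instance must agree on the total, otherwise coverage is undefined.
    mismatched = [pair for pair in observed if pair[1] != total_shards]
    if mismatched:
        raise ValueError(f"instances disagree on total shards: {mismatched}")
    seen = {shard for shard, _ in observed}
    return sorted(set(range(total_shards)) - seen)


# Shard 1 of 3 has no running instance in this example.
print(missing_shards([(0, 3), (2, 3)], 3))
```

An alert on a non-empty result would be one way to notice that part of the cluster has silently stopped producing metrics.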

Also, I see: "To optimize this further, the Kubernetes API would need to support sharded list/watch capabilities. In the optimal case, memory consumption for each shard will be 1/n compared to an unsharded setup." Currently this doesn't exist. Is there any work happening in this area to help with kube-state-metrics sharding?

Thanks in advance!

Update: adding some more information about the limitations of sharding.

Sharding doesn't provide high availability. If one of the deployment pods goes down, the resources it was responsible for exposing metrics for are not reassigned to another running shard. The only way to get HA with k-s-m is to run 2 instances of the k-s-m deployment, which increases resource usage and can introduce inconsistencies: the service will route traffic to either deployment, so metrics may be out of sync between deployments. An ideal way to provide HA would be an operator that looks for the active, healthy deployment and assigns metric targets dynamically.
Also, using sharding with a static deployment means we need to know the number of instances before we deploy, and it doesn't allow for dynamic scale-out and scale-in.
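
The availability gap can be illustrated with a small simulation. This is a sketch under stated assumptions: objects are assigned to shards by a stable hash of their UID taken modulo the shard count, with SHA-256 as a stand-in (the hashing kube-state-metrics actually uses may differ), and the helper names and the `uid-N` values are made up for the example.

```python
import hashlib


def shard_for(uid: str, total_shards: int) -> int:
    """Map an object UID to a shard index (illustrative hashing only)."""
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % total_shards


def covered_uids(uids, total_shards, healthy_shards):
    """Return the UIDs whose owning shard is currently healthy."""
    return [u for u in uids if shard_for(u, total_shards) in healthy_shards]


uids = [f"uid-{i}" for i in range(1000)]
all_up = covered_uids(uids, 2, {0, 1})
one_down = covered_uids(uids, 2, {0})  # shard 1 is down, nothing takes over
print(len(all_up), len(one_down))
```

Because the assignment is static, the objects owned by the failed shard simply drop out until that shard comes back; nothing in this scheme hands them to the surviving instance.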

Jun 19 '25 18:06 rashmichandrashekar

How many nodes and pods do you have?

Jun 20 '25 17:06 CatherineF-dev

Our largest customer has ~2200 nodes and roughly 65k pods. But won't that also depend on the other resources that we are collecting metrics for?

Jun 20 '25 18:06 rashmichandrashekar

Also, a follow-up on the sharding configuration itself: is the only supported scenario horizontal sharding via automated sharding? Is running 2 deployment instances with shard=0 and shard=1 and total-shards=2 an explicitly supported scenario?
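
For concreteness, the two-instance scenario in question could be written out as below. This is only a sketch of the arguments each instance would receive, assuming the `--shard` and `--total-shards` flags; the `shard_args` helper is hypothetical and says nothing about whether the setup is officially supported.

```python
def shard_args(total_shards: int) -> list[list[str]]:
    """Build one argument list per instance for a manually sharded setup."""
    if total_shards < 1:
        raise ValueError("total_shards must be at least 1")
    return [
        [f"--shard={i}", f"--total-shards={total_shards}"]
        for i in range(total_shards)
    ]


# One line of arguments per deployment instance.
for args in shard_args(2):
    print(" ".join(args))
```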

Jun 20 '25 19:06 rashmichandrashekar

@CatherineF-dev - Gentle ping :) Could you please help get these questions answered?

Jun 23 '25 17:06 rashmichandrashekar

/assign @CatherineF-dev

Catherine, since you have already had a look at this, could you follow up on it please? Thanks!

Jun 26 '25 16:06 richabanker

@richabanker - Is there a SIG meeting happening currently that I can join? The invite on the GitHub repo doesn't seem to work. Is there an updated one?

Jun 26 '25 16:06 rashmichandrashekar

Ah, missed this. Yes, the triage meeting happened today.

> The invite on the GitHub repo doesn't seem to work.

Oh, are you referring to the invite here?

Jun 26 '25 17:06 richabanker

Ah yeah, it's pointing to the wrong invite. The Agenda doc has the right one, though.

Updating the one mentioned on https://github.com/kubernetes/community/tree/master/sig-instrumentation

Jun 26 '25 17:06 richabanker

@richabanker @rashmichandrashekar Was there any discussion on this? Or is it planned for the next meeting?

I also have similar concerns since we're looking to adopt sharding in our clusters. The primary reason is that the Prometheus instances scraping KSM data are sharded, and having only one shard of KSM leads to an uneven distribution of metrics across them.

Jul 03 '25 07:07 PrayagS

@PrayagS there wasn't any discussion on this afaik. Please feel free to add a new topic to the Agenda doc linked above and it should be discussed in the next SIG meeting. Thanks

Jul 08 '25 16:07 richabanker

/triage accepted

Jul 10 '25 16:07 dgrisonnet

> @PrayagS there wasn't any discussion on this afaik. Please feel free to add a new topic to the Agenda doc linked above and it should be discussed in the next SIG meeting. Thanks

Thanks for getting back.

We actually ended up deploying manual sharding across all of our clusters last week. Requests to the API server have increased as expected but it's insignificant in comparison.

I'll reach out here or attend the meeting in case we face any issues in the coming months.

Jul 14 '25 00:07 PrayagS