vmagent: update sharding implementation
Currently, the operator creates a deployment/statefulset per shard. This introduces additional overhead for managing multiple entities and is a legacy behavior.
At the moment, vmagent can read shard information from the statefulset pod name (pod-0, pod-1, pod-2), and we have to use this approach for agent sharding.
How it would work:
- `spec.shardCount` configures the number of shards: `statefulset.replicaCount=spec.shardCount`, `promscrape.cluster.membersCount=spec.shardCount`.
- `spec.replicaCount` configures the number of replicas for a given shard: `promscrape.cluster.replicationFactor=spec.replicaCount`.
- each pod gets the param `promscrape.cluster.memberNum=pod-name-{0}`.
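For illustration, a minimal sketch of this mapping as a VMAgent CR, assuming the `operator.victoriametrics.com/v1beta1` field names used above and a hypothetical CR named `example`:

```yaml
# Hypothetical VMAgent CR: 3 shards, each scrape target replicated to 2 members.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: example
spec:
  shardCount: 3     # -> statefulset replicas and -promscrape.cluster.membersCount=3
  replicaCount: 2   # -> -promscrape.cluster.replicationFactor=2
  remoteWrite:
    - url: http://vmsingle-example:8429/api/v1/write
```

Each pod would then derive `-promscrape.cluster.memberNum` from its own pod name ordinal (e.g. pod `vmagent-example-0` -> member 0).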
There are three ways to implement vmagent sharding under statefulset mode:
- p1 (the current way): for each shard, an independent deployment/statefulset with the according `-promscrape.cluster.membersCount` arg. Thus `vmagent.replicaCount` -> statefulset replicas, `vmagent.shardCount` -> number of statefulsets, and `promscrape.cluster.replicationFactor` is not involved (it can be configured via `vmagent.extraArgs`).
- p2 (proposed in this issue): create only one statefulset for each vmagent CR, with `vmagent.shardCount` -> statefulset replicas and `vmagent.replicaCount` -> `promscrape.cluster.replicationFactor`.
- p3: similar to p2, but keep `vmagent.replicaCount` -> number of workloads, with `replicationFactor` still configured via `extraArgs`. Thus `vmagent.shardCount` -> statefulset replicas, `vmagent.replicaCount` -> number of statefulsets (the reverse of p1). p3 is proposed because with p2 we can no longer have multiple instances for each shard member under one vmagent CR; this could be a breaking change for current users and may still be requested by users.
For example, with `vmagent.replicaCount = 2` and `vmagent.shardCount = 3`, the three approaches would look like this:
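A worked comparison derived from the mappings above (assuming statefulset mode everywhere):

| option | workloads | replicas per workload | total pods | replicationFactor |
| --- | --- | --- | --- | --- |
| p1 | 3 (one per shard) | 2 | 6 | via `extraArgs` |
| p2 | 1 | 3 (one pod per shard) | 3 | 2 (from `replicaCount`) |
| p3 | 2 | 3 (one pod per shard) | 6 | via `extraArgs` |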
With p2 & p3, there are two more things that need to be considered:
- podAffinity: with p1, we gain the ability to schedule pods with the same shard num on different nodes. But with p2 & p3, each statefulset is a complete shard cluster, so it will be hard to achieve that (see the sketch after this list).
- updates without data loss: if users want to upgrade vmagent without losing data, they can set `spec.replicas > 1`. With p1, data won't be lost because there is more than one instance for each shard and they can do a rolling update. With p3, it will be uncontrollable.

Also, p2 & p3 can only be implemented under statefulset mode; deployment mode would have to keep using p1 because a random pod name can't be used as `cluster.memberNum`, which leads to inconsistency between the two modes.
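To make the podAffinity point concrete, a minimal sketch of the anti-affinity rule p1 allows per shard (the `shard-num` label is hypothetical and only exists because each shard is its own workload under p1):

```yaml
# Hypothetical pod template snippet for the workload of shard 0 under p1:
# replicas of the same shard repel each other across nodes.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: vmagent
            shard-num: "0"              # hypothetical per-shard label
        topologyKey: kubernetes.io/hostname
```

Under p2 & p3 all pods of a statefulset share one pod template, so an equivalent per-shard rule cannot be expressed there.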
So I think the current way is actually more functional than the other two, and having the very same configuration for each replica under one workload may be more understandable.
wdyt? @f41gh7 @Amper @hagen1778
plz cc @valyala
@f41gh7 @tenmozes @k1rk can we have your opinion here?
Possible cases when affinity and advanced shard scheduling make sense:
- shard replicas must be in different zones. Supported by (p1, p3).
- each shard must be on a different host machine (no such requirement for replicas). Supported by (p1, p2, p3).
- any other use cases?
A possible advantage of using p2: upgrades without data loss with `replicas == 1`. But it works only with deployments and I believe it's not a production-ready case.
So, I suggest changing the current implementation. For the case when replicas must be in different zones, the user must create multiple VMAgent installations.
Additional note: if shard configuration is defined, only StatefulSet must be used. Deployments will no longer be supported.
Let's implement p2:
> p2 (proposed in this issue): create only one statefulset for each vmagent CR, with `vmagent.shardCount` -> statefulset replicas and `vmagent.replicaCount` -> `promscrape.cluster.replicationFactor`.
It's a breaking change and we're fine with it. We have to mention it in our changelog.
Main motivation: use vmagent's native mechanism for sharding (`promscrape.cluster.replicationFactor` isn't in use currently).
With sharding enabled, ONLY StatefulSet is supported (`statefulMode` is enabled implicitly).
FYI, @hagen1778 @Haleygo
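For illustration, a hedged sketch of what the operator could render under p2 for `shardCount: 3`, `replicaCount: 2` (names, labels and image are assumptions; only the relevant parts are shown):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vmagent-example
spec:
  replicas: 3                       # one pod per shard (spec.shardCount)
  serviceName: vmagent-example      # headless service for stable per-pod DNS
  template:
    spec:
      containers:
        - name: vmagent
          image: victoriametrics/vmagent
          args:
            - -promscrape.cluster.membersCount=3
            - -promscrape.cluster.replicationFactor=2
            # memberNum is taken from the pod name ordinal, e.g. via the downward API:
            - -promscrape.cluster.memberNum=$(POD_NAME)
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
```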
Dropping p1 means you can no longer ensure that the replicas are not in the same zone; with p2 this is not doable. Also, with p2 the link in `memberURLTemplate` will get broken when one replica is down. Do you plan to produce proper svc's for that?
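Regarding proper svc's: one option would be a headless Service backing the StatefulSet, so every member keeps a stable DNS name that `memberURLTemplate` links could point at (all names below are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vmagent-example
spec:
  clusterIP: None                   # headless: one DNS record per pod
  selector:
    app.kubernetes.io/name: vmagent
    app.kubernetes.io/instance: example
  ports:
    - name: http
      port: 8429                    # vmagent default listen port
```

With that, member N stays reachable at `vmagent-example-N.vmagent-example.<namespace>.svc`, though a link still fails if that particular pod itself is down.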