CFP-41953: add ClusterMesh Service v2
This CFP proposes introducing v2 of the ClusterMesh global service data format stored in etcd and a transition from cilium/state/services/v1/ to cilium/state/services/v2/.
The benchmarks in this CFP are based on this (WIP/PoC) code: https://github.com/MrFreezeex/cilium/tree/bench-clusterservice/bench
cc @cilium/sig-clustermesh
Hi everyone :wave: sorry for the wait, I finally found time to investigate a bit more here and pushed the second iteration of this CFP.
You can find more details in the commit description, but apart from addressing some points raised by @giorio94 (thanks again!), I realized that it is more efficient to encode the address as a string type when compressing via zstd, even though the uncompressed data is heavier. The second important point is about debuggability: allowing the data to be dumped in JSON/YAML and having some hive script command to test with the JSON format. Those would most likely require some special-casing in the generic kvstore code, which doesn't look that great, but at least the UX would be good. If you have suggestions about that (and other things of course), feel free!
side note: I think there are multiple valid concerns that this CFP describes, but it's a bit unclear to me what user story we are trying to solve. Within the motivation sections we have:
- No backend conditions for load balancing - I think this needs a bit more description of why we need them and what case we want to improve/solve with it (maybe there is another way?)
- Simplification of EndpointSliceSync logic by adding EndpointSlice name
- 1.5 MiB limit in Etcd that limits the size of a service to less than ~10k backends if we extend ClusterService with the above proposals (rough arithmetic below)
- Reducing in-cluster network traffic between the Cilium Agent and Clustermesh's Etcd
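For a rough sense of where the ~10k figure comes from (back-of-the-envelope; the per-backend size is an assumption): if each backend entry grows to roughly 150 bytes of JSON once conditions and the EndpointSlice name are added, then 1.5 MiB ≈ 1,572,864 bytes and 1,572,864 / 150 ≈ 10,500 backends per service before hitting the etcd value size limit.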
I thought about it a little bit more, and what I am mostly concerned about is the added complexity. I do agree that the reduction in network bandwidth is nice, and lowering the likelihood of hitting the 1.5 MB limit is an improvement as well.
However, when I think about it from a high-level design perspective I would start with the following questions:
- Why do we do it only for services and not all other resources? Having the same mechanism (whatever we choose, like compression or protobuf) for the other resources as well would decrease the complexity of the entire solution. I would really prefer not to have protobuf for one resource and JSON for the others, as that would definitely increase complexity and make it harder to maintain.
- I do believe we are mixing (just a little bit) two different problems together. One is the size of events / network bandwidth, while the other is the rate of events (the "Add backend conditions (or state) to ClusterMesh Services" part). I think these two problems may have two different, independent solutions. I do not think your proposed solution fully addresses the problem of an increased rate of events once backend conditions are added. For large services, we can expect constant churn of backends (crashing/restarting pods etc). Reducing the amount of data that is transferred is definitely an improvement, but I would actually also be concerned about the Cilium Agent's ability to reconcile it for large services.
- If we are concerned about the 1.5 MB limit, we are improving on it, but not solving it. We could actually go a totally different way by capping the total number of endpoints exposed or introducing slicing. Note that the default size of the eBPF map for service backends is 65k entries in total, across all clusters and all services. I would expect this limit to affect users more often than single-service size.
For the backend conditions part itself, why can't we have per-service conditions and filter the backends before writing them to Etcd? AFAIK, the high-level routing decision for services is:
- route to ready backends
- if no ready backends, route to terminating backends
- if no terminating and ready backends, route to non ready backends
It seems to me that if one cluster exposes both ready and terminating backends, remote clusters would actually only use ready backends, or did I miss something? That would mean we can just expose terminating or ready backends only instead :thinking: ?
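To make that priority explicit, here is a minimal Go sketch of the decision order described above (illustrative only; the field names are assumptions and this is not the actual datapath code):

```go
package main

// Backend mirrors the per-backend state discussed above; field names are
// illustrative, not the actual Cilium types.
type Backend struct {
	Addr        string
	Ready       bool
	Terminating bool
}

// selectBackends applies the priority described above: prefer ready backends,
// fall back to terminating ones, and only then to not-ready ones.
func selectBackends(all []Backend) []Backend {
	var ready, terminating, notReady []Backend
	for _, b := range all {
		switch {
		case b.Ready && !b.Terminating:
			ready = append(ready, b)
		case b.Terminating:
			terminating = append(terminating, b)
		default:
			notReady = append(notReady, b)
		}
	}
	if len(ready) > 0 {
		return ready
	}
	if len(terminating) > 0 {
		return terminating
	}
	return notReady
}
```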
Some alternative proposals:
- Slice backends just like EndpointSlice + maybe additional compression for all resources. Just slicing without compression already reduces the amount of data transferred significantly + reduces the processing overhead within the Cilium Agent (fewer backends to update/process per event etc). Also, if we go with compression, less data to decompress / lower overhead per event etc.
- Cap the max number of backends per Service (for example 1k) + maybe additional compression for all resources. I actually somewhat like capping the max number of backends per Service even though it doesn't sound that great initially. By default, the eBPF maps responsible for Service backends have a size of ~65k entries.
I do not think considering 10k or 50k backends per service makes sense without addressing eBPF map sizing as well :thinking:
Hi @marseel :wave: thanks for looking into this!
- Why do we do it only for services and not all other resources? Having the same mechanism (whatever we choose, like compression or protobuf) for the other resources as well would decrease the complexity of the entire solution. I would really prefer not to have protobuf for one resource and JSON for the others, as that would definitely increase complexity and make it harder to maintain.
The main reason is that it's the only resource with an unbounded number of entries / unbounded size. It's also a resource specific to clustermesh; IIRC the only other resource used solely for clustermesh (and not the kvstore in general) is the serviceexport one specific to the MCS-API implementation, so the fact that it's not used by the kvstore should simplify the transition a bit. That being said, I agree that having one encoding mechanism would be best in general if we are able to.
- I do believe we are mixing (just a little bit) two different problems together. One is the size of events / network bandwidth, while the other is the rate of events (the "Add backend conditions (or state) to ClusterMesh Services" part). I think these two problems may have two different, independent solutions. I do not think your proposed solution fully addresses the problem of an increased rate of events once backend conditions are added. For large services, we can expect constant churn of backends (crashing/restarting pods etc). Reducing the amount of data that is transferred is definitely an improvement, but I would actually also be concerned about the Cilium Agent's ability to reconcile it for large services.
Right, my main point in the CFP regarding the rate of events is that decompressing/preparing for statedb insertion is two times faster with protobuf, but that might indeed not be sufficient. I was thinking about having a "buffer" like the Cilium loadbalancer code does when reading EndpointSlice/Service changes, but on the export side instead. IIRC the existing buffer before importing is 500 events or 500ms before taking those changes into account, essentially coalescing multiple updates. If we were to do something equivalent when exporting the "ClusterService v2" it should make this more approachable I think ~. I thought about this recently and was thinking that it could be done separately from this CFP / later on, but I could add it here too!
EDIT: also note that I haven't benchmarked it, but I think there are ways to further optimize the insertion of cluster services into statedb, which might push the decompression/insertion into statedb beyond two times faster compared to today.
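For reference, a rough sketch of what such an export-side coalescing buffer could look like (the thresholds mirror the 500 events / 500ms import-side buffer; the types, the flush callback and the use of a ticker are all illustrative assumptions):

```go
package main

import "time"

// Event is a placeholder for a pending service/backend change to export.
type Event struct{ Key string }

// coalesce accumulates events keyed by service, so that later updates to the
// same service overwrite earlier ones, and flushes the batch either when
// maxBatch distinct keys have accumulated or on the next tick of maxWait.
func coalesce(in <-chan Event, flush func(map[string]Event), maxBatch int, maxWait time.Duration) {
	pending := make(map[string]Event)
	ticker := time.NewTicker(maxWait)
	defer ticker.Stop()

	emit := func() {
		if len(pending) > 0 {
			flush(pending)
			pending = make(map[string]Event)
		}
	}
	for {
		select {
		case ev, ok := <-in:
			if !ok {
				emit() // flush whatever is left before shutting down
				return
			}
			pending[ev.Key] = ev
			if len(pending) >= maxBatch {
				emit()
			}
		case <-ticker.C:
			emit()
		}
	}
}
```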
- If we are concerned about the 1.5 MB limit, we are improving on it, but not solving it. We could actually go a totally different way by capping the total number of endpoints exposed or introducing slicing. Note that the default size of the eBPF map for service backends is 65k entries in total, across all clusters and all services. I would expect this limit to affect users more often than single-service size.
It kind of solves it, because the compression makes the object much smaller and you could not conceivably reach this limit within what Kubernetes/Cilium support. I wasn't aware of the 65k backends limit and was counting more on Kubernetes supporting a max of 150k backends overall. But indeed, accounting for the 65k backends overall, a theoretical limit around 10k could be just fine. My angle is more about how to add the new fields without regressing in a way we don't want, rather than adopting protobuf specifically, so I would be more than happy with alternative solutions as well (or, if the limit with the added fields is deemed ok, we could even just add those fields now and optimize later?)!
For the backend conditions part itself, why can't we have per-service conditions and filter the backends before writing them to Etcd? AFAIK, the high-level routing decision for services is:
- route to ready backends
- if no ready backends, route to terminating backends
- if no terminating and ready backends, route to non ready backends
It seems to me that if one cluster exposes both ready and terminating backends, remote clusters would actually only use ready backends, or did I miss something? That would mean we can just expose terminating or ready backends only instead 🤔 ?
The high-level routing seems correct; I'm not entirely sure the last point, "if no terminating and ready backends, route to non ready backends", is accurate. I would maybe have thought that it would never route new connections to non-ready backends, but I don't know whether it does that or not. IIUC there is also something about not removing existing entries for terminating/non-ready backends for existing connections, which might prevent us from removing non-ready backends. It would also remove those from EndpointSliceSync, but this part might just be ok!
I do not think considering 10k or 50k backends per service makes sense without addressing eBPF map sizing as well 🤔
Indeed! Although it might be simpler to make ClusterMesh almost unbounded and let the eBPF map be the only limit. There might be scenarios where one service in one cluster of your mesh goes higher than 10k while the rest doesn't go over 65k overall. I agree that this looks unlikely though :sweat_smile:. It seems that the eBPF map can be easily tuned by the user, so we could just add something in the documentation saying that if you have more than 65k backends across your whole mesh you should consider bumping this limit :thinking: (and it would most likely even be worthwhile to do this with the current documentation / not as part of this CFP!).
Some alternative proposals:
- Slice backends just like EndpointSlice + maybe additional compression for all resources. Just slicing without compression already reduces the amount of data transferred significantly + reduces the processing overhead within the Cilium Agent (fewer backends to update/process per event etc). Also, if we go with compression, less data to decompress / lower overhead per event etc.
- Cap the max number of backends per Service (for example 1k) + maybe additional compression for all resources. I actually somewhat like capping the max number of backends per Service even though it doesn't sound that great initially. By default, the eBPF maps responsible for Service backends have a size of ~65k entries.
1k might be a bit too small, but we could indeed have a hard, properly defined limit per service per cluster! And I am open to the slice approach too!
Thanks @MrFreezeex for iterating on the CFP, and sorry for the long delay.
I think there are multiple valid concerns that this CFP describes, but it's a bit unclear to me what user story we are trying to solve. [...] what I am mostly concerned about is the added complexity
I personally second this. Overall your proposal makes sense to me, but I'm struggling a bit to find a clear motivation that justifies the increased complexity and risk of regression intrinsic to a significant change like this.
I would really prefer not to have protobuf for one resource and JSON for the others, as that would definitely increase complexity and make it harder to maintain.
Agreed. I'm personally not a big fan of protobuf in this context as it introduces the requirement of knowing the schema to be able to decode the content from etcd, which is not the case currently.
IIRC the only other resource used solely for clustermesh (and not the kvstore in general) is the serviceexport one specific to the MCS-API implementation, so the fact that it's not used by the kvstore should simplify the transition a bit. That being said, I agree that having one encoding mechanism would be best in general if we are able to.
This is simpler if we only want to switch to compressing the json objects, as we could easily support both options, with the plain uncompressed version being still used in kvstore mode, and the compressed one by clustermesh-apiserver and kvstoremesh, making use of capabilities to decide the format. The logic to handle compression/decompression should be fairly trivial anyways. I wouldn't change all resources to protobuf instead, as keeping both variants at the same time is likely to add way more complexity (and the risk of divergences would grow over time).
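For reference, the helpers themselves would indeed be small; a rough sketch assuming a zstd library such as github.com/klauspost/compress/zstd (the library choice and the helper names are assumptions, not necessarily what the implementation would use):

```go
package main

import "github.com/klauspost/compress/zstd"

// compressValue compresses an already-marshalled JSON value before it is
// written to etcd by the clustermesh-apiserver/kvstoremesh.
func compressValue(jsonData []byte) ([]byte, error) {
	enc, err := zstd.NewWriter(nil) // nil writer: we only use EncodeAll
	if err != nil {
		return nil, err
	}
	defer enc.Close()
	return enc.EncodeAll(jsonData, nil), nil
}

// decompressValue restores the JSON payload on the agent side before the
// usual unmarshalling.
func decompressValue(compressed []byte) ([]byte, error) {
	dec, err := zstd.NewReader(nil) // nil reader: we only use DecodeAll
	if err != nil {
		return nil, err
	}
	defer dec.Close()
	return dec.DecodeAll(compressed, nil)
}
```

In practice the encoder/decoder would be created once and reused rather than per value, but that doesn't change the overall shape.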
For large services, we can expect constant churn of backends (crashing/restarting pods etc). Reducing the amount of data that is transferred is definitely an improvement, but I would actually also be concerned about the Cilium Agent's ability to reconcile it for large services.
Right, my main point in the CFP regarding the rate of events is that decompressing/preparing for statedb insertion is two times faster with protobuf, but that might indeed not be sufficient.
I wonder if a practical way to approach this could be to extend the cmapisrv-mock component to better simulate service and backend churn, and then run a few scale tests with that, similarly to what we already do in https://github.com/cilium/cilium/blob/main/.github/workflows/scale-test-clustermesh.yaml. That would help to get a better understanding of where the bottlenecks possibly are, and how much impact the actual parsing of JSON objects has. Maybe we discover that the bottleneck is somewhere else, and simply switching to protobuf would not help. Or maybe we see that it is the actual problem, which would be a stronger motivation for going in that direction.
IIRC the existing buffer before importing is 500 events or 500ms before taking those changes into account, essentially coalescing multiple updates.
This is definitely a possibility that makes sense to explore and validate with the scale tests. We already have some intrinsic coalescing given the back pressure introduced by etcd client rate limiting (both in the clustermesh-apiserver and at the kvstoremesh level), but explicit coalescing there may help even further.
For the backend conditions part itself, why can't we have per-service conditions and filter the backends before writing them to Etcd? AFAIK, the high-level routing decision for services is:
- route to ready backends
- if no ready backends, route to terminating backends
- if no terminating and ready backends, route to non ready backends

It seems to me that if one cluster exposes both ready and terminating backends, remote clusters would actually only use ready backends, or did I miss something? That would mean we can just expose terminating or ready backends only instead 🤔 ?
I think (but I'm not totally sure) that in general the reason for keeping the terminating backends in endpointslices is to preserve already existing connections to them until fully drained, while having new connections be load balanced to ready backends only. From this point of view the current behavior of global services is suboptimal, because we don't respect them. OTOH, it is a bit unclear to me whether adding that support would justify the extra churn, considering that the propagation of information may be (possibly significantly) delayed anyways especially in high scale environments. Again, user feedback and/or scale testing may help to provide guidance here.
I personally second this. Overall your proposal makes sense to me, but I'm struggling a bit at finding a clear motivation that justifies the increased complexity and risk of regression intrinsic in a significant change like this.
I think one of the main motivations is that there are some fields that would be quite nice to add, and we are in a situation where we can't really comfortably add those fields (or any per-backend fields in general) without some important regressions: the rate of updates would increase by adding conditions, the object size would grow and risk hitting the 1.5 MiB per-object limit, and the overall bandwidth needed would be severely impacted by both the rate of updates and the object size. Even without accounting for those new fields, improving some of those parameters would also be a motivation for this!
Also, related to the object format, our current ClusterService struct seems fitted to how Cilium worked internally when ClusterMesh was first introduced (I think?), and we might be better off fitting it to something that looks more like the Kubernetes resource structs, so that we are safe from any future refactoring of the load balancer structs, as those are only used internally. This would ensure we can adopt any important new fields of the EndpointSlice struct, with just a conversion to the Cilium load balancer internal structs similar to what Cilium does when reading from the kube-apiserver.
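Purely to illustrate the direction (a hypothetical shape, not the format this CFP settles on), a per-backend entry modeled after the EndpointSlice API could look roughly like this:

```go
package main

// BackendV2 is a hypothetical ClusterService v2 backend entry shaped after
// discovery.k8s.io/v1 EndpointSlice endpoints rather than Cilium's internal
// loadbalancer structs; field names are illustrative only.
type BackendV2 struct {
	Addresses     []string          `json:"addresses"`
	Conditions    BackendConditions `json:"conditions"`
	EndpointSlice string            `json:"endpointSlice,omitempty"` // source EndpointSlice name, for EndpointSliceSync
	Zone          string            `json:"zone,omitempty"`
}

// BackendConditions mirrors the EndpointSlice conditions we would like to
// propagate across clusters.
type BackendConditions struct {
	Ready       *bool `json:"ready,omitempty"`
	Serving     *bool `json:"serving,omitempty"`
	Terminating *bool `json:"terminating,omitempty"`
}
```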
This is simpler if we only want to switch to compressing the json objects, as we could easily support both options, with the plain uncompressed version being still used in kvstore mode, and the compressed one by clustermesh-apiserver and kvstoremesh, making use of capabilities to decide the format.
Hmm maybe yes, we could even have some logic to detect whether something is compressed or not and transparently decompress it or use it as is (this is only food for thought, it might not be ideal in our case). I am not sure it's worth compressing any other objects as I think they are all rather small uncompressed anyway? Maybe for CES if it makes its way to the kvstore with a format containing multiple endpoints, but I don't think there is an initiative started for this (and no idea if it's something relevant either).
In case we actually change the service format / do a service v2, it would be relatively easy to have the v2 data compressed (independently of using protobuf or json). In case we want to do the same for the rest of the objects, we would most likely need to check whether it's worth it or not with smaller objects :thinking:. It could even be a separate initiative ~.
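For the transparent-detection idea, the check itself could be as simple as looking at the value prefix (a sketch; relying on the zstd magic number and on plain values always being JSON objects starting with '{' are both assumptions):

```go
package main

import "bytes"

// isCompressed reports whether an etcd value looks like a zstd frame rather
// than plain JSON, based on the zstd frame magic number 0xFD2FB528
// (bytes 28 B5 2F FD on the wire). A reader could branch on this to decide
// whether to decompress before unmarshalling.
func isCompressed(value []byte) bool {
	return bytes.HasPrefix(value, []byte{0x28, 0xB5, 0x2F, 0xFD})
}
```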
I wonder if a practical way to approach this could be to extend the cmapisrv-mock component to better simulate service and backend churn, and then run a few scale tests with that, similarly to what we already do in https://github.com/cilium/cilium/blob/main/.github/workflows/scale-test-clustermesh.yaml. That would help to get a better understanding of where the bottlenecks possibly are, and how much impact the actual parsing of JSON objects has. Maybe we discover that the bottleneck is somewhere else, and simply switching to protobuf would not help. Or maybe we see that it is the actual problem, which would be a stronger motivation for going in that direction.
Yep, indeed it sounds like it would be worth testing something like this! I am wondering if it wouldn't be simpler to test with two actual clusters: just create a service with something like 5k or 10k endpoints, change small things at a high rate to simulate pods failing readiness checks, check how the agent behaves, and most likely gather some profiling info there. Integrating something like that into the scale tests sounds better long term, but the two-cluster approach might be simpler for me, who doesn't know too much about the existing scale tests, to quickly discover whether there is a bottleneck here. Would of course be nice to have a test like this integrated in the scale tests long term though!
I think (but I'm not totally sure) that in general the reason for keeping the terminating backends in endpointslices is to preserve already existing connections to them until fully drained, while having new connections be load balanced to ready backends only.
Yep, I understood that too! I don't know what ready=false vs terminated=false means exactly though, but this is most likely what @marseel described in their message (or at least very close to that :sweat_smile:!).
OTOH, it is a bit unclear to me whether adding that support would justify the extra churn, considering that the propagation of information may be (possibly significantly) delayed anyways especially in high scale environments. Again, user feedback and/or scale testing may help to provide guidance here.
If I didn't miss any code that would ignore those, the current behavior is that we just do not filter out any ready=false or terminated=true backends and treat them like fully active backends. So this means that even for a very low scale deployment, you might have a deployment in one cluster where pods never became ready, and we would treat them as ready pods in other clusters. This seems like a pretty big deal and breaks many assumptions around Services handling :/. Ideally, if we were to run the Kubernetes network conformance tests over 2 clusters somehow (which sounds hard and will most likely not happen), we should aim to theoretically pass them, and IMO if our architecture can't scale with those assumptions we should try to reasonably change something to fix that!
I am not convinced that most users are aware of the current limitation at all. We most likely have some users that are not aware of this and would consider this limitation a "deal breaker" for Cluster Mesh!
So action items for me:
- Discover whether a high churn rate of large service JSON objects is currently a problem or not
- If yes, check to what extent we currently coalesce and determine if export coalescing would help
It seems both of you are not convinced by protobuf, so it would most likely come down to some subset of the following things if we keep all backends in one object:
- Decide if we want to add conditions in backends (and endpointslice name)
- Decide if we want to change the format of the ClusterService (most likely something similar to what this CFP is proposing but encoded in JSON instead of protobuf)
- Decide if we want to compress the object (and if we include in the scope all other type of resources or not)
- Decide if we just hard limit the number of Endpoints to something that would work with our architecture
- If we change something major (new struct format/compression) decide how to handle the transition
And there is also the slice approach, which would solve some of the points above but introduce other concerns (mainly buffering when reading the slices).
Let me know your thoughts on that; there are most likely some points above that we can discuss before I get the data on service churn and coalescing!
I think one of the main motivations is that there are some fields that would be quite nice to add, and we are in a situation where we can't really comfortably add those fields (or any per-backend fields in general) without some important regressions: the rate of updates would increase by adding conditions, the object size would grow and risk hitting the 1.5 MiB per-object limit, and the overall bandwidth needed would be severely impacted by both the rate of updates and the object size. Even without accounting for those new fields, improving some of those parameters would also be a motivation for this!
Sure, I agree with all of these reasons in theory. I think we are just lacking enough data at the moment to confirm whether (and which) of them would be a blocker, and how much benefit we could achieve by changing the format. Basically, justifying the extra complexity and risk of regressions with clear advantages. I'd personally suggest focusing mostly on relatively low (e.g., <25 backends), medium (e.g., 25-250 backends) and high (e.g., 250-2500 backends) scale, and not that much on super-high scale (e.g., > 5k backends) in a first phase.
I am not sure it's worth compressing any other objects as I think they are all rather small uncompressed anyway?
I agree there would not be a clear advantage in compressing the other objects, given that they are fairly small (and their size is bounded). If we were to do that, it would be mostly for consistency (depending on the implementation it may be convenient to do the same for all resources), and not really for the size benefits (assuming that the overhead would be negligible, otherwise that would be probably off the table).
Maybe for CES if it makes its way to the kvstore with a format containing multiple endpoints, I don't think there is an initiative started for this [...].
Yep, not that I'm aware of.
Yep, indeed it sounds like it would be worth testing something like this! I am wondering if it wouldn't be simpler to test with two actual clusters: just create a service with something like 5k or 10k endpoints, change small things at a high rate to simulate pods failing readiness checks, check how the agent behaves, and most likely gather some profiling info there. Integrating something like that into the scale tests sounds better long term, but the two-cluster approach might be simpler for me, who doesn't know too much about the existing scale tests, to quickly discover whether there is a bottleneck here. Would of course be nice to have a test like this integrated in the scale tests long term though!
I don't have a strong preference: it could be an extension of the cmapisrv-mock component (it already supports mocking service and backend objects here, but it would likely need some tuning and would also require creating the service objects on the main cluster; we can chat offline about how this works if needed), or something else (for instance the loadbalancer package has some benchmarks that could be interesting to look into). IMO the most important parts are that they are reproducible, so that we can keep using them in the future to track improvements (or possible regressions), and that they perform the E2E operations to catch all bottlenecks (at least in the context of a single component, agent or clustermesh-apiserver).
the current behavior is that we just do not filter out any ready=false or terminated=true backends and treat them like fully active backends. [...] This seems like a pretty big deal and breaks many assumptions around Services handling
Ah, right, I didn't consider that one. Yep, I agree this is definitely a bigger limitation compared to the handling of the terminating condition, and we should improve on that. We may be able to get pretty far with filtering at the source as Marcel mentioned though (one case to pay attention to is if the service has publishNotReadyAddresses set). Probably not as perfect as propagating all conditions (which we may still want to do longer term), but it also requires very few changes.
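A minimal sketch of that source-side filtering, assuming it hooks in wherever the EndpointSlices are read today (the function and its placement are hypothetical; only the condition handling is the point):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	discoveryv1 "k8s.io/api/discovery/v1"
)

// exportableEndpoints drops not-ready endpoints before their backends are
// written to etcd, unless the service opted into publishNotReadyAddresses.
func exportableEndpoints(svc *corev1.Service, eps []discoveryv1.Endpoint) []discoveryv1.Endpoint {
	if svc.Spec.PublishNotReadyAddresses {
		return eps
	}
	out := make([]discoveryv1.Endpoint, 0, len(eps))
	for _, ep := range eps {
		// A nil Ready condition is interpreted as ready, following the
		// EndpointSlice API convention.
		if ep.Conditions.Ready == nil || *ep.Conditions.Ready {
			out = append(out, ep)
		}
	}
	return out
}
```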
I ran a small, hacky benchmark which is not that easily repeatable (certainly less easy than running go test or a CI job), but it helped discover a few things. I created one global service with 5k "dummy" backends / 50 full EndpointSlices (with IPs that do not point to anything and are not linked to any pods). Each endpoint has 2 ports and a single-stack IPv4 address, which resulted in the ClusterService weighing ~450 KiB (note that I don't think there was any zone info in my kind cluster). The setup was a pretty standard kind-clustermesh dev setup on my laptop.
I also had a small script that wrote a specific label with a counter every 100ms to force the ClusterService to be regenerated, so that the cilium-agent would watch it, trigger the full JSON decoding/statedb insert, and simulate some churn that could be caused by readiness changes. I also added some debug prints with the service name and my service label counter in operator/watchers/service_sync.go and in pkg/clustermesh/service_merger.go in MergeExternalServiceUpdate.
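The churn generator was roughly equivalent to something like this (a simplified sketch, not the exact script; it assumes the counter is patched as a label on the Service object and the label key is arbitrary):

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// bumpLabel patches an incrementing counter label onto the global Service
// every 100ms, forcing the ClusterService to be regenerated and exported.
func bumpLabel(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	for i := 0; ; i++ {
		patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{"bench-counter":"%d"}}}`, i))
		if _, err := cs.CoreV1().Services(ns).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
}
```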
I noticed that the agents were processing each update sequentially and that processing was slower than the rate of incoming updates, which means the agent was gradually lagging behind. It looks like we would need to add a queue or buffer somewhere if we want to coalesce updates (and not only rely on the etcd rate limit).
I collected some pprof from one agent pulling the clustermesh services:
- https://pprof.me/7d6eea6f917dd43596fa7e8d4bc7691d (cpu)
- https://pprof.me/422106b12ad68d6d63de11c6107c4da6 (memory)
The most significant processing is in statedb. I compared the codepath from the k8s reflector https://github.com/cilium/cilium/blob/9beb3286c4d2a64fe6fbd0c197d2483b9ff76307/pkg/loadbalancer/reflectors/k8s.go#L340 vs clustermesh https://github.com/cilium/cilium/blob/9beb3286c4d2a64fe6fbd0c197d2483b9ff76307/pkg/clustermesh/service_merger.go#L84. The main differences seem to be that the k8s reflector keeps the previous backends outside statedb and cleans them up before insertion, while in clustermesh we orphan backends afterwards by "querying" statedb. The k8s reflector also uses sequences while clustermesh doesn't (and it seems that the previous point helps with that). Those two things might help with memory usage (and also CPU-wise, since ~25% of the agent time in my "benchmark" is spent in GC).

The k8s reflector also writes a single transaction for multiple service updates thanks to its buffer, which triggers upserts only after 500ms or 500 service updates, but the Commit functions only account for 2.43% of my CPU pprof so it might not be that important (maybe it affects other things in the flamegraph though?). So it seems we have some possibilities to improve the statedb usage, I think? I'm not really an expert in this codepath though, and I might not have understood everything. Anyway, this is almost unrelated to this CFP, but at least it helps with understanding where some of our bottlenecks are for very large services. The k8s reflector code also uses the EndpointSlice name to compare the previous and current backend state; it could probably be done a bit differently, but doing the same could mean having more similar (or even common?) code, so having this field might help in paths other than EndpointSliceSync.
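To make the difference concrete, here is a heavily simplified sketch of the "keep the previous backends outside the table and diff before committing" approach used by the k8s reflector (types and method names are illustrative stand-ins, not the real statedb API):

```go
package main

// Minimal, hypothetical stand-ins for the real types.
type Backend struct{ Addr, Port string }

type backendKey struct{ addr, port string }

type BackendTable interface {
	Insert(Backend)
	Delete(backendKey)
}

// serviceState remembers the backends previously written for one service
// outside the table, so deletions can be computed by diffing instead of
// querying the table for orphans afterwards (what clustermesh does today).
type serviceState struct {
	prev map[backendKey]struct{}
}

func (s *serviceState) apply(tbl BackendTable, backends []Backend) {
	next := make(map[backendKey]struct{}, len(backends))
	for _, be := range backends {
		k := backendKey{be.Addr, be.Port}
		next[k] = struct{}{}
		tbl.Insert(be) // upsert the current backends
	}
	for k := range s.prev {
		if _, ok := next[k]; !ok {
			tbl.Delete(k) // backend disappeared since the last update
		}
	}
	s.prev = next
}
```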
JSON decoding is also directly responsible for about 11.44%, which is about 1/5 of the statedb processing. So while it's true that it isn't the most impactful thing, it isn't small either, and it might become a bit more significant if we optimize the statedb usage. I think our current concurrency model is essentially one goroutine per cluster, which should make JSON decoding concurrent per cluster and makes it less important to fix; there is also the new encoding/json/v2, which is apparently significantly faster and makes it even less important.
About concurrency, IIUC there are locks when writing to statedb, so concurrency might not help that much outside of decoding unfortunately (and that would explain why the k8s reflector is single-threaded). Without much coalescing, it seems quite feasible to get into a state where you would lag behind if you have multiple large services (like ~1000 backends) churning at the same time.
So overall, I think that we can't really handle large services with a good amount of churn (in my test, 10 updates per second) currently, and whether we are filtering out ready=false backends or introducing conditions shouldn't change the churn that much I think? This is not something that can't be fixed, but it seems that we would need some optimizations before doing any of that (at least some coalescing), while having proper conditions would make us fully conformant in draining connections.
On the slice vs non-slice front, I think it would mainly be a question of where we want to coalesce. If we have slices, we can't really coalesce meaningfully on the export path, while if we don't have slices, we can easily coalesce on the export path, which would also directly help with the etcd rate limit and network bandwidth (if we choose to compress, the network bandwidth gain might not be that meaningful though). For both, we can also coalesce on the import path to have a single statedb commit for multiple services; I don't know if that matters a lot or if the rest of the statedb usage optimizations would already be enough :thinking:. Also, in case we want to coalesce on both export and import, we might have to be careful about not introducing too much latency (waiting 500ms on export plus 500ms on import could be too much). Regarding EndpointSliceSync, queuing every EndpointSlice of a Service in a non-slice approach should be just fine, so it might not be super meaningful there.
EDIT: actually the export side is fine, since the etcd rate limit and the workqueue on the syncstore should already effectively coalesce things!
Ok! So I made a proper benchmark, available here: https://github.com/MrFreezeex/cilium/tree/bench-clustermesh/pkg/clustermesh/benchmark. I based it on the loadbalancer benchmark (thanks @giorio94 for the pointer) so that I can more easily compare the loadbalancer k8s reflector and clustermesh, and run it without needing actual Kubernetes clusters. It confirms some of the points I discovered with the hacky benchmark above!
Following these new insights and our discussion, I rewrote almost the entire CFP; it now focuses a lot more on aligning with the loadbalancer k8s reflector, as there are some major performance differences in our current statedb usage. Let me know what you think about the new approach!
Ok! So I made a proper benchmark, available here: https://github.com/MrFreezeex/cilium/tree/bench-clustermesh/pkg/clustermesh/benchmark. I based it on the loadbalancer benchmark (thanks @giorio94 for the pointer) so that I can more easily compare the loadbalancer k8s reflector and clustermesh, and run it without needing actual Kubernetes clusters. It confirms some of the points I discovered with the hacky benchmark above!
Thanks! I haven't managed to look into all the details yet, but I think it would be valuable to get some profiling data from the benchmarks, and then involve @joamaki in the discussion if the statedb usage turns out to be the biggest bottleneck in this context.
Here are two pprof profiles collected by the new clustermesh benchmark, for 1000 backends and 5000 backends:
- https://pprof.me/7f52873b4c4f4294ffc5f428974291e4/
- https://pprof.me/6e25bd135ba42dfe020839e9c0c0f8d0/