Maximum Number of Services and Pods Supported
Motivation: Our production use case requires support for a very large number of services and instances
1. What we did
Environment Details:
- Kubernetes: 1.28 (CCE)
- OS: HCE 2.0
- Istio: 1.19
- Kmesh version: release 0.5
- CPU: 8
- Memory: 16 GiB
We started scaling up in batches of 1000 services using the YAML file and command below; a quick way to check progress is shown after the command.
- yaml file
# svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: foo-service
  labels:
    foo: bar
spec:
  clusterIP: None
  selector:
    app: foo
  ports:
  - port: 5678
- scaling up command
$ for i in $(seq 1 1000); do sed "s/foo-service/foo-service-0-$(date +%s-%N)/g" svc.yaml | kubectl apply -f -; done
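To track how far the scale-up has progressed after each batch, a rough cluster-wide Service count can be taken with kubectl (a sketch; filter by namespace if you only want the test services):
# Count all Services in the cluster after each batch of 1000.
kubectl get services --all-namespaces --no-headers | wc -l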
2. What we observed
At around 16k services, Kmesh started emitting error logs (attached: kmesh_error_logs.txt).
3. Why we think this is an issue:
For Kmesh to be suitable for our use case, we need support for a much larger number of services and instances (50K+).
@lec-bit Did you figure it out?
I got the same errors after 16k services. We designed it based on 5,000 services and 100,000 pods. https://github.com/kmesh-net/kmesh/issues/318#issuecomment-2114550669
What error did you meet first?
Load Test For Maximum Number of Pods
We performed a load test using pilot-load to verify the maximum number of pods.
Observations
- After running this load test with 400 pods, Kmesh logs showed an invalid next size error.
- After running the test with 100 pods, Kmesh did not log any errors, but the bpf map only had entries for 35 of the 100 pods that were deployed.
Environment
- Kubernetes: 1.28
- OS: OpenEuler 23.03
- Istio: 1.19
- Kmesh version: release 0.5
- CPU: 8
- Memory: 16 GiB
Steps To Reproduce Error for 400 pods
- Make sure you have Kmesh release-0.5 and Istio running in your cluster
- Clone the pilot-load repo. Since Istio is already running in your cluster, make sure you delete lines 42 to 47 from the deploy.sh script.
- Follow the steps under Getting Started to set up Pilot Load on your cluster.
- Once Pilot Load is set up, create a config map for 400 pods using the below config file and command.
- config.yaml
nodeMetadata: {}
jitter:
  workloads: "110ms"
  config: "0s"
namespaces:
- name: foo
  replicas: 1
  applications:
  - name: foo
    replicas: 1
    instances: 400
nodes:
- name: node
  count: 5
- command
kubectl create configmap config-400-pod -n pilot-load --from-file=config.yaml --dry-run=client -o yaml | kubectl apply -f -
- In the file load-deployment.yaml, set volumes.name.configMap.name to config-400-pod and then run kubectl apply -f load-deployment.yaml. It will take 1 to 2 minutes for all the mock pods to get deployed.
- Check the logs of the Kmesh pod running on your actual Kubernetes node (Kmesh pods on mock nodes get stuck in the Pending state, which is expected since those nodes are mocked); an example command is shown after these steps. You will see the below error.
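One way to pull those Kmesh daemon logs is sketched below; it assumes Kmesh is installed in the default kmesh-system namespace, so adjust the namespace and pod name to your deployment:
# Find the Kmesh pod scheduled on the real (non-mock) node, then filter its logs.
kubectl -n kmesh-system get pods -o wide
kubectl -n kmesh-system logs <kmesh-pod-on-real-node> | grep -iE "error|malloc"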
Steps to Reproduce Error for 100 pods
- Perform steps 1-6 from the previous section, using the config.yaml below for 100 pods (step 5).
nodeMetadata: {}
jitter:
  workloads: "110ms"
  config: "0s"
namespaces:
- name: foo
  replicas: 1
  applications:
  - name: foo
    replicas: 1
    instances: 100
nodes:
- name: node
  count: 3
- Kmesh won't log any errors, but the bpf map will not contain all the pods that were deployed (in our test we only got 35). A way to inspect the map is sketched below.
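One way to count the map entries is with bpftool on the node (or inside the Kmesh pod). This is a sketch: the exact map names and output format can differ between Kmesh releases, so treat the grep pattern and the JSON counting step as assumptions to adapt.
# List the bpf maps Kmesh has loaded and note the id of the endpoint/backend map.
bpftool map show | grep -i kmesh
# Dump that map as JSON and count its entries; compare against the number of deployed pods.
bpftool -j map dump id <MAP_ID> | jq length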
Edit: Both of the above tests were replicated multiple times with the deployment order changed, i.e. deploying the mock pods first and Kmesh second. Here is what we observed:
- We still got the malloc error for 400 pods.
- We still got missing entries in the bpf map, but the number of missing entries was different in each rerun.
- The malloc error is sometimes worded differently. In most cases we get malloc(): invalid next size (unsorted), but in a few cases we also get malloc(): mismatching next->prev_size (unsorted).
@nlgwcy @hzxuzhonghu This seems like a critical bug; can you take some time to look into the root cause?
We found that it was caused by a pointer going out of bounds. The Endpoint__LocalityLbEndpoints struct exceeded the maximum value size of the inner_map, 1300 bytes, which is why this issue occurred. When we create 200 pods in a service, the pointer array needs 200 * sizeof(ptr) = 1600 bytes (with 8-byte pointers), which is larger than 1300. This problem can be avoided by manually increasing the maximum value size of the inner_map. Because of this, the current specification limits the number of pods under a single service; the total number of pods across services can be larger. If you need to create a larger cluster, you should spread the pods across more services. We will gradually optimize this in the future. For now we limit a service to a maximum of 150 pods.
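As a quick way to see whether any existing service is near that limit, endpoint counts can be listed per service. This is a sketch that assumes jq and awk are available and counts ready addresses only:
# Print every service whose Endpoints object has more than 150 addresses.
kubectl get endpoints --all-namespaces -o json \
  | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \([.subsets[]?.addresses[]?] | length)"' \
  | awk '$2 > 150 {print $1, "has", $2, "endpoints (over the 150-pod limit)"}'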