Maximum Number of Services and Pods Supported
Motivation: Our production use case requires support for a very large number of services and instances
1. What we did
Environment Details:
- Kubernetes: 1.28 (CCE)
- OS: HCE 2.0
- Istio: 1.19
- Kmesh version: release 0.5
- CPU: 8
- Memory: 16 GiB
We started scaling up in batches of 1000 services using the YAML file and command below; a quick way to check progress is shown after the command.
- yaml file
# svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: foo-service
  labels:
    foo: bar
spec:
  clusterIP: None
  selector:
    app: foo
  ports:
  - port: 5678
- scaling up command
$ for i in $(seq 1 1000); do sed "s/foo-service/foo-service-0-$(date +%s-%N)/g" svc.yaml | kubectl apply -f -; done
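To track how far the scale-up has progressed after each batch, a rough cluster-wide Service count can be taken with kubectl (a sketch; filter by namespace if you only want the test services):
# Count all Services in the cluster after each batch of 1000.
kubectl get services --all-namespaces --no-headers | wc -l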
2. What we observed
At around 16k services, Kmesh started emitting error logs (attached: kmesh_error_logs.txt).
3. Why we think this is an issue:
For Kmesh to be suitable for our use case, we need support for a much larger number of services and instances (50K+).
@lec-bit Did you figure it out?
I got the same errors after 16k services. We designed it based on 5,000 services and 100,000 pods. https://github.com/kmesh-net/kmesh/issues/318#issuecomment-2114550669
What error did you meet first?
Load Test For Maximum Number of Pods
We performed a load test using pilot-load to verify the maximum number of pods.
Observations
- After running this load test with 400 pods, Kmesh logs showed an invalid next size error.
- After running the test with 100 pods, Kmesh did not log any errors, but the bpf map only had entries for 35 of the 100 pods that were deployed.
Environment
- Kubernetes: 1.28
- OS: OpenEuler 23.03
- Istio: 1.19
- Kmesh version: release 0.5
- CPU: 8
- Memory: 16 GiB
Steps To Reproduce Error for 400 pods
- Make sure you have Kmesh release-0.5 and Istio running in your cluster
- Clone the pilot-load repo. Since Istio is already running in your cluster, make sure you delete lines 42 to 47 from the deploy.sh script.
- Follow the steps under Getting Started to set up Pilot Load on your cluster.
- Once Pilot Load is set up, create a config map for 400 pods using the below config file and command.
- config.yaml
nodeMetadata: {}
jitter:
  workloads: "110ms"
  config: "0s"
namespaces:
- name: foo
  replicas: 1
  applications:
  - name: foo
    replicas: 1
    instances: 400
nodes:
- name: node
  count: 5
- command
kubectl create configmap config-400-pod -n pilot-load --from-file=config.yaml --dry-run=client -o yaml | kubectl apply -f -
- In the file load-deployment.yaml, set volumes.name.configMap.name to config-400-pod and then run kubectl apply -f load-deployment.yaml. It will take 1 to 2 minutes for all the mock pods to get deployed.
- Check the logs of the Kmesh pod running on your actual Kubernetes node (Kmesh pods on mock nodes get stuck in the Pending state, which is expected since those nodes are mocked); an example command is shown after these steps. You will see the below error.
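One way to pull those Kmesh daemon logs is sketched below; it assumes Kmesh is installed in the default kmesh-system namespace, so adjust the namespace and pod name to your deployment:
# Find the Kmesh pod scheduled on the real (non-mock) node, then filter its logs.
kubectl -n kmesh-system get pods -o wide
kubectl -n kmesh-system logs <kmesh-pod-on-real-node> | grep -iE "error|malloc"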
Steps to Reproduce Error for 100 pods
- Perform steps 1-6 from the previous section, using the config.yaml below for 100 pods (step 5).
nodeMetadata: {}
jitter:
  workloads: "110ms"
  config: "0s"
namespaces:
- name: foo
  replicas: 1
  applications:
  - name: foo
    replicas: 1
    instances: 100
nodes:
- name: node
  count: 3
- Kmesh won't log any errors, but the bpf map will not contain all the pods that were deployed (in our test we only got 35). A way to inspect the map is sketched below.
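One way to count the map entries is with bpftool on the node (or inside the Kmesh pod). This is a sketch: the exact map names and output format can differ between Kmesh releases, so treat the grep pattern and the JSON counting step as assumptions to adapt.
# List the bpf maps Kmesh has loaded and note the id of the endpoint/backend map.
bpftool map show | grep -i kmesh
# Dump that map as JSON and count its entries; compare against the number of deployed pods.
bpftool -j map dump id <MAP_ID> | jq length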
Edit: Both of the above tests were replicated multiple times with the deployment order changed, i.e. deploying the mock pods first and Kmesh second. Here is what we observed:
- We still got the malloc error for 400 pods.
- We still got missing entries in the bpf map, but the number of missing entries was different in each rerun.
- The malloc error is sometimes worded differently. In most cases we get malloc(): invalid next size (unsorted), but in a few cases we also get malloc(): mismatching next->prev_size (unsorted).
@nlgwcy @hzxuzhonghu This seems like a critical bug; can you take some time to look into the root cause?
We found that it was caused by a pointer going out of bounds. The Endpoint__LocalityLbEndpoints struct exceeded the maximum value size of the inner_map, 1300 bytes, which is why this issue occurred. When we create 200 pods in a service, the pointer array needs 200 * sizeof(ptr) = 1600 bytes (with 8-byte pointers), which is larger than 1300. This problem can be avoided by manually increasing the maximum value size of the inner_map. Because of this, the current specification limits the number of pods under a single service; the total number of pods across services can be larger. If you need to create a larger cluster, you should spread the pods across more services. We will gradually optimize this in the future. For now we limit a service to a maximum of 150 pods.
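As a quick way to see whether any existing service is near that limit, endpoint counts can be listed per service. This is a sketch that assumes jq and awk are available and counts ready addresses only:
# Print every service whose Endpoints object has more than 150 addresses.
kubectl get endpoints --all-namespaces -o json \
  | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \([.subsets[]?.addresses[]?] | length)"' \
  | awk '$2 > 150 {print $1, "has", $2, "endpoints (over the 150-pod limit)"}'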