Review ingress controller count
Background
We currently have our ingress controller replica count set to 12 for both the default and modsec flavours of our deployments. We should review this count and consider whether we ought to scale back down.
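For context, the replica count is set through the ingress-nginx Helm chart values. A minimal sketch of the relevant keys is below; the exact values files and flavour-specific overrides in our own config may differ, and the autoscaling figures are illustrative placeholders rather than a proposal.

```yaml
# Illustrative ingress-nginx Helm values (keys as in the upstream chart;
# our own values files and flavour overrides may differ).
controller:
  replicaCount: 12            # current fixed count for both default and modsec flavours
  # The chart also ships an optional HPA, which is one route to scaling back
  # without hard-coding a lower number. The figures below are placeholders.
  autoscaling:
    enabled: false
    minReplicas: 6
    maxReplicas: 12
    targetCPUUtilizationPercentage: 75
```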
Proposed user journey
Approach
Which part of the user docs does this impact
Communicate changes
- [ ] post for #cloud-platform-update
- [ ] Weeknotes item
- [ ] Show the Thing/P&A All Hands/User CoP
- [ ] Announcements channel
Questions / Assumptions
Definition of done
- [ ] readme has been updated
- [ ] user docs have been updated
- [ ] another team member has reviewed
- [ ] smoke tests are green
- [ ] prepare demo for the team
Reference
https://www.nginx.com/blog/microservices-march-reduce-kubernetes-latency-with-autoscaling/
We want to scale on ~100 active connections per ingress controller pod. This is based on load testing that shows the controller starts to drop requests when we exceed this number.
https://nginx.org/en/docs/http/ngx_http_limit_conn_module.html
https://github.com/kubernetes/ingress-nginx/issues/10032
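As a sketch of what scaling on that signal could look like, the HPA below targets an average of ~100 active connections per pod. It assumes the nginx connection metric is exposed to the HPA through a metrics adapter such as prometheus-adapter; the namespace, deployment name and replica bounds are assumptions rather than our current config.

```yaml
# Hypothetical HorizontalPodAutoscaler targeting ~100 active connections per pod.
# Assumes a metrics adapter (e.g. prometheus-adapter) exposes the nginx connection
# metric as a Pods metric; names and values are illustrative only.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress-default
  namespace: ingress-controllers             # assumed namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-default-controller   # assumed deployment name
  minReplicas: 6                              # placeholder bounds
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: nginx_ingress_controller_nginx_process_connections
        target:
          type: AverageValue
          averageValue: "100"                 # ~100 active connections per pod
```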
From the tests on concurrent connections handled per ingress controller pod: as the number of concurrent connections increases, the percentage of timeouts increases, and when the number of replicas is increased, the percentage of timeouts drops.
From Prometheus, it can be seen that the concurrent connections are shared across replicas when more than one replica is present.
With more than 1800 ingresses in live, there are roughly 1800 servers/apps behind the controller handling requests. The nginx controller has a limit of 100 concurrent connections per server, as described in https://nginx.org/en/docs/http/ngx_http_limit_conn_module.html and referenced in this issue: https://github.com/kubernetes/ingress-nginx/issues/10032
There is also a maximum of 10000 keep-alive requests per connection, and not all connections have keep-alive set. The relevant nginx settings in the controller configuration are:
keepalive 320;
keepalive_time 1h;
keepalive_timeout 60s;
keepalive_requests 10000;
http2_max_concurrent_streams 128;
Currently the default ingress controller processes around 8k connections at a time, with each pod handling around 200-300 connections.
The number of connections also depends on the worker node CPU and how many concurrent processes the instance can open, so there could be restrictions arising from how we place the controller pods on the default worker nodes.
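If placement does turn out to matter, one option is to spread the controller replicas evenly across worker nodes with a topology spread constraint. The fragment below is a sketch of a pod-spec addition for the controller Deployment; the label selector is an assumption and would need to match the controller's actual labels.

```yaml
# Hypothetical pod spec fragment for the controller Deployment, spreading replicas
# across nodes so that no single worker node's CPU limits connection handling.
# The label selector is an assumption; match it to the controller's real labels.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
```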
A recent investigation into the percentage of successful requests has shown a better success rate: https://mojdt.slack.com/archives/C514ETYJX/p1709907294599289?thread_ts=1709563385.828399&cid=C514ETYJX
From all the notes and investigations done, there is no single metric that decides the ideal number of ingress controller replicas, but there is a relationship between how many connections an ingress controller pod can handle and the timeout rate. With that, we can keep the number of replicas at 30 for now, but this needs reviewing again as the number of ingresses and the traffic increase in the future.
The metric used to track active connections per controller pod is:
nginx_ingress_controller_nginx_process_connections{controller_class="k8s.io/ingress-default", state="active"}
TODO: add an info alert when the number of process connections reaches a certain limit
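A rough sketch of what that alert could look like as a PrometheusRule, assuming the Prometheus Operator is in use; the threshold, rule names, namespace and severity label are placeholders to be agreed.

```yaml
# Hypothetical info-level alert when active connections per default ingress
# controller pod stay above a chosen threshold. Threshold and names are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ingress-controller-connections
  namespace: monitoring                       # assumed namespace
spec:
  groups:
    - name: ingress-controller.connections
      rules:
        - alert: IngressControllerHighActiveConnections
          expr: |
            avg by (pod) (
              nginx_ingress_controller_nginx_process_connections{controller_class="k8s.io/ingress-default", state="active"}
            ) > 100
          for: 10m
          labels:
            severity: info
          annotations:
            summary: "Ingress controller pod {{ $labels.pod }} has had over 100 active connections for 10 minutes"
```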