Review ingress controller count
Background
We currently have our ingress controller replica count set to 12 for both the default and modsec flavours of our deployments. We should review this count and consider whether we ought to scale back down.
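For context, the replica count is set through the ingress-nginx Helm chart values. A minimal sketch of the relevant keys is below; the exact values files and flavour-specific overrides in our own config may differ, and the autoscaling figures are illustrative placeholders rather than a proposal.

```yaml
# Illustrative ingress-nginx Helm values (keys as in the upstream chart;
# our own values files and flavour overrides may differ).
controller:
  replicaCount: 12            # current fixed count for both default and modsec flavours
  # The chart also ships an optional HPA, which is one route to scaling back
  # without hard-coding a lower number. The figures below are placeholders.
  autoscaling:
    enabled: false
    minReplicas: 6
    maxReplicas: 12
    targetCPUUtilizationPercentage: 75
```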
Proposed user journey
Approach
Which part of the user docs does this impact
Communicate changes
- [ ] post for #cloud-platform-update
- [ ] Weeknotes item
- [ ] Show the Thing/P&A All Hands/User CoP
- [ ] Announcements channel
Questions / Assumptions
Definition of done
- [ ] readme has been updated
- [ ] user docs have been updated
- [ ] another team member has reviewed
- [ ] smoke tests are green
- [ ] prepare demo for the team
Reference
https://www.nginx.com/blog/microservices-march-reduce-kubernetes-latency-with-autoscaling/
We want to scale on ~100 active connections per ingress controller pod. This is based on load testing that shows the controller starts to drop requests when we exceed this number.
https://nginx.org/en/docs/http/ngx_http_limit_conn_module.html
https://github.com/kubernetes/ingress-nginx/issues/10032
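As a sketch of what scaling on that signal could look like, the HPA below targets an average of ~100 active connections per pod. It assumes the nginx connection metric is exposed to the HPA through a metrics adapter such as prometheus-adapter; the namespace, deployment name and replica bounds are assumptions rather than our current config.

```yaml
# Hypothetical HorizontalPodAutoscaler targeting ~100 active connections per pod.
# Assumes a metrics adapter (e.g. prometheus-adapter) exposes the nginx connection
# metric as a Pods metric; names and values are illustrative only.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress-default
  namespace: ingress-controllers             # assumed namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-default-controller   # assumed deployment name
  minReplicas: 6                              # placeholder bounds
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: nginx_ingress_controller_nginx_process_connections
        target:
          type: AverageValue
          averageValue: "100"                 # ~100 active connections per pod
```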
From the tests on concurrent connections handled per ingress controller pod: as the number of concurrent connections increases, the percentage of timeouts increases, and when the number of replicas is increased, the percentage of timeouts drops.
From Prometheus, it can be seen that the concurrent connections are shared across replicas when more than one replica is present.
With more than 1800 ingresses in live, there are roughly 1800 servers/apps behind the controller handling requests. The nginx controller has a limit of 100 concurrent connections per server, as described in https://nginx.org/en/docs/http/ngx_http_limit_conn_module.html and referenced in this issue: https://github.com/kubernetes/ingress-nginx/issues/10032
There is also a maximum of 10000 keep-alive requests per connection, and not all connections have keep-alive set. The relevant nginx settings in the controller configuration are:
keepalive 320;
keepalive_time 1h;
keepalive_timeout 60s;
keepalive_requests 10000;
http2_max_concurrent_streams 128;
Currently the default ingress controller processes around 8k connections at a time, with each pod handling around 200-300 connections.
The number of connections also depends on the worker node CPU and how many concurrent processes the instance can open, so there could be restrictions arising from how we place the controller pods on the default worker nodes.
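If placement does turn out to matter, one option is to spread the controller replicas evenly across worker nodes with a topology spread constraint. The fragment below is a sketch of a pod-spec addition for the controller Deployment; the label selector is an assumption and would need to match the controller's actual labels.

```yaml
# Hypothetical pod spec fragment for the controller Deployment, spreading replicas
# across nodes so that no single worker node's CPU limits connection handling.
# The label selector is an assumption; match it to the controller's real labels.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
```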
A recent investigation into the percentage of successful requests has shown a better success rate: https://mojdt.slack.com/archives/C514ETYJX/p1709907294599289?thread_ts=1709563385.828399&cid=C514ETYJX
From all the notes and investigations done, there is no single metric that decides the ideal number of ingress controller replicas, but there is a relationship between how many connections an ingress controller pod can handle and the timeout rate. With that, we can keep the number of replicas at 30 for now, but this needs reviewing again as the number of ingresses and the traffic increase in the future.
The metric used to track active connections per controller pod is:
nginx_ingress_controller_nginx_process_connections{controller_class="k8s.io/ingress-default", state="active"}
TODO: add an info alert when the number of process connections reaches a certain limit
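A rough sketch of what that alert could look like as a PrometheusRule, assuming the Prometheus Operator is in use; the threshold, rule names, namespace and severity label are placeholders to be agreed.

```yaml
# Hypothetical info-level alert when active connections per default ingress
# controller pod stay above a chosen threshold. Threshold and names are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ingress-controller-connections
  namespace: monitoring                       # assumed namespace
spec:
  groups:
    - name: ingress-controller.connections
      rules:
        - alert: IngressControllerHighActiveConnections
          expr: |
            avg by (pod) (
              nginx_ingress_controller_nginx_process_connections{controller_class="k8s.io/ingress-default", state="active"}
            ) > 100
          for: 10m
          labels:
            severity: info
          annotations:
            summary: "Ingress controller pod {{ $labels.pod }} has had over 100 active connections for 10 minutes"
```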