Degraded performance (latency, connection errors) when workload is in Istio Ambient (ztunnel) and protected by GlobalNetworkPolicy
Hey folks 👋🏼 I'm cross-referencing an issue I opened for Istio ztunnel (ambient proxy): https://github.com/istio/ztunnel/issues/1666
As I described in that issue, it wasn't initially apparent what was causing the performance issues we were observing. We've since realized it was the combination of ztunnel (Istio Ambient) and a Calico GlobalNetworkPolicy.
Expected Behavior
Using a GlobalNetworkPolicy should not degrade the performance or reliability of traffic to the workload it selects.
Current Behavior
When the workload (NGINX Ingress Controller) is protected by a GNP, we are seeing a high rate of connection errors and latency spikes.
Possible Solution
No solution has been found yet; however, the issue is mitigated when any of the following applies:
- The GNP is removed
- The workload (ingress controller) is removed from the Ambient mesh
- The source workload is in a different namespace than the ingress controllers (not 100% confirmed, I'm still testing, but it seems to be the case)
Steps to Reproduce (for bugs)
The referenced ztunnel issue provides all the details about our test setup and how to reproduce a similar environment.
Locust (running in-cluster) --> AWS NLB (LoadBalancer Service) --> Ingress Controller --> echo-server
Note: Locust needs to run in the same namespace as the Ingress Controller for this issue to occur.
Context
This is the GNP we have in place. It is designed to ensure that requests to port 8443 (the webhook port) are only allowed from control-plane nodes, while allowing ingress on all other ports, including 15008.
This GNP is used to mitigate security issues in the NGINX Ingress Controller admission webhook; as far as allowing/denying traffic goes, the policy works as expected.
```yaml
apiVersion: crd.projectcalico.org/v1
kind: GlobalNetworkPolicy
metadata:
  name: default.int-ic-admission-webhook-restricted
spec:
  applyOnForward: true
  ingress:
    - action: Allow
      destination:
        ports:
          - 8443
      protocol: TCP
      source:
        selector: has(node-role.kubernetes.io/control-plane)
    - action: Allow
      destination:
        ports:
          - 15008
          - 4191
          - 80
          - 443
          - 10254
      protocol: TCP
    - action: Deny
  namespaceSelector: projectcalico.org/name == 'platform'
  selector: app.kubernetes.io/instance == 'int-ic'
  types:
    - Ingress
```
Your Environment
- Calico version: v3.29.4
- Calico dataplane (bpf, nftables, iptables, windows etc.): iptables
- Orchestrator version (e.g. kubernetes, openshift, etc.): K8s v1.32.7
- Operating System and version: Ubuntu 24.04.3 LTS, Kernel: 6.14.0-1015-aws
- Istio/ztunnel version: 1.28.0 (same issue observed on 1.27.x versions)
- NGINX Ingress Controller: v1.11.6
- Backend service: jmalloc/echo-server:v0.3.7
- Locust: 2.41.6
Interesting, thanks for raising! Not sure off the top of my head why this would result in latency issues (and I see you have allowed ingress to port 15008 to allow the mesh traffic 👍 )
> `apiVersion: crd.projectcalico.org/v1`
I am obligated to point at this issue: https://github.com/projectcalico/calico/issues/6412
Maybe unrelated, but worth calling out.
> `applyOnForward: true`
Do you need this applyOnForward? This is simply selecting some Kubernetes pods and allowing ingress to certain ports from certain selectors right?
Hey @caseydavenport 👋🏼 thank you for taking a look!
Re: `apiVersion: crd.projectcalico.org/v1` and https://github.com/projectcalico/calico/issues/6412: thanks for pointing this out, I am aware of it. However, due to a long string of internal constraints, we can't deploy Calico's API server in order to use projectcalico.org/v3 directly, so we generate the policy with calicoctl and then apply the produced "backend" manifest.
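Roughly, the workflow looks like this (a sketch: the file names are placeholders, and it assumes the Kubernetes datastore, where calicoctl stores policies as `crd.projectcalico.org/v1` objects):

```shell
# Author the policy against the v3 API and apply it with calicoctl,
# which writes it to the datastore in its backend CRD form.
calicoctl apply -f gnp-v3.yaml

# Read back the stored crd.projectcalico.org/v1 object with kubectl;
# this is the "backend" manifest that can later be applied directly,
# without needing the Calico API server.
kubectl get globalnetworkpolicies.crd.projectcalico.org \
  default.int-ic-admission-webhook-restricted -o yaml > gnp-backend.yaml
```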
> Do you need this applyOnForward? This is simply selecting some Kubernetes pods and allowing ingress to certain ports from certain selectors right?
I believe this is required; the rule:
```yaml
- action: Allow
  destination:
    ports:
      - 8443
  protocol: TCP
  source:
    selector: has(node-role.kubernetes.io/control-plane)
```
applies to host endpoints, since it essentially allows access to port 8443 on the NGINX ingress controllers from the control-plane nodes, which run the API servers on the host network.
It allows traffic *from* host endpoints, but doesn't apply *to* host endpoints, so I think you should be OK without the applyOnForward (not that I would necessarily expect this to fix the performance issue, but it's worth cutting out any unnecessary bits to simplify the problem space a bit).
I think you'll only need the applyOnForward if you want the policy to apply to traffic forwarded through host interfaces.
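In other words, the change to test would just be dropping that one field (an illustrative diff against the policy above; the rules themselves stay unchanged):

```diff
 spec:
-  applyOnForward: true
   ingress:
     - action: Allow
```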
Thanks @caseydavenport, and apologies for the delay in responding. I'll try without applyOnForward and report back.