
Degraded performance (latency, connection errors) when workload is in Istio Ambient (ztunnel) and protected by GlobalNetworkPolicy

Open dkulchinsky opened this issue 4 months ago • 4 comments

Hey folks 👋🏼 I'm cross-referencing an issue I opened for Istio ztunnel (ambient proxy): https://github.com/istio/ztunnel/issues/1666

As I described in that issue, the cause of the performance problems we were observing wasn't apparent at first; we've since realized it was the combination of using ztunnel (Istio Ambient) and a Calico GlobalNetworkPolicy.

Expected Behavior

Using a GlobalNetworkPolicy should not affect the performance & reliability of traffic to the workload it selects.

Current Behavior

When the workload (NGINX Ingress Controller) is protected by a GNP, we see a high rate of connection errors and latency spikes.

Possible Solution

No root cause has been found; however, the issue is mitigated when any of the following is true:

  1. The GNP is removed
  2. The workload (ingress controller) is removed from the Ambient mesh
  3. The source workload is in a different namespace than the ingress controllers (not 100% confirmed, I'm still testing, but it seems to be the case)

Steps to Reproduce (for bugs)

The referenced ztunnel issue provides all the details about our test setup and how to reproduce a similar environment.

Locust (running in-cluster) --> AWS NLB (LoadBalancer Service) --> Ingress Controller --> echo-server

Note: Locust needs to run in the same namespace as the Ingress Controller for this issue to occur.
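For illustration only, here is a minimal sketch of an in-cluster client pod co-located with the ingress controller, which is the condition under which the issue reproduces. The pod name, image, and service hostname are assumptions (the namespace is taken from the GNP's namespaceSelector below); the actual Locust setup is described in the referenced ztunnel issue.

```yaml
# Hypothetical sketch: a simple load client in the same namespace as the
# ingress controller. The real test uses Locust (see the ztunnel issue).
apiVersion: v1
kind: Pod
metadata:
  name: load-client                 # assumed name
  namespace: platform               # from the GNP's namespaceSelector
spec:
  containers:
    - name: client
      image: curlimages/curl:8.9.1  # assumed image
      command: ["sh", "-c"]
      # Loop requests against the ingress controller service (hostname is
      # an assumption) and print status code and total request time.
      args:
        - |
          while true; do
            curl -sk -o /dev/null -w '%{http_code} %{time_total}\n' \
              https://int-ic-controller.platform.svc/
            sleep 0.1
          done
```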

Context

This is the GNP we have in place. It is designed to ensure that requests to port 8443 (the webhook port) are only allowed from the control-plane nodes, while ingress on the other listed ports, including 15008, is allowed from any source.

This GNP is used to mitigate security issues in the NGINX Ingress Controller admission webhook, and it works as expected as far as allowing/denying traffic goes.

apiVersion: crd.projectcalico.org/v1
kind: GlobalNetworkPolicy
metadata:
  name: default.int-ic-admission-webhook-restricted
spec:
  applyOnForward: true
  ingress:
  - action: Allow
    destination:
      ports:
      - 8443
    protocol: TCP
    source:
      selector: has(node-role.kubernetes.io/control-plane)
  - action: Allow
    destination:
      ports:
      - 15008
      - 4191
      - 80
      - 443
      - 10254
    protocol: TCP
  - action: Deny
  namespaceSelector: projectcalico.org/name == 'platform'
  selector: app.kubernetes.io/instance == 'int-ic'
  types:
  - Ingress

Your Environment

  • Calico version: v3.29.4

  • Calico dataplane (bpf, nftables, iptables, windows etc.): iptables

  • Orchestrator version (e.g. kubernetes, openshift, etc.): K8s v1.32.7

  • Operating System and version: Ubuntu 24.04.3 LTS, Kernel: 6.14.0-1015-aws

  • Istio/ztunnel version: 1.28.0 (same issue observed on versions 1.27.x)

  • NGINX Ingress Controller v1.11.6

  • Backend service: jmalloc/echo-server:v0.3.7

  • Locust 2.41.6

dkulchinsky avatar Nov 18 '25 14:11 dkulchinsky

Interesting, thanks for raising! Not sure off the top of my head why this would result in latency issues (and I see you have allowed ingress to port 15008 to allow the mesh traffic 👍 )

apiVersion: crd.projectcalico.org/v1

I am obligated to point at this issue: https://github.com/projectcalico/calico/issues/6412

Maybe unrelated, but worth calling out.

applyOnForward: true

Do you need this applyOnForward? This is simply selecting some Kubernetes pods and allowing ingress to certain ports from certain selectors right?

caseydavenport avatar Nov 18 '25 19:11 caseydavenport

Hey @caseydavenport 👋🏼 thank you for taking a look!

Re: apiVersion: crd.projectcalico.org/v1 and https://github.com/projectcalico/calico/issues/6412: thanks for pointing this out, I am aware of it. However, due to a long string of internal constraints, we can't deploy Calico's API server in order to use projectcalico.org/v3 directly, so we generate the policy with calicoctl and then apply the produced "backend" manifest.

Do you need this applyOnForward? This is simply selecting some Kubernetes pods and allowing ingress to certain ports from certain selectors right?

I believe this is required, the rule:

  - action: Allow
    destination:
      ports:
      - 8443
    protocol: TCP
    source:
      selector: has(node-role.kubernetes.io/control-plane)

applies to host endpoints, since it essentially allows access to port 8443 on the NGINX ingress controllers from the control-plane nodes, which run the apiservers on the host network.
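For context on what that source selector matches: it matches host endpoints carrying the node-role.kubernetes.io/control-plane label. Those are typically created automatically by Calico's auto host endpoints feature (which inherits node labels), or defined manually along these lines. The name, node, interface, and IP below are hypothetical, for illustration only.

```yaml
# Hypothetical example of a HostEndpoint that the rule's source selector
# has(node-role.kubernetes.io/control-plane) would match.
apiVersion: projectcalico.org/v3
kind: HostEndpoint
metadata:
  name: cp-node-1-eth0              # assumed name
  labels:
    node-role.kubernetes.io/control-plane: ""
spec:
  node: cp-node-1                   # assumed node name
  interfaceName: eth0               # assumed interface
  expectedIPs:
    - 10.0.0.10                     # assumed control-plane node IP
```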

dkulchinsky avatar Nov 19 '25 02:11 dkulchinsky

It allows traffic from host endpoints, but doesn't apply to host endpoints, so I think you should be OK without the applyOnForward (not that I would necessarily expect this to fix the performance issue, but it's worth cutting out any unnecessary bits to simplify the problem space).

I think you'll only need the applyOnForward if you want the policy to apply to traffic forwarded through host interfaces.
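To make the suggestion concrete, the same policy with applyOnForward dropped would look like this (a sketch only; everything else is unchanged from the policy posted above):

```yaml
apiVersion: crd.projectcalico.org/v1
kind: GlobalNetworkPolicy
metadata:
  name: default.int-ic-admission-webhook-restricted
spec:
  # applyOnForward removed: the policy then applies only to the selected
  # workload endpoints, not to traffic forwarded through host interfaces.
  ingress:
  - action: Allow
    destination:
      ports:
      - 8443
    protocol: TCP
    source:
      selector: has(node-role.kubernetes.io/control-plane)
  - action: Allow
    destination:
      ports:
      - 15008
      - 4191
      - 80
      - 443
      - 10254
    protocol: TCP
  - action: Deny
  namespaceSelector: projectcalico.org/name == 'platform'
  selector: app.kubernetes.io/instance == 'int-ic'
  types:
  - Ingress
```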

caseydavenport avatar Nov 19 '25 16:11 caseydavenport

Thanks @caseydavenport, and apologies for the delay in responding. I'll try without applyOnForward and report back.

dkulchinsky avatar Nov 25 '25 14:11 dkulchinsky