AKS BYOCNI incoming HTTP requests time out

Open munjalpatel opened this issue 6 months ago • 1 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Version

equal or higher than v1.17.4 and lower than v1.18.0

What happened?

I have set up Cilium on AKS with BYOCNI. I am using DSR with Geneve dispatch and tunnel routing. I would expect requests to https://my-domain.com to succeed; however, they currently time out.

UPDATE: I also tried the hybrid and snat modes. Neither seems to work: in both cases the Gateway's endpoints are missing and the Azure UI still reports unhealthy probes.

How can we reproduce the issue?

  1. Install Cilium on AKS with the following config:
aksbyocni:
  enabled: true
annotateK8sNode: true
authentication:
  mutual:
    spire:
      enabled: true
bandwidthManager:
  bbr: true
  enabled: true
bpf:
  distributedLRU:
    enabled: true
  hostLegacyRouting: false
  masquerade: true
  preallocateMaps: true
  tproxy: true
bpfClockProbe: true
cluster:
  id: 10
  name: my-cluster
encryption:
  enabled: true
  nodeEncryption: true
  type: wireguard
  wireguard:
    persistentKeepalive: 25s
endpointRoutes:
  enabled: true
envoy:
  enabled: true
envoyConfig:
  enabled: true
gatewayAPI:
  enableAlpn: true
  enableAppProtocol: true
  enabled: true
ipam:
  operator:
    clusterPoolIPv4PodCIDRList: 10.13.0.0/16
k8sServiceHost: my-cluster-xxxxxxxx.hcp.centralindia.azmk8s.io
k8sServicePort: "443"
kubeProxyReplacement: true
loadBalancer:
  acceleration: native
  algorithm: maglev
  dsrDispatch: geneve
  experimental: true
  l7:
    algorithm: least_request
    backend: envoy
  mode: dsr
  serviceTopology: true
localRedirectPolicy: true
maglev:
  hashSeed: xxxxxxxxxxxxxxxx
nodePort:
  enabled: true
nodeinit:
  enabled: true
operator:
  enabled: true
pmtuDiscovery:
  enabled: true
policyEnforcementMode: default
routingMode: tunnel
socketLB:
  enabled: true
tunnelProtocol: geneve
wellKnownIdentities:
  enabled: true
  2. Set up cert-manager (details omitted)

  3. Set up Gateway (status included for reference)

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  annotations:
    cert-manager.io/issuer: letsencrypt-simple
    cert-manager.io/subject-countries: US
    cert-manager.io/subject-localities: City
    cert-manager.io/subject-organizationalunits: IT
    cert-manager.io/subject-organizations: Company
    cert-manager.io/subject-provinces: State
  name: simple-gateway
  namespace: simple-system
status:
  addresses:
    - type: IPAddress
      value: 98.70.241.188
  conditions:
    - lastTransitionTime: '2025-06-11T04:11:03Z'
      message: Gateway successfully scheduled
      observedGeneration: 1
      reason: Accepted
      status: 'True'
      type: Accepted
    - lastTransitionTime: '2025-06-11T04:11:03Z'
      message: Gateway successfully reconciled
      observedGeneration: 1
      reason: Programmed
      status: 'True'
      type: Programmed
  listeners:
    - attachedRoutes: 1
      conditions:
        - lastTransitionTime: '2025-06-11T10:30:55Z'
          message: Listener Programmed
          observedGeneration: 1
          reason: Programmed
          status: 'True'
          type: Programmed
        - lastTransitionTime: '2025-06-11T10:30:55Z'
          message: Listener Accepted
          observedGeneration: 1
          reason: Accepted
          status: 'True'
          type: Accepted
        - lastTransitionTime: '2025-06-11T10:30:55Z'
          message: Resolved Refs
          reason: ResolvedRefs
          status: 'True'
          type: ResolvedRefs
      name: my-domain-com-http
      supportedKinds:
        - group: gateway.networking.k8s.io
          kind: HTTPRoute
spec:
  gatewayClassName: cilium
  listeners:
    - allowedRoutes:
        namespaces:
          from: All
      hostname: 'my-domain.com'
      name: my-domain-com-http
      port: 443
      protocol: HTTPS
      tls:
        certificateRefs:
          - group: ''
            kind: Secret
            name: my-domain-com-tls
        mode: Terminate
  4. Set up HTTPRoute (status included for reference)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
  namespace: my-namespace
status:
  parents:
    - conditions:
        - lastTransitionTime: '2025-06-11T10:30:55Z'
          message: Accepted HTTPRoute
          observedGeneration: 2
          reason: Accepted
          status: 'True'
          type: Accepted
        - lastTransitionTime: '2025-06-11T10:30:55Z'
          message: Service reference is valid
          observedGeneration: 2
          reason: ResolvedRefs
          status: 'True'
          type: ResolvedRefs
      controllerName: io.cilium/gateway-controller
      parentRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: simple-gateway
        namespace: simple-system
spec:
  hostnames:
    - my-domain.com
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: simple-gateway
      namespace: simple-system
  rules:
    - backendRefs:
        - group: ''
          kind: Service
          name: my-service
          port: 80
          weight: 1
      matches:
        - path:
            type: PathPrefix
            value: /
  5. Wait for cert-manager to provision certificates

  6. kubectl describe svc cilium-gateway-simple-gateway -n simple-system

Name:                     cilium-gateway-simple-gateway
Namespace:                simple-system
Labels:                   gateway.networking.k8s.io/gateway-name=simple-gateway
                          io.cilium.gateway/owning-gateway=simple-gateway
Annotations:              <none>
Selector:                 <none>
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.11.220.143
IPs:                      10.11.220.143
LoadBalancer Ingress:     98.70.241.188 (VIP)
Port:                     port-443  443/TCP
TargetPort:               443/TCP
NodePort:                 port-443  32600/TCP
Endpoints:                # <--- NOTICE: no endpoints
Session Affinity:         None
External Traffic Policy:  Cluster
Internal Traffic Policy:  Cluster
Events:                   <none>

Notice that the Gateway's Service does not have any registered endpoints.

Cilium Version

Client: 1.17.4 55aecc0f 2025-05-14T15:00:13+00:00 go version go1.24.3 linux/arm64
Daemon: 1.17.4 55aecc0f 2025-05-14T15:00:13+00:00 go version go1.24.3 linux/arm64

Kernel Version

Linux aks-wsx2c4mz2-23651340-vmss000002 6.6.85.1-2.azl3 #1 SMP Tue Apr 29 22:00:30 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

Kubernetes Version

Client Version: v1.32.2
Server Version: v1.32.4

Anything else?

AKS Load Balancer health checks are failing.

Cilium Users Document

  • [ ] Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

munjalpatel • Jun 11 '25 13:06

The Cilium Gateway service will never have any registered endpoints, as Cilium intercepts traffic bound for this service and sends it directly to Envoy using TPROXY rules.

We've also never tested doing Gateway API with DSR - this doesn't really make sense as the response traffic must go via the Envoy that receives the traffic in order for Envoy and Gateway API processing to work correctly. What are you trying to achieve by using DSR in this context? Does this problem occur if you disable DSR?

youngnick • Jun 16 '25 03:06
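
For anyone retrying the test asked about above (does the timeout persist with DSR disabled?), a minimal sketch of a Helm values override follows. It assumes the values from step 1 are kept unchanged and layered underneath this file; the file name `no-dsr-values.yaml` is hypothetical. The reporter's UPDATE says the snat and hybrid modes were already tried without success, so this only pins down the exact configuration to capture in a sysdump.

```yaml
# no-dsr-values.yaml (hypothetical file name)
# Overrides only the load-balancer mode from the step 1 values.
loadBalancer:
  mode: snat  # step 1 used "dsr"; the reporter's UPDATE also mentions "hybrid"
```

Passing this file after the original values file leaves every other key as it was, since Helm gives later `-f` files higher precedence and merges nested maps. Whether `loadBalancer.dsrDispatch` also has to be dropped in this mode is not stated in the thread.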

Could you provide a sysdump of (a subset of) your cluster, and ensure that it includes a sample failed flow in the Hubble logs?

I'm not sure how DSR and L7-load balancing interact. Perhaps poorly.

squeed • Jun 18 '25 13:06

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] • Aug 18 '25 02:08

This issue has not seen any activity since it was marked stale. Closing.

github-actions[bot] • Sep 02 '25 02:09