AKS BYOCNI incoming HTTP requests time out
### Is there an existing issue for this?

- [x] I have searched the existing issues
### Version

equal or higher than v1.17.4 and lower than v1.18.0
### What happened?

I have set up Cilium on AKS with BYOCNI, using DSR mode with Geneve dispatch and tunnel routing. I would expect requests to https://my-domain.com to succeed; however, they currently time out.

UPDATE:
I also tried the hybrid and snat load-balancer modes. Neither works: in both cases the Gateway service's endpoints are missing and the Azure UI still reports unhealthy health probes.
### How can we reproduce the issue?

- Install Cilium on AKS with the following config:

```yaml
aksbyocni:
  enabled: true
annotateK8sNode: true
authentication:
  mutual:
    spire:
      enabled: true
bandwidthManager:
  bbr: true
  enabled: true
bpf:
  distributedLRU:
    enabled: true
  hostLegacyRouting: false
  masquerade: true
  preallocateMaps: true
  tproxy: true
bpfClockProbe: true
cluster:
  id: 10
  name: my-cluster
encryption:
  enabled: true
  nodeEncryption: true
  type: wireguard
  wireguard:
    persistentKeepalive: 25s
endpointRoutes:
  enabled: true
envoy:
  enabled: true
envoyConfig:
  enabled: true
gatewayAPI:
  enableAlpn: true
  enableAppProtocol: true
  enabled: true
ipam:
  operator:
    clusterPoolIPv4PodCIDRList: 10.13.0.0/16
k8sServiceHost: my-cluster-xxxxxxxx.hcp.centralindia.azmk8s.io
k8sServicePort: "443"
kubeProxyReplacement: true
loadBalancer:
  acceleration: native
  algorithm: maglev
  dsrDispatch: geneve
  experimental: true
  l7:
    algorithm: least_request
    backend: envoy
  mode: dsr
  serviceTopology: true
localRedirectPolicy: true
maglev:
  hashSeed: xxxxxxxxxxxxxxxx
nodePort:
  enabled: true
nodeinit:
  enabled: true
operator:
  enabled: true
pmtuDiscovery:
  enabled: true
policyEnforcementMode: default
routingMode: tunnel
socketLB:
  enabled: true
tunnelProtocol: geneve
wellKnownIdentities:
  enabled: true
```
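For reference, with the values above saved as `values.yaml`, the install is the standard Helm flow (a sketch; the repo URL, release name, and `kube-system` namespace are the usual defaults, not taken from this report):

```shell
# Add the Cilium Helm repo and install with the values above.
# The chart version is pinned to match the reported Cilium version.
helm repo add cilium https://helm.cilium.io/
helm repo update
helm install cilium cilium/cilium \
  --version 1.17.4 \
  --namespace kube-system \
  -f values.yaml
```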
- Set up cert-manager (details omitted)
- Set up the Gateway (status included for reference):
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  annotations:
    cert-manager.io/issuer: letsencrypt-simple
    cert-manager.io/subject-countries: US
    cert-manager.io/subject-localities: City
    cert-manager.io/subject-organizationalunits: IT
    cert-manager.io/subject-organizations: Company
    cert-manager.io/subject-provinces: State
  name: simple-gateway
  namespace: simple-system
status:
  addresses:
  - type: IPAddress
    value: 98.70.241.188
  conditions:
  - lastTransitionTime: '2025-06-11T04:11:03Z'
    message: Gateway successfully scheduled
    observedGeneration: 1
    reason: Accepted
    status: 'True'
    type: Accepted
  - lastTransitionTime: '2025-06-11T04:11:03Z'
    message: Gateway successfully reconciled
    observedGeneration: 1
    reason: Programmed
    status: 'True'
    type: Programmed
  listeners:
  - attachedRoutes: 1
    conditions:
    - lastTransitionTime: '2025-06-11T10:30:55Z'
      message: Listener Programmed
      observedGeneration: 1
      reason: Programmed
      status: 'True'
      type: Programmed
    - lastTransitionTime: '2025-06-11T10:30:55Z'
      message: Listener Accepted
      observedGeneration: 1
      reason: Accepted
      status: 'True'
      type: Accepted
    - lastTransitionTime: '2025-06-11T10:30:55Z'
      message: Resolved Refs
      reason: ResolvedRefs
      status: 'True'
      type: ResolvedRefs
    name: my-domain-com-http
    supportedKinds:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
spec:
  gatewayClassName: cilium
  listeners:
  - allowedRoutes:
      namespaces:
        from: All
    hostname: 'my-domain.com'
    name: my-domain-com-http
    port: 443
    protocol: HTTPS
    tls:
      certificateRefs:
      - group: ''
        kind: Secret
        name: my-domain-com-tls
      mode: Terminate
```
- Set up the HTTPRoute (status included for reference):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
  namespace: my-namespace
status:
  parents:
  - conditions:
    - lastTransitionTime: '2025-06-11T10:30:55Z'
      message: Accepted HTTPRoute
      observedGeneration: 2
      reason: Accepted
      status: 'True'
      type: Accepted
    - lastTransitionTime: '2025-06-11T10:30:55Z'
      message: Service reference is valid
      observedGeneration: 2
      reason: ResolvedRefs
      status: 'True'
      type: ResolvedRefs
    controllerName: io.cilium/gateway-controller
    parentRef:
      group: gateway.networking.k8s.io
      kind: Gateway
      name: simple-gateway
      namespace: simple-system
spec:
  hostnames:
  - my-domain.com
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: simple-gateway
    namespace: simple-system
  rules:
  - backendRefs:
    - group: ''
      kind: Service
      name: my-service
      port: 80
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
```
- Wait for cert-manager to provision the certificates
- Inspect the LoadBalancer service created for the Gateway:

```
$ kubectl describe svc cilium-gateway-simple-gateway -n simple-system
Name:                     cilium-gateway-simple-gateway
Namespace:                simple-system
Labels:                   gateway.networking.k8s.io/gateway-name=simple-gateway
                          io.cilium.gateway/owning-gateway=simple-gateway
Annotations:              <none>
Selector:                 <none>
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.11.220.143
IPs:                      10.11.220.143
LoadBalancer Ingress:     98.70.241.188 (VIP)
Port:                     port-443  443/TCP
TargetPort:               443/TCP
NodePort:                 port-443  32600/TCP
Endpoints:                # <--- NOTICE: no endpoints
Session Affinity:         None
External Traffic Policy:  Cluster
Internal Traffic Policy:  Cluster
Events:                   <none>
```
Notice that the Gateway's LoadBalancer service does not have any registered endpoints.
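To reproduce the timeout without depending on DNS, the request can be pinned to the Gateway's external IP taken from the status above (a sketch; `--resolve` and `--max-time` are standard curl flags):

```shell
# Pin my-domain.com to the Gateway's external IP and cap the wait,
# so the failure surfaces as a curl timeout instead of hanging.
curl -v --max-time 10 \
  --resolve my-domain.com:443:98.70.241.188 \
  https://my-domain.com/
```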
### Cilium Version

```
Client: 1.17.4 55aecc0f 2025-05-14T15:00:13+00:00 go version go1.24.3 linux/arm64
Daemon: 1.17.4 55aecc0f 2025-05-14T15:00:13+00:00 go version go1.24.3 linux/arm64
```
### Kernel Version

```
Linux aks-wsx2c4mz2-23651340-vmss000002 6.6.85.1-2.azl3 #1 SMP Tue Apr 29 22:00:30 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
```
### Kubernetes Version

```
Client Version: v1.32.2
Server Version: v1.32.4
```
### Anything else?

AKS Load Balancer health checks are failing.
### Cilium Users Document

- [ ] Are you a user of Cilium? Please add yourself to the Users doc
### Code of Conduct

- [x] I agree to follow this project's Code of Conduct
---

The Cilium Gateway service will never have any registered endpoints, as Cilium intercepts traffic bound for this service and sends it directly to Envoy using TPROXY rules.
We've also never tested Gateway API with DSR - it doesn't really make sense, as the response traffic must pass back through the Envoy instance that received the request for Envoy and Gateway API processing to work correctly. What are you trying to achieve by using DSR in this context? Does the problem occur if you disable DSR?
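For reference, reverting just the L4 load balancer from DSR could look like this (a sketch assuming the chart was installed as release `cilium` in `kube-system`; `loadBalancer.mode` is the Helm value shown in the config above):

```shell
# Switch the load balancer back to the default SNAT mode,
# keeping the rest of the installed values unchanged.
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set loadBalancer.mode=snat
# Restart the agents so the new datapath config takes effect.
kubectl -n kube-system rollout restart ds/cilium
```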
Could you provide a sysdump of (a subset of) your cluster, and ensure that it includes a sample failed flow in the Hubble logs?
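Collecting that could look like the following, assuming the Cilium and Hubble CLIs are installed (restricting the dump to the affected node via `--node-list` is an assumption to keep the archive small):

```shell
# Capture a sysdump limited to the node involved in the failing flow.
cilium sysdump --node-list aks-wsx2c4mz2-23651340-vmss000002

# While reproducing the timeout, record dropped flows to include in the report.
hubble observe --verdict DROPPED -f
```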
I'm not sure how DSR and L7 load balancing interact. Perhaps poorly.
---

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

This issue has not seen any activity since it was marked stale. Closing.