
Consent check failures during graceful shutdown

Open · surkov-sesame opened this issue 5 months ago • 10 comments

Description

It seems that my aiortc clients get disconnected from Stunner prematurely due to failed consent checks during a graceful shutdown period.

Steps to Reproduce

  • I have an aiortc server using Stunner as TURN server in headless mode
  • Stunner is running on GKE
  • when GKE starts a deploy, Stunner enters graceful shutdown as expected; however, aiortc consent checks begin failing and aiortc disconnects with "Consent to send expired" well before the Stunner pod exits

Timing in Stunner logs:

Jul 25 12:30:03.087
19:30:03.014008 main.go:171: stunnerd INFO: Normal exit on graceful shutdown
Jul 25 12:19:42.696
19:19:41.727628 main.go:180: stunnerd INFO: Canceling config loader
Jul 25 12:19:42.696
19:19:41.727574 main.go:175: stunnerd INFO: Commencing graceful shutdown with 26 active connection(s)

And aiortc consent checks fail at around 12:27, so ~3 minutes before stunnerd actually exits.

Expected behavior:

  • STUNner pod keeps running for 1h or until the call ends

Actual behavior:

  • aiortc client disconnects due to failed consent checks

Versions

Stunner 1.1.0

Info

Gateway API status

get.txt

Operator logs

N/A

surkov-sesame · Jul 25 '25 23:07

Can you please explain what an aiortc "consent check" is and what kinds of services this check expects from STUNner? If this involves sending a couple of packets back and forth over an existing TURN allocation then this should indeed work fine, but if it requires opening a new TURN allocation then you need another functional STUNner instance to handle that (is there any?). And if this has nothing to do with TURN or ICE then I'm afraid the problem is elsewhere (maybe STUNner implements graceful shutdown just right but your aiortc backend server does not?).

Anyway, please provide some more info. If you're sure it's STUNner, please also send a detailed stunnerd log (at level all:TRACE) for when the "consent check" fails.

rg0now · Jul 28 '25 09:07

Consent checks in aiortc implement https://www.rfc-editor.org/rfc/rfc7675.html, i.e., periodic STUN Binding Requests. It's basically a ping between the aiortc/aioice WebRTC client and the TURN server. If consent checks fail after multiple retries (36 total attempts in our case), aiortc/aioice declare the connection broken and disconnect.
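For illustration, a consent check is nothing more than a STUN Binding Request that expects a Binding Success Response within a timeout. A rough way to see one on the wire (this is not how aiortc sends them -- real consent checks travel over the established ICE candidate pair inside the TURN session -- and the address is a placeholder):

    # 20-byte STUN Binding Request: type 0x0001, length 0, magic cookie 0x2112A442,
    # 12-byte transaction ID. A STUN/TURN listener answers with a Binding Success
    # Response (type 0x0101). Depending on your netcat flavor you may need -q 2.
    printf '\x00\x01\x00\x00\x21\x12\xa4\x42\xde\xad\xbe\xef\xca\xfe\xba\xbe\x00\x01\x02\x03' \
      | nc -u -w 2 <TURN-IP> 3478 | xxd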

Attaching a full stunnerd log. stunnerd_logs.csv

The shutdown starts here

2025-07-25T19:19:42.696Z,19:19:41.727574 main.go:175: stunnerd INFO: Commencing graceful shutdown with 26 active connection(s)

and ends here

2025-07-25T19:30:03.087Z,19:30:03.014008 main.go:171: stunnerd INFO: Normal exit on graceful shutdown

and the "consent expiration" on the client happens at 12:27.

A few considerations:

  • our aiortc client failed at ~12:27, but it had multiple retries with exponential backoffs and some jitter. Assuming all those retries failed, I'd guess they started failing ~215s earlier than that, which puts it at ~12:23:30 for when all consent checks started failing
  • I see a lot of "cds-client ERROR: failed to init CDS watcher" messages, which started around the time the pod entered graceful shutdown mode. Could that be related?

surkov-sesame · Jul 28 '25 16:07

Some updates after more testing.

It seems that this graceful shutdown problem might not be aiortc/aioice-specific or even related to consent checks per se -- it might be a general graceful shutdown problem with Stunner in our GKE setup. What I learned after more testing is that Chrome also gets disconnected from a WebRTC call shortly after Stunner enters graceful shutdown.

Here is another stunnerd log attached: extract-2025-07-29T19_32_10.663Z.csv

The timing there is:

  • entering graceful shutdown at 19:28:36.809Z
  • Chrome's ICE connection state goes to disconnected at 19:29:23Z and the call ends
  • the stunnerd pod keeps running on GKE in Terminating state for many more minutes

From chrome://webrtc-internals (timestamps there are in PST): chrome-webrtc-internals.txt

So then the easiest way to reproduce is

  • run Stunner on GKE
  • start a TURN webrtc call with Chrome
  • initiate a Stunner restart (see the command sketch after this list)
  • observe that Chrome gets disconnected from the call
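For the restart in step 3, something like the following should do (a sketch; the Deployment name is an assumption -- use whatever Deployment the operator created for your Gateway, e.g. stunner-gateway-udp in our case):

    # Roll the dataplane Deployment: a new pod comes up and the old one goes into
    # Terminating, i.e. graceful shutdown.
    kubectl -n stunner rollout restart deployment/stunner-gateway-udp
    kubectl -n stunner get pods -o wide -w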

surkov-sesame · Jul 29 '25 19:07

As per the dial tcp 10.101.10.186:13478: operation was canceled log messages: that's a spurious log that we shouldn't emit at all; we'll fix that ASAP.

As per the rest: here are the last couple of lines before the allocation goes away.

"2025-07-29T19:29:00.827Z" handlers.go:98: auth INFO: permission denied on listener ""stunner/stunner-gateway-udp/turn-udp"" for client ""10.100.0.34:57138"" to peer 169.254.8.1: no route to endpoint"
"2025-07-29T19:29:00.827Z" turn.go:407: turn INFO: permission denied for client 10.100.0.34:57138 to peer 169.254.8.1 (suppressed 4 log events)"
"2025-07-29T19:29:00.827Z" handlers.go:233: stunner DEBUG: Allocation error: client=10.100.0.34:57138-[::]:3478:UDP, error=failed to handle ChannelBind-request from 10.100.0.34:57138: permission request administratively prohibited"
...
"2025-07-29T19:29:01.828Z" handlers.go:236: stunner DEBUG: Permission created: client=10.100.0.34:57138-[::]:3478:UDP, username=1753820940:X7uvm1SGUeONjRFhnsDuyaEPxi92, realm=stunner.l7mp.io, relay-addr=10.101.27.252:52807, peer=10.101.11.244"
"2025-07-29T19:29:01.828Z" handlers.go:261: stunner DEBUG: Channel created: listener=stunner/stunner-gateway-udp/turn-udp, cluster=stunner/stunner-headless, client=10.100.0.34:57138-[::]:3478:UDP, username=1753820940:X7uvm1SGUeONjRFhnsDuyaEPxi92, realm=stunner.l7mp.io, relay-addr=10.101.27.252:52807, peer=10.101.11.244:56339, channel-num=16386"
...
"2025-07-29T19:29:01.828Z" handlers.go:240: stunner DEBUG: Permission deleted: client=10.100.0.35:42430-[::]:3478:UDP, username=1753820865:mmTLzQtD0bgYxYJ1chLuqePwfTD3, realm=stunner.l7mp.io, relay-addr=10.101.27.252:52663, peer=10.101.11.244"
"2025-07-29T19:29:01.828Z" handlers.go:270: stunner DEBUG: Channel deleted: client=10.100.0.35:42430-[::]:3478:UDP, username=1753820865:mmTLzQtD0bgYxYJ1chLuqePwfTD3, realm=stunner.l7mp.io, relay-addr=10.101.27.252:52663, peer=10.101.11.244:40757, channel-num=16386"
"2025-07-29T19:29:01.828Z" handlers.go:228: stunner DEBUG: Allocation deleted: client=10.100.0.35:42430-[::]:3478:UDP, username=1753820865:mmTLzQtD0bgYxYJ1chLuqePwfTD3, realm=stunner.l7mp.io"

I can't see anything extraordinary in the logs, apart from the permission denied errors. To understand why this happens, we need to know who the client 10.100.0.34 is and why it is trying to reach the peer 169.254.8.1 (which it should not) and later 10.101.11.244 (which is OK, so the permission is granted). The last couple of lines are a completely normal shutdown sequence initiated by the client 10.100.0.35. Is this your Chrome client? Earlier there are permission denied errors from this client to peer 169.254.9.1; who could that be?

Note that permission denied errors are in many cases completely normal (e.g., when the media server is configured with STUNner as the TURN server, which is best to avoid in Kubernetes), so this may or may not be related to your issue. In order to learn more, please elevate the stunnerd log level to all:DEBUG or all:TRACE and redo the logs. An even bigger help would be a Wireshark/tcpdump UDP packet trace from the client side that we can correlate with the STUNner-side logs.
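On the client side, a capture along these lines is enough (a sketch; adjust the port if your Gateway listener is not on the default 3478):

    # Record the client's TURN/UDP traffic into a pcap that can be opened in Wireshark
    # and correlated with the stunnerd-side logs.
    sudo tcpdump -ni any -w client-turn.pcap udp port 3478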

We'll also try to look into this. Graceful shutdown is a critical feature and it should work fine.

rg0now · Jul 30 '25 10:07

I've managed to reproduce this with the simple tunnel test. Here is what I've found out so far.

The symptom is that the terminating stunnerd pod stops receiving packets roughly 30 secs after entering graceful shutdown. It is not that it fails to process UDP packets after 30 secs; it does not even receive the sent UDP packets any more (confirmed with kubectl debug ... -- tcpdump ...). I can confirm that the problem is not in STUNner: something upstream (Kubernetes or a middlebox, i.e., a LB, NAT, or kube-proxy between the client and the stunnerd pod) stops forwarding UDP packets roughly half a minute after we enter graceful shutdown.
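For reference, the in-pod capture can be taken with an ephemeral debug container along these lines (a sketch; the debug image is an arbitrary choice and the pod name is a placeholder):

    # Attach a debug container to the terminating stunnerd pod (it shares the pod's
    # network namespace) and watch for incoming TURN/UDP packets; when the capture
    # goes silent, packets are no longer reaching the pod at all.
    kubectl -n stunner debug -it udp-gateway-xxx-yy --image=nicolaka/netshoot -- \
      tcpdump -ni any udp port 3478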

The problem reproducibly appears with UDP on Azure/AKS (and by your report, GKE) but not on Minikube, and it does not reproduce with TCP at all.

Here is how you can easily reproduce the issue.

  1. Deploy the simple tunnel test and set up the turncat proxy for TURN/UDP as described in the instructions.

  2. Elevate the dataplane loglevel to all:DEBUG: stunnerd will emit 3 log messages for every packet it receives, so this is perfect for spotting the point at which packets stop arriving.

  3. Start the local iperf client but make sure that it never stops (use -t 0). I use the below command line:

    iperf -c localhost -p 5000 -u -i 1 -l 100 -b 800 -t 0
    
  4. Wait until things stabilize and then kill the stunnerd pod (this will be called stunner/stunnerd-xxx-yy; see the command sketch after this list). You should see kubectl -n stunner get pods -o wide reporting that the pod status has become Terminating:

    stunner  pod/udp-gateway-xxx-yy    0/1 Terminating   0  6m33s
    
  5. Switch to another terminal and watch the logs from the killed pod: you should see the logs for the received TURN ChannelData messages for roughly 30 secs after the graceful shutdown was initiated and then the logs stop, indicating that TURN packets no longer reach the pod.

    kubectl -n stunner logs udp-gateway-xxx-yy -f
    
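For step 4, the kill-and-watch commands are roughly the following (a sketch; the pod name is a placeholder, take it from kubectl get pods):

    # Trigger graceful shutdown by deleting the dataplane pod, then keep watching the
    # pod list: the old pod should sit in Terminating for the whole grace period.
    kubectl -n stunner delete pod udp-gateway-xxx-yy --wait=false
    kubectl -n stunner get pods -o wide -w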

There have been some major changes in Kubernetes v1.26 related to how Kubernetes handles graceful shutdown (e.g., https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering) and we'll need to dig a bit deeper to understand how these changes interact with our TURN/UDP streams.

rg0now · Jul 30 '25 18:07

I can't seem to reproduce this on a pure UDP client-server connection without STUNner being involved. This might actually be good news: it indicates the issue may be in our own graceful shutdown implementation after all, which is something we can fix.

Here is how I tested:

  1. Deploy an l7mp/net-debug Deployment, our workhorse for network debugging in Kubernetes. We set the termination grace period to 1 hour and ignore SIGTERM, which is Kubernetes's way of telling us to initiate graceful shutdown and start draining connections.

    kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: iperf-server
      namespace: default
      labels:
        app: iperf-server
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: iperf-server
      template:
        metadata:
          labels:
            app: iperf-server
        spec:
          containers:
          - args:
            - -c
            - trap '' TERM; while true; do sleep 3600; done
            command:
            - /bin/bash
            image: docker.io/l7mp/net-debug:latest
            name: net-debug
          terminationGracePeriodSeconds: 3600
    EOF
    
  2. Run an iperf server. For maximum safety, we again ignore SIGTERMs:

    kubectl exec iperf-server-xxx-yy -it -- /bin/bash -c "trap '' TERM; /usr/bin/iperf -s -p 5001 -u -e -i 1"
    
  3. Expose the iperf UDP server over a LoadBalancer Service and wait until a public IP is assigned to the service, let it be A.B.C.D:

    kubectl expose deployment iperf-server --port 5001 --protocol=UDP --type=LoadBalancer
    
  4. Start a local iperf UDP client in another terminal. Switch back to where you started the iperf server: you should see logs showing that the server is receiving the traffic from the iperf client:

    iperf -c A.B.C.D -p 5001 -u -i 1 -l 100 -b 800 -t 0
    
  5. Delete the server pod: kubectl delete pod iperf-server-xxx-yy. The pod status goes into Terminating and a new iperf-server pod shows up. We expect the iperf UDP client traffic to go away in 30 secs as it happens in our STUNner test, but nothing like that happens: the iperf-server keeps on logging received UDP packets.

If so, that is good news: the problem is in STUNner's handling of graceful shutdown, so we have a chance to address it. Can you run the tests from this and the previous post and tell us the results?

rg0now · Jul 31 '25 09:07

Hey! Unfortunately I'm going on a short vacation till Wednesday; I'll try to get to this on Wednesday, August 6.

surkov-sesame · Aug 01 '25 01:08

The more I look at this the more it seems to be a Kubernetes regression. Some additional info below.

We managed to reproduce this on Kubernetes v1.31.0, but not on v1.30.5. The problem seems to have been introduced somewhere in between. Indeed, there were a couple of updates affecting the UDP kube-proxy datapath in this timeframe, like https://github.com/kubernetes/kubernetes/pull/127318 (not sure if related).

Here is what we see on v1.31 with the iperf test after graceful shutdown has been initiated and we fail the readiness probe:

  1. Old pod enters Terminating state, new pod comes up and becomes Ready:

    $ kubectl -n stunner get pods -o wide
    NAME                           READY   STATUS        RESTARTS   AGE     IP             
    udp-gateway-7f96d74d9f-9g527   1/1     Running       0          3m47s   10.244.0.255   
    udp-gateway-7f96d74d9f-l96pg   0/1     Terminating   0          41h     10.244.0.206   
    
  2. The EndpointSlice status info for the old pod is set to ready: false, serving: false and terminating: true:

    apiVersion: discovery.k8s.io/v1
    kind: EndpointSlice
    metadata:
      name: udp-gateway-ps28f
      namespace: stunner
      labels:
        kubernetes.io/service-name: udp-gateway
    addressType: IPv4
    ports:
    - name: udp-listener
      port: 3478
      protocol: UDP
    endpoints:
    - addresses:
      - 10.244.0.206
      conditions:
        ready: false
        serving: false
        terminating: true
      ...
    - addresses:
      - 10.244.0.255
      conditions:
        ready: true
        serving: true
        terminating: false
      ...
    
  3. This causes the removal of our UDP conntrack entry by kube-proxy. I'm not sure whether it times out (the 30 sec delay makes me think it does) or simply gets cleaned up as per the above-linked PR (the cleanup code seems to run from a ticker set to 30 sec by default; see the iptables-sync-period flag, default 30s, at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy).

  4. The old stunnerd receives no more packets and kube-proxy redirects the UDP flow to the new pod, which of course does not know how to handle it and hence drops it:

    11:18:15.067093 server.go:40: turn DEBUG: Received 104 bytes of udp from 10.224.0.4:49920 on [::]:3478
    11:18:15.067122 server.go:50: turn DEBUG: Received DataPacket from 10.224.0.4:49920
    11:18:15.067129 turn.go:435: turn DEBUG: Received ChannelData from 10.224.0.4:49920
    11:18:15.067144 handlers.go:233: stunner DEBUG: Allocation error: client=10.224.0.4:49920-[::]:3478:UDP, error=unable to handle ChannelData from 10.224.0.4:49920: no allocation found 10.224.0.4:49920:[::]:3478
    11:18:15.067152 server.go:245: turn DEBUG: Failed to handle datagram: unable to handle ChannelData from 10.224.0.4:49920: no allocation found 10.224.0.4:49920:[::]:3478   
    

So the symptom is not that our TURN/UDP packets are lost after we enter graceful shutdown; they are simply redirected to the new stunnerd pod, which of course lacks the allocation context to handle them. This seems to be a kube-proxy regression, I think (not sure).
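To double-check this on a live cluster, the relevant state can be inspected roughly as follows (a sketch; conntrack has to run on the node, e.g. from a privileged node shell or debug container, and the Service name and port are the ones from the test above):

    # The old endpoint should show ready: false, serving: false, terminating: true.
    kubectl -n stunner get endpointslices -l kubernetes.io/service-name=udp-gateway -o yaml
    # On the node: does the UDP conntrack entry for the TURN flow survive, and to which
    # pod IP does it point roughly 30 secs after the shutdown started?
    conntrack -L -p udp --dport 3478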

rg0now · Aug 01 '25 12:08

On the first test: I had some trouble running the turncat test on GKE, even before trying the shutdown. I followed the steps from the wiki and got this error in turncat:

10:51:28.854377 turncat.go:464: turncat WARNING: Cannot connect back to client udp:127.0.0.1:51434: dial udp 127.0.0.1:5003->127.0.0.1:51434: bind: address already in use

(I tried changing the local port 5000 to 5002 or 5003 and it was all the same; port 51434 was unoccupied, so I'm unsure what this binding error means.) At some point I somehow managed to get rid of that warning, not sure how exactly, but the tunnel still didn't work: my local iperf was sending traffic but iperf-server wasn't receiving it, and the server logs only contained this:

------------------------------------------------------------
Server listening on UDP port 5001 with pid 1
Read buffer size: 1470 Byte
UDP buffer size:  208 KByte (default)
------------------------------------------------------------

So I'm sorry, I couldn't confirm the shutdown misbehavior you saw with the first test on GKE yet.

For the second test, though, I had more luck reproducing that on GKE. I was able to confirm the same thing that you described: I killed the pod, and the logs from kubectl exec iperf-server-xxx-yy ... kept printing received packets for at least 5 more minutes (then I stopped the test).

surkov-sesame · Aug 06 '25 17:08

@surkov-sesame We have added a new command line flag --force-ready-status to stunnerd, which, if set, will prevent the TURN server from failing the readiness check during graceful shutdown. This fixes the graceful shutdown issue reliably in our tests.

You can modify the stunnerd command line arguments via the Dataplane CRD:

apiVersion: stunner.l7mp.io/v1
kind: Dataplane
metadata:
  name: default
spec:
  command:
  - stunnerd
  args:
  - -w
  - --udp-thread-num=16
  - --force-ready-status # <--- this is the new flag
  image: docker.io/l7mp/stunnerd:dev
  terminationGracePeriodSeconds: 3600
  ...
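One way to apply the change is to edit the default Dataplane in place (a sketch; the plural resource name dataplanes.stunner.l7mp.io is an assumption on my part):

    # Open the default Dataplane and append --force-ready-status to spec.args.
    kubectl edit dataplanes.stunner.l7mp.io default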

Available in the latest dev release. Can you please try?

rg0now · Aug 27 '25 20:08