
quinn_proto::connection: blocked by congestion control/blocked by pacing

Open tubzby opened this issue 9 months ago • 12 comments

I built a UDP proxy to emulate a network with 20 ms of delay and no packet loss, sitting between a local QUIC client and QUIC server. I'm trying to simulate a 4 Mbps video stream using quinn datagrams, with the data generated as follows (sketched in code after the list):

  1. frames are generated around 30 times a second, with a 33 ms gap between them
  2. the first frame of each round carries 30% of the bitrate: 30% * 4 Mbit = 1.2 Mbit = 150 KBytes
  3. each of the other frames is (4 - 1.2) Mbit / (30 - 1) ≈ 12 KBytes
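
A minimal sketch of that schedule, assuming an established quinn::Connection named conn with datagrams enabled; frames larger than the path MTU are split into MTU-sized datagrams:

    use std::time::Duration;
    use bytes::Bytes;

    async fn send_simulated_stream(conn: quinn::Connection) -> anyhow::Result<()> {
        let mut ticker = tokio::time::interval(Duration::from_millis(33));
        loop {
            for frame_idx in 0..30 {
                ticker.tick().await;
                // First frame of each round: 1.2 Mbit = 150_000 bytes;
                // each of the other 29 frames: ~12_000 bytes.
                let frame = vec![0u8; if frame_idx == 0 { 150_000 } else { 12_000 }];
                let mtu = conn.max_datagram_size().unwrap_or(1200);
                for chunk in frame.chunks(mtu) {
                    conn.send_datagram(Bytes::copy_from_slice(chunk))?;
                }
            }
        }
    }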

I'm using the default quinn::TransportConfig, and I record the bytes sent from the client and received on the server, expecting them to match.

The process ran normally for about 30 seconds; then I saw only 80% of packets were received on the server. While trying to debug, I found some tracing logs:

2025-02-16T12:02:21.203843Z TRACE drive{id=0}: quinn_proto::connection: blocked by congestion control
...
2025-02-16T12:02:41.226979Z TRACE drive{id=0}: quinn_proto::connection: blocked by pacing

I have investigated a little; it seems the window size in TransportConfig should be increased, but I have no idea where the congestion control limit is coming from.

Is there anything wrong with my simulation? 4 Mbps with a 20 ms delay is relatively low.

tubzby avatar Feb 17 '25 08:02 tubzby

I saw only 80% of packets were received on the server

Are you losing IP packets or application-layer datagrams?

If there's no IP packet loss, but you are losing application-layer datagrams, then you're likely trying to send data at a higher rate than the congestion controller thinks is possible, and exhausting buffer space as a result. This should automatically stabilize over time, but you could adjust initial conditions by raising the congestion controller's initial window. Note that this may lead to catastrophic IP packet loss if you overestimate the path's true capacity.

Alternatively, you could make the datagram send buffer much larger, but while that might avoid packet loss if large enough, data will still be sent slowly until congestion control stabilizes, which might not suit your purposes.
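
A sketch of those knobs on TransportConfig; the 1 MiB sizes are arbitrary illustrations, not recommendations:

    let mut transport = quinn::TransportConfig::default();
    // Allow more datagrams to queue on the sender before they're dropped.
    transport.datagram_send_buffer_size(1024 * 1024);
    // Receive-side buffer; passing None would disable incoming datagrams.
    transport.datagram_receive_buffer_size(Some(1024 * 1024));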

Streaming media might benefit from a slightly different congestion control strategy than cubic, but I think this is an open research area. I think @kixelated has been exploring this as part of their MoQ implementation effort.

Ralith avatar Feb 17 '25 20:02 Ralith

Yeah, live media is what is known as "application-limited". It means the congestion controller has trouble increasing the bitrate because there's limited data to send. TCP congestion control algorithms (e.g. Reno, CUBIC, BBR) won't increase the congestion window unless it's being fully utilized. They just weren't designed for frame-by-frame delivery.

An unfortunate reality of live media is that I-frames are huge and bursty. The only time you end up fully utilizing the congestion window is during I-frames. Unfortunately, this means that the congestion window is always sized at least slightly below the average I-frame size.

This means it takes at least 2 RTTs to deliver an I-frame, in addition to the bitrate being artificially lower than possible. If your datagram send/recv buffer is not large enough, you will drop them. It's one of the reasons why I'm using QUIC streams and not datagrams; some frames may take multiple round trips (not to mention retransmissions).

You could ignore the congestion controller and it'll work until it doesn't. Eventually you cause bufferbloat and your packets get queued for a significant time over the network.

The solution I came up with is probing. 1/8th of the time, send padding until the congestion window is full. It's a waste of data but actually lets the congestion control algorithm stress the network (like a speed test) and dramatically increases the estimated bitrate. It's what WebRTC does under the hood too.
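
A rough sketch of that idea (the cadence and helper below are illustrative, not what WebRTC or MoQ actually ship; quinn has no built-in probing):

    use bytes::Bytes;

    /// On every 8th send interval, enqueue roughly one congestion window's
    /// worth of padding datagrams so the controller sees a saturated link.
    fn maybe_probe(conn: &quinn::Connection, interval_count: u64) {
        if interval_count % 8 != 0 {
            return;
        }
        let pad_size = conn.max_datagram_size().unwrap_or(1200);
        let budget = conn.stats().path.cwnd as usize;
        for _ in 0..(budget / pad_size).max(1) {
            // All-zero padding; the application protocol on the receiver
            // must recognize and discard it.
            if conn.send_datagram(Bytes::from(vec![0u8; pad_size])).is_err() {
                break;
            }
        }
    }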

kixelated avatar Feb 17 '25 21:02 kixelated

Are you losing IP packets or application-layer datagrams?

@Ralith I'm losing application-layer datagrams. It stabilized later, but some frames were dropped and the video froze for a while. I'll try increasing initial_window later.

tubzby avatar Feb 18 '25 00:02 tubzby

@kixelated thanks for the detailed explanation.

I also tried QUIC streams, but they're not suitable for real-time communication: if the network is blocked for 5 seconds, it's better to just drop outdated frames instead of queueing them in a buffer.

If your datagram send/recv buffer is not large enough, you will drop them

I have run into the case where quinn prints dropping outgoing datagram, but that's not what's happening here.

send padding until the congestion window is full

Would it be reasonable to estimate the right window size instead? Since we can control how CUBIC behaves, we could avoid sending padding that exhausts the network.

You mentioned WebRTC does the same; can you provide some references? Thanks.

tubzby avatar Feb 18 '25 01:02 tubzby

I have increased CubicConfig.initial_window and the blocked by congestion control/blocked by pacing logs vanished, but I still saw only 70% of the data received in less than a minute.

I confirmed the RTT is around 22 ms and lost_packets is zero.

Is the following log related?

2025-02-18T03:31:38.805794Z TRACE drive{id=0}: quinn_proto::connection: max ack delay reached
2025-02-18T03:31:38.842694Z TRACE drive{id=0}: quinn_proto::connection: timeout timer=MaxAckDelay
2025-02-18T03:31:38.842888Z TRACE drive{id=0}: quinn_proto::connection: max ack delay reached
2025-02-18T03:31:38.879896Z TRACE drive{id=0}: quinn_proto::connection: timeout timer=MaxAckDelay
2025-02-18T03:31:38.880042Z TRACE drive{id=0}: quinn_proto::connection: max ack delay reached

tubzby avatar Feb 18 '25 03:02 tubzby

I still saw only 70% of the data received in less than a minute.

I'd recommend trying to trace the individual lost packets through the stack. If you're not seeing "dropping outgoing datagram" messages, then they should be getting sent. Do they show up in wireshark on the send side? On the receive side? Do you see "dropping stale datagram" debug logs on the receiver? Are you sure there's no IP packet loss?
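
If you need more logs on the receiver, a tracing subscriber will surface them, e.g. (assuming the tracing-subscriber crate with its env-filter feature):

    // Run with RUST_LOG=quinn=trace,quinn_proto=trace
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();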

Is the following log related?

No.

Ralith avatar Feb 18 '25 03:02 Ralith

@kixelated thanks for the detailed explanation.

I also tried QUIC streams, but they're not suitable for real-time communication: if the network is blocked for 5 seconds, it's better to just drop outdated frames instead of queueing them in a buffer.

Create a new QUIC stream for each GoP, cancelling the previous one. That will drop the outdated frames based on decodability, unlike datagrams that will be dropped "randomly".
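
A sketch of that pattern with hypothetical names (the idea, not MoQ's actual API):

    use quinn::{Connection, SendStream, VarInt};

    /// Open a fresh unidirectional stream for a new GoP, abandoning the old one.
    async fn start_gop(
        conn: &Connection,
        prev: Option<SendStream>,
    ) -> anyhow::Result<SendStream> {
        if let Some(mut old) = prev {
            // RESET_STREAM tells the peer to stop waiting for the stale GoP.
            let _ = old.reset(VarInt::from_u32(0));
        }
        Ok(conn.open_uni().await?)
    }

Frames of the current GoP are then written to the returned stream as they are produced.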

send padding until the congestion window is full

Would it be reasonable to estimate the right window size instead? Since we can control how CUBIC behaves, we could avoid sending padding that exhausts the network.

You mentioned WebRTC does the same; can you provide some references? Thanks.

Predicting how the internet will handle a sudden influx of traffic is impossible. The job of the congestion controller is to try higher bitrates and back off if it goes to shit.

https://webrtchacks.com/probing-webrtc-bandwidth-probing-why-and-how-in-gcc/

kixelated avatar Feb 18 '25 10:02 kixelated

I still saw only 70% of the data received in less than a minute.

I'd recommend trying to trace the individual lost packets through the stack. If you're not seeing "dropping outgoing datagram" messages, then they should be getting sent. Do they show up in wireshark on the send side? On the receive side? Do you see "dropping stale datagram" debug logs on the receiver? Are you sure there's no IP packet loss?

Is the following log related?

No.

My bad, the packets assumed lost were actually seen in the next statistics cycle, like:

  1. cycle 1, sent 10, received 7
  2. cycle 2, sent 10, received 13

tubzby avatar Feb 18 '25 13:02 tubzby

Create a new QUIC stream for each GoP, cancelling the previous one. That will drop the outdated frames based on decodability, unlike datagrams that will be dropped "randomly".

I got the idea and checked out the MoQ project; sounds great!

tubzby avatar Feb 18 '25 13:02 tubzby

I had to increase the initial window to 15 times the default value to get rid of the blocked by congestion control/pacing messages.

    use std::sync::Arc;
    use quinn::{congestion, TransportConfig};

    let mut config = TransportConfig::default();
    let mut cubic = congestion::CubicConfig::default();
    // default initial window: 1200 * 10 bytes
    cubic.initial_window(1200 * 10 * 15);
    config.congestion_controller_factory(Arc::new(cubic));

  1. Do I have to set this on both the client and the server side?
  2. Does it require extra memory in the same network environment, compared to the default value?

tubzby avatar Feb 18 '25 13:02 tubzby

Switching to the default BBR congestion controller works for me; what's the risk?
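
For reference, the switch looks roughly like this (BbrConfig::default() leaves quinn's BBR parameters at their defaults):

    use std::sync::Arc;

    let mut config = quinn::TransportConfig::default();
    config.congestion_controller_factory(Arc::new(quinn::congestion::BbrConfig::default()));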

tubzby avatar Feb 18 '25 14:02 tubzby

Do I have to set this on both the client and the server side?

Congestion control affects outgoing data, so you should adjust the congestion control config wherever you're sending data that you want to be affected by the adjusted config.

Does it require extra memory in the same network environment, compared to the default value?

No, memory use is governed by flow control (and datagram buffer sizes).

Switching to the default BBR congestion controller works for me; what's the risk?

Our BBR congestion controller is highly experimental and not currently maintained. It's known to be substantially behind the upstream BBR spec, and was never very well tested. As a result, you may experience excessive packet loss, adversely affect other network traffic, and/or fail to make full use of the link under some conditions.

Ralith avatar Feb 18 '25 17:02 Ralith