Max PTO issue
I ran tperf on the latest mvfst in the following network environment: 40ms delay & 20% packet loss, and eventually got an "Exceeded max PTO" error.
More specifically, on the server I set the buffers with: sysctl -w net.core.wmem_max=52428800 and sysctl -w net.core.wmem_default=52428800, and configured the network card with: sudo tc qdisc add dev enp61s0f0 root netem delay 40ms loss 20%. After that I ran tperf with: ./tperf -mode=server -host=10.99.211.141 -port=6666 -pacing=true -gso=true -congestion=bbr -max_cwnd_mss=860000 -window=65536000. The error is: E0628 12:08:38.546244 26054 tperf.cpp 201] write error with stream=3 error=LocalError: No Error, Exceeded max PTO.
On the client side, I set the buffers with: sysctl -w net.core.rmem_max=52428800 and sysctl -w net.core.rmem_default=52428800, then ran tperf with: ./tperf -mode=client -host=10.99.211.141 -port=6666 -pacing=true -gso=true -congestion=bbr -max_cwnd_mss=860000 -window=65536000 -duration=600. The error is: TPerfClient error: Internal Error.
How should I interpret the MAX PTO problem under these conditions? Can you give some advice on why it occurred and how I can probe into this problem? Thank you very much.
I found the same problem: with bandwidth=1Gb, the server triggers onPTOAlarm 7 times with the error "Exceeded max PTO". Each time the server triggers onPTOAlarm it then sends 2 packets (size=1252); the client receives all of them and sends an ack with size=1267, but the server doesn't seem to receive the ack, or perhaps has errors parsing it?
Q1: Given "constexpr uint64_t kDefaultV4UDPSendPacketLen = 1252", is there any problem with the size-1267 (or larger) packets the client sends in response?
Besides, when calling iobufChainBasedBuildScheduleEncrypt() in QuicTransportFunctions.cpp, the encodedSize is larger than 1252 because of "auto encodedSize = packetBuf->computeChainDataLength()". I traced this to "RegularQuicPacketBuilder::Packet RegularQuicPacketBuilder::buildPacket()", where the bodyLength calculated by "size_t bodyLength = body_->computeChainDataLength()" is larger than 1252.
In normal circumstances, bodyLength is a relatively small, reasonable value, but with 20% packet loss bodyLength increases dramatically until it exceeds 1252, which leads to send failure.
Q2: Could you please give me some advice about how bodyLength tends to change and what causes that change?
Thanks!
Furthermore, in iobufChainBasedBuildScheduleEncrypt() the encodedSize is calculated as "headerLen + bodyLen + aead.getCipherOverhead()". After adding LOG(INFO) I find the max encodedSize is 1268, which exceeds the 1252 limit by 16.
More specifically, headerLen=11, aead.getCipherOverhead()=16, and the max value of bodyLen is 1241. So it is bodyLen that pushes the total over the limit. bodyLen depends on remainingBytes_ (the input to the RegularQuicPacketBuilder constructor), and the original remainingBytes is connection.udpSendPacketLen (1252). I modified the construction in iobufChainBasedBuildScheduleEncrypt as follows:
RegularQuicPacketBuilder pktBuilder(
    connection.udpSendPacketLen,
    std::move(header),
    getAckState(connection, pnSpace).largestAckedByPeer.value_or(0));

to:

RegularQuicPacketBuilder pktBuilder(
    connection.udpSendPacketLen - aead.getCipherOverhead(),
    std::move(header),
    getAckState(connection, pnSpace).largestAckedByPeer.value_or(0));
I wonder if I'm misunderstanding something with this change; please tell me if I'm wrong. Thanks!
@Enjia may I know the revision of your repo? There was a problem with limiting packet size when we clone an old packet, exactly due to miscounting the cipher overhead. But I think I fixed that recently.
@Ellie-fans : MAX PTO means the sender keeps sending but never gets acks, up to the MAX PTO limit number of times. With 20% loss I'm not totally surprised it happens, but I think it would be great if we can rule out other reasons/bugs. In your test setup, is it possible to know whether the peer received the packets? (We may need to fix Qlog for you to get that if you don't have your own logging around it.)
The qlog problem i mentioned is in https://github.com/facebookincubator/mvfst/issues/147
The mvfst version I use is from June 23rd. I also tested the newest version (July 3rd) and found the problem is still there, so I think it may be different from the one you fixed?
The client sends an ack larger than 1252 in response, and the server does receive that packet: it enters QuicServerWorker::onDataAvailable, but because of the packet size the truncated param is true, so it ends up in the error branch:

if (truncated) {
  // This is an error, drop the packet.
  return;
}

which I think could be the reason the server never processes the ack, so the PTO keeps firing.
Actually, MAX PTO also occurs with just the 40ms delay; the only difference across loss rates is the probability of its occurrence.
I noticed the update "Count cipherOverhead into Quic packet builder packet size", which has solved my problem; thanks for your contribution. But after applying this modification, I find that "max_size exceeded in small_vector" occurs with small probability. It seems similar to "Problem about ackBlock under high loss rate #149". Can you give me some advice about that?