msquic
msquic copied to clipboard
AppData/WithSend0RttArgs1.Send0Rtt/* failed
Describe the bug
I've been seeing a lot of failures with output like the following:
D:\a\1\msquic\src\test\lib\DataTest.cpp(149): error: 0-RTT wasn't used for stream data.
See: https://dev.azure.com/ms/msquic/_build/results?buildId=272959&view=logs&j=b7f0e904-2cea-585b-8e20-67062f2bef10&t=81f432f0-d5f0-5258-5253-0dcccc920e38&l=1547
It seems like a different test case every time, but has definitely picked up recently. Something has likely regressed things, perhaps some of the work in WAN perf.
Affected OS
- [X] Windows Server 2022
- [X] Windows 11
- [X] Windows Insider Preview (specify affected build below)
- [X] Ubuntu
- [X] Debian
- [ ] Other (specify below)
Additional OS information
No response
MsQuic version
main
Steps taken to reproduce bug
Run Azure Pipelines BVTs for OpenSSL.
Expected behavior
All tests should pass
Actual outcome
The send 0-RTT tests sometime fail because 0-RTT wasn't used for all streams.
Additional details
No response
I took a look at two different logs from failures in the release/2.0 weekly run for 3 different variables of the Send0Rtt tests (here. It looks like the packets end up getting retransmitted because they were somehow lost the first time. When they are retransmitted, they end up using the 1-RTT keys instead. The problem is that the logs are incomplete, meaning some were dropped. I wonder if the CPU usage sometimes goes too high on the VMs, which results in log drop; but could also somehow be the cause of the packet drop?
6379 [0]0E40.157C::2022-04-18 00:40:56.537 [Microsoft-Quic][strm][0x2C6610C68D0] Recovering offset 0 up to 662
6386 [0]0E40.157C::2022-04-18 00:40:56.537 [Microsoft-Quic][strm][0x2C6610C68D0] Recovering offset 662 up to 1000
6488 [2]0E40.157C::2022-04-18 00:40:56.538 [Microsoft-Quic][strm][0x2C6610C68D0] Built stream frame, offset=0 len=709 fin=0
6508 [1]0E40.157C::2022-04-18 00:40:56.538 [Microsoft-Quic][strm][0x2C6610C68D0] Built stream frame, offset=709 len=291 fin=1
6731 [2]0E40.157C::2022-04-18 00:40:56.539 [Microsoft-Quic][strm][0x2C66114DD70] Allocated, Conn=0x2C66114E230
6735 [2]0E40.157C::2022-04-18 00:40:56.539 [Microsoft-Quic][conn][0x2C66114E230] Indicating QUIC_CONNECTION_EVENT_PEER_STREAM_STARTED [000002C66114DD70, 0x1]
6740 [2]0E40.157C::2022-04-18 00:40:56.539 [Microsoft-Quic][strm][0x2C66114DD70] Received 709 bytes, offset=0 Ready=1
6750 [2]0E40.157C::2022-04-18 00:40:56.539 [Microsoft-Quic][strm][0x2C66114DD70] Received 291 bytes, offset=709 Ready=1
6801 [2]0E40.157C::2022-04-18 00:40:56.539 [Microsoft-Quic][strm][0x2C66114DD70] Indicating QUIC_STREAM_EVENT_RECEIVE [1000 bytes, 1 buffers, 2 flags]
repro on Linux (doesn't happen with gdb for some reason)
while :; do
./artifacts/bin/linux/x64_Release_openssl/msquictest --gtest_filter="AppData/WithSend0RttArgs1.Send0Rtt/*" --gtest_break_on_failure
if [ $? -ne 0 ]; then
break
fi
done
core dump seems not having richer data. below is lttng trace generator
echo $@
dirprefix="msquic_lttng"
num=`find ./* -maxdepth 0 -name "$dirprefix*" | wc -l`
dirname=$dirprefix$num
while :; do
mkdir -p $dirname/data && lttng create msquic -o=./$dirname/data && sessionCreated=$? && lttng enable-event --userspace "CLOG_*" && lttng add-context --userspace --type=vpid --type=vtid && lttng start
testResult=0;
if [ $? -eq 0 ]; then
./artifacts/bin/linux/x64_Debug_openssl/msquictest --gtest_filter="AppData/WithSend0RttArgs1.Send0Rtt/*" --gtest_break_on_failure
testResult=$?
lttng stop
fi
if [ $sessionCreated -eq 0 ]; then
lttng destroy msquic
fi
if [ $testResult -ne 0 ]; then
babeltrace --names all $dirname/data > $dirname/quic.babel.txt
${HOME}/.dotnet/tools/clog2text_lttng -i $dirname/quic.babel.txt -s ./src/manifest/clog.sidecar -o $dirname/quic.log --showTimestamp --showCpuInfo
break
fi
rm -rf msquic_lttng0
done
Could repro with QUIC_EXECUTION_PROFILE_TYPE_SCAVENGER option.
These three files are from AppData/WithSend0RttArgs1.Send0Rtt/167
167fail_scaven.log happened middle of tests AppData/WithSend0RttArgs1.Send0Rtt/* so this doesn't include initialization phase.
There is DROP packet by "No matching client connection", but interesting part is its Dst and Src. This sends same port, but there is no such case in successful test as far as I checked
[0][236e.255b][02:44:26.423092][data][0x7f3d3c0b37e0] Send 1200 bytes in 1 buffers (segment=1200) Dst=[::1]:42900, Src=[::1]:42900
[0][236e.255b][02:44:26.423100][pack][1099511632313] Batch sent
[0][236e.255b][02:44:26.423101][conn][0x55d9353a6f30] Setting loss detection 0 timer for 998989 us. (ProbeCount=0)
[0][236e.255b][02:44:26.423102][conn][0x55d9353a6f30] Setting 2, delay=998989 us
[0][236e.255b][02:44:26.423102][time][0x55d935244fb0] Updating Connection 0x55d9353a6f30.
[0][236e.255b][02:44:26.423102][time][0x55d935244fb0] Next Expiration = {20359904118, 0x55d9353a6f30}.
[0][236e.255b][02:44:26.423102][conn][0x55d9353a6f30] Flush complete flags=0x0
[0][236e.255b][02:44:26.423103][conn][0x55d9353a6f30] Scheduling: 0
[0][236e.255b][02:44:26.423109][data][0x7f3d3c0b37e0] Recv 1200 bytes (segment=0) Src=[::1]:42900 Dst=[::1]:42900
[0][236e.255b][02:44:26.423109][pack][1099511633229] Received
[0][236e.255b][02:44:26.423111][bind][0x7f3d3c083380] DROP packet Dst=[::1]:42900 Src=[::1]:42900 Reason=No matching client connection.
https://github.com/microsoft/msquic/blob/335d3e96f6f8350ae5c2cf385eb2040e939e8d6c/src/platform/datapath_epoll.c#L1237-L1259 Here seems to be the root cause. Loopback might need additional check or completely disable
https://github.com/microsoft/msquic/blob/335d3e96f6f8350ae5c2cf385eb2040e939e8d6c/src/platform/datapath_epoll.c#L1233-L1237 What does this mean? I believe the error comes from client side
Wow. The client and server are running on the same port? We're supposed to have some code that protects this scenario.
See QUIC_SHARED_EPHEMERAL_WORKAROUND in the core code.
Perhaps that code has a bug when one binding is connected (have both local and remote addresses) and when the other is just a listening binding/socket (just local address). I bet that's it. Fix that, and we should be good.
Closed by https://github.com/microsoft/msquic/pull/3439.