msquic icon indicating copy to clipboard operation
msquic copied to clipboard

AppData/WithSend0RttArgs1.Send0Rtt/* failed

Open nibanks opened this issue 3 years ago • 1 comments

Describe the bug

I've been seeing a lot of failures with output like the following:

D:\a\1\msquic\src\test\lib\DataTest.cpp(149): error: 0-RTT wasn't used for stream data.

See: https://dev.azure.com/ms/msquic/_build/results?buildId=272959&view=logs&j=b7f0e904-2cea-585b-8e20-67062f2bef10&t=81f432f0-d5f0-5258-5253-0dcccc920e38&l=1547

It seems like a different test case every time, but has definitely picked up recently. Something has likely regressed things, perhaps some of the work in WAN perf.

Affected OS

  • [X] Windows Server 2022
  • [X] Windows 11
  • [X] Windows Insider Preview (specify affected build below)
  • [X] Ubuntu
  • [X] Debian
  • [ ] Other (specify below)

Additional OS information

No response

MsQuic version

main

Steps taken to reproduce bug

Run Azure Pipelines BVTs for OpenSSL.

Expected behavior

All tests should pass

Actual outcome

The send 0-RTT tests sometime fail because 0-RTT wasn't used for all streams.

Additional details

No response

nibanks avatar Feb 14 '22 14:02 nibanks

I took a look at two different logs from failures in the release/2.0 weekly run for 3 different variables of the Send0Rtt tests (here. It looks like the packets end up getting retransmitted because they were somehow lost the first time. When they are retransmitted, they end up using the 1-RTT keys instead. The problem is that the logs are incomplete, meaning some were dropped. I wonder if the CPU usage sometimes goes too high on the VMs, which results in log drop; but could also somehow be the cause of the packet drop?

6379 [0]0E40.157C::‎2022‎-‎04‎-‎18 00:40:56.537 [Microsoft-Quic][strm][0x2C6610C68D0] Recovering offset 0 up to 662 
6386 [0]0E40.157C::‎2022‎-‎04‎-‎18 00:40:56.537 [Microsoft-Quic][strm][0x2C6610C68D0] Recovering offset 662 up to 1000 
6488 [2]0E40.157C::‎2022‎-‎04‎-‎18 00:40:56.538 [Microsoft-Quic][strm][0x2C6610C68D0] Built stream frame, offset=0 len=709 fin=0 
6508 [1]0E40.157C::‎2022‎-‎04‎-‎18 00:40:56.538 [Microsoft-Quic][strm][0x2C6610C68D0] Built stream frame, offset=709 len=291 fin=1 

6731 [2]0E40.157C::‎2022‎-‎04‎-‎18 00:40:56.539 [Microsoft-Quic][strm][0x2C66114DD70] Allocated, Conn=0x2C66114E230 
6735 [2]0E40.157C::‎2022‎-‎04‎-‎18 00:40:56.539 [Microsoft-Quic][conn][0x2C66114E230] Indicating QUIC_CONNECTION_EVENT_PEER_STREAM_STARTED [000002C66114DD70, 0x1] 
6740 [2]0E40.157C::‎2022‎-‎04‎-‎18 00:40:56.539 [Microsoft-Quic][strm][0x2C66114DD70] Received 709 bytes, offset=0 Ready=1 
6750 [2]0E40.157C::‎2022‎-‎04‎-‎18 00:40:56.539 [Microsoft-Quic][strm][0x2C66114DD70] Received 291 bytes, offset=709 Ready=1 
6801 [2]0E40.157C::‎2022‎-‎04‎-‎18 00:40:56.539 [Microsoft-Quic][strm][0x2C66114DD70] Indicating QUIC_STREAM_EVENT_RECEIVE [1000 bytes, 1 buffers, 2 flags] 

quic (2).log

nibanks avatar Apr 18 '22 14:04 nibanks

repro on Linux (doesn't happen with gdb for some reason)

while :; do
    ./artifacts/bin/linux/x64_Release_openssl/msquictest --gtest_filter="AppData/WithSend0RttArgs1.Send0Rtt/*" --gtest_break_on_failure
    if [ $? -ne 0 ]; then
        break
    fi
done

ami-GS avatar Feb 12 '23 03:02 ami-GS

core dump seems not having richer data. below is lttng trace generator

echo $@
dirprefix="msquic_lttng"
num=`find ./* -maxdepth 0 -name "$dirprefix*" | wc -l`
dirname=$dirprefix$num
while :; do
    mkdir -p $dirname/data && lttng create msquic -o=./$dirname/data && sessionCreated=$? && lttng enable-event --userspace "CLOG_*" && lttng add-context --userspace --type=vpid --type=vtid && lttng start
    testResult=0;
    if [ $? -eq 0 ]; then
        ./artifacts/bin/linux/x64_Debug_openssl/msquictest --gtest_filter="AppData/WithSend0RttArgs1.Send0Rtt/*" --gtest_break_on_failure
        testResult=$?
        lttng stop
    fi
    if [ $sessionCreated -eq 0 ]; then
        lttng destroy msquic
    fi

    if [ $testResult -ne 0 ]; then
        babeltrace --names all $dirname/data > $dirname/quic.babel.txt
        ${HOME}/.dotnet/tools/clog2text_lttng -i $dirname/quic.babel.txt -s ./src/manifest/clog.sidecar -o $dirname/quic.log --showTimestamp --showCpuInfo
        break
    fi
    rm -rf msquic_lttng0
done

ami-GS avatar Feb 13 '23 19:02 ami-GS

quic.zip

ami-GS avatar Feb 13 '23 20:02 ami-GS

Could repro with QUIC_EXECUTION_PROFILE_TYPE_SCAVENGER option. These three files are from AppData/WithSend0RttArgs1.Send0Rtt/167 167fail_scaven.log happened middle of tests AppData/WithSend0RttArgs1.Send0Rtt/* so this doesn't include initialization phase.

167success.log 167fail_scaven.log 167success_scaven.log

ami-GS avatar Feb 13 '23 23:02 ami-GS

There is DROP packet by "No matching client connection", but interesting part is its Dst and Src. This sends same port, but there is no such case in successful test as far as I checked

[0][236e.255b][02:44:26.423092][data][0x7f3d3c0b37e0] Send 1200 bytes in 1 buffers (segment=1200) Dst=[::1]:42900, Src=[::1]:42900
[0][236e.255b][02:44:26.423100][pack][1099511632313] Batch sent
[0][236e.255b][02:44:26.423101][conn][0x55d9353a6f30] Setting loss detection 0 timer for 998989 us. (ProbeCount=0)
[0][236e.255b][02:44:26.423102][conn][0x55d9353a6f30] Setting 2, delay=998989 us
[0][236e.255b][02:44:26.423102][time][0x55d935244fb0] Updating Connection 0x55d9353a6f30.
[0][236e.255b][02:44:26.423102][time][0x55d935244fb0] Next Expiration = {20359904118, 0x55d9353a6f30}.
[0][236e.255b][02:44:26.423102][conn][0x55d9353a6f30] Flush complete flags=0x0
[0][236e.255b][02:44:26.423103][conn][0x55d9353a6f30] Scheduling: 0
[0][236e.255b][02:44:26.423109][data][0x7f3d3c0b37e0] Recv 1200 bytes (segment=0) Src=[::1]:42900 Dst=[::1]:42900
[0][236e.255b][02:44:26.423109][pack][1099511633229] Received
[0][236e.255b][02:44:26.423111][bind][0x7f3d3c083380] DROP packet Dst=[::1]:42900 Src=[::1]:42900 Reason=No matching client connection.

ami-GS avatar Feb 14 '23 01:02 ami-GS

https://github.com/microsoft/msquic/blob/335d3e96f6f8350ae5c2cf385eb2040e939e8d6c/src/platform/datapath_epoll.c#L1237-L1259 Here seems to be the root cause. Loopback might need additional check or completely disable

ami-GS avatar Feb 15 '23 00:02 ami-GS

https://github.com/microsoft/msquic/blob/335d3e96f6f8350ae5c2cf385eb2040e939e8d6c/src/platform/datapath_epoll.c#L1233-L1237 What does this mean? I believe the error comes from client side

ami-GS avatar Feb 15 '23 00:02 ami-GS

Wow. The client and server are running on the same port? We're supposed to have some code that protects this scenario.

nibanks avatar Feb 15 '23 00:02 nibanks

See QUIC_SHARED_EPHEMERAL_WORKAROUND in the core code.

nibanks avatar Feb 15 '23 00:02 nibanks

Perhaps that code has a bug when one binding is connected (have both local and remote addresses) and when the other is just a listening binding/socket (just local address). I bet that's it. Fix that, and we should be good.

nibanks avatar Feb 15 '23 00:02 nibanks

Closed by https://github.com/microsoft/msquic/pull/3439.

nibanks avatar Feb 17 '23 13:02 nibanks