grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

Flaky Test: TestFullHandshake

Open arvindbr8 opened this issue 2 years ago • 3 comments

Saw this flake recently on a presubmit run. A re-run made it go away. Opening this issue to track if this comes up again.

--- FAIL: Test/FullHandshake (20.12s)
        tlogger.go:116: INFO server.go:629 [core] [Server #1] Server created  (t=+20.193184ms)
        tlogger.go:116: INFO server.go:817 [core] [Server #1 ListenSocket #2] ListenSocket created  (t=+23.45913ms)
        tlogger.go:116: INFO server.go:629 [core] [Server #3] Server created  (t=+25.477758ms)
        tlogger.go:116: INFO server.go:817 [core] [Server #3 ListenSocket #4] ListenSocket created  (t=+25.974065ms)
        tlogger.go:116: INFO clientconn.go:177 [core] [Channel #5] Channel created  (t=+27.240883ms)
        tlogger.go:116: INFO clientconn.go:1628 [core] [Channel #5] original dial target is: "127.0.0.1:34515"  (t=+28.171396ms)
        tlogger.go:116: INFO clientconn.go:1633 [core] [Channel #5] dial target "127.0.0.1:34515" parse failed: parse "127.0.0.1:34515": first path segment in URL cannot contain colon  (t=+28.847306ms)
        tlogger.go:116: INFO clientconn.go:1649 [core] [Channel #5] fallback to scheme "passthrough"  (t=+29.002208ms)
        tlogger.go:116: INFO clientconn.go:1657 [core] [Channel #5] parsed dial target is: {Scheme:passthrough Authority: URL:{Scheme:passthrough Opaque: User: Host: Path:/127.0.0.1:34515 RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}  (t=+30.920035ms)
        tlogger.go:116: INFO clientconn.go:1735 [core] [Channel #5] Channel authority set to "127.0.0.1:34515"  (t=+31.404241ms)
        tlogger.go:116: INFO resolver_conn_wrapper.go:192 [core] [Channel #5] Resolver state updated: {
              "Addresses": [
                {
                  "Addr": "127.0.0.1:34515",
                  "ServerName": "",
                  "Attributes": null,
                  "BalancerAttributes": null,
                  "Type": 0,
                  "Metadata": null
                }
              ],
              "ServiceConfig": null,
              "Attributes": null
            } (resolver returned new addresses)  (t=+37.599329ms)
        tlogger.go:116: INFO balancer_conn_wrappers.go:165 [core] [Channel #5] Channel switches to new LB policy "pick_first"  (t=+39.878261ms)
        tlogger.go:116: INFO clientconn.go:760 [core] [Channel #5 SubChannel #6] Subchannel created  (t=+41.824788ms)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to CONNECTING  (t=+42.595399ms)
        tlogger.go:116: INFO clientconn.go:1146 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to CONNECTING  (t=+43.230008ms)
        tlogger.go:116: INFO clientconn.go:1260 [core] [Channel #5 SubChannel #6] Subchannel picks a new address "127.0.0.1:34515" to connect  (t=+43.988118ms)
        tlogger.go:116: INFO clientconn.go:177 [core] [Channel #7] Channel created  (t=+49.337894ms)
        tlogger.go:116: INFO clientconn.go:1628 [core] [Channel #7] original dial target is: "127.0.0.1:39445"  (t=+50.690513ms)
        tlogger.go:116: INFO clientconn.go:1633 [core] [Channel #7] dial target "127.0.0.1:39445" parse failed: parse "127.0.0.1:39445": first path segment in URL cannot contain colon  (t=+51.101518ms)
        tlogger.go:116: INFO clientconn.go:1649 [core] [Channel #7] fallback to scheme "passthrough"  (t=+51.21942ms)
        tlogger.go:116: INFO clientconn.go:1657 [core] [Channel #7] parsed dial target is: {Scheme:passthrough Authority: URL:{Scheme:passthrough Opaque: User: Host: Path:/127.0.0.1:39445 RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}  (t=+51.514424ms)
        tlogger.go:116: INFO clientconn.go:1735 [core] [Channel #7] Channel authority set to "127.0.0.1:39445"  (t=+51.660626ms)
        tlogger.go:116: INFO resolver_conn_wrapper.go:192 [core] [Channel #7] Resolver state updated: {
              "Addresses": [
                {
                  "Addr": "127.0.0.1:39445",
                  "ServerName": "",
                  "Attributes": null,
                  "BalancerAttributes": null,
                  "Type": 0,
                  "Metadata": null
                }
              ],
              "ServiceConfig": null,
              "Attributes": null
            } (resolver returned new addresses)  (t=+56.711497ms)
        tlogger.go:116: INFO balancer_conn_wrappers.go:165 [core] [Channel #7] Channel switches to new LB policy "pick_first"  (t=+57.075602ms)
        tlogger.go:116: INFO clientconn.go:760 [core] [Channel #7 SubChannel #8] Subchannel created  (t=+57.505509ms)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #7] Channel Connectivity change to CONNECTING  (t=+57.71[79](https://github.com/grpc/grpc-go/actions/runs/4988867731/jobs/8932073496#step:8:80)12ms)
        tlogger.go:116: INFO clientconn.go:1146 [core] [Channel #7 SubChannel #8] Subchannel Connectivity change to CONNECTING  (t=+61.245961ms)
        tlogger.go:116: INFO clientconn.go:1260 [core] [Channel #7 SubChannel #8] Subchannel picks a new address "127.0.0.1:39445" to connect  (t=+61.760068ms)
        tlogger.go:116: INFO clientconn.go:1146 [core] [Channel #7 SubChannel #8] Subchannel Connectivity change to READY  (t=+81.970553ms)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #7] Channel Connectivity change to READY  (t=+82.49296ms)
        tlogger.go:116: INFO server.go:932 [core] [Server #3] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "ServerHandshake(\"127.0.0.1:58176\") failed: rpc error: code = Unknown desc = processNext failure: unexpected in bytes: got: [67 108 105 101 110 116 70 105 110 105 115 104 101 100 44 0 0 0 6 0 0 0 224 132 120 35 97 38 210 56 213 235 167 222 29 147 103 49 0 227 247 52 40 168 203 65 87 134 168 121 61 133 108 52 235 103 208 244 154 79 210 166 29 0 0 0 6 0 0 0 113 90 216 11 71 147 44 214 233 189 226 211 166 249 231 235 195 168 133 179 109 161 147 152 87], want: [67 108 105 101 110 116 70 105 110 105 115 104 101 100]"  (t=+151.47433ms)
        tlogger.go:116: WARNING clientconn.go:1321 [core] [Channel #5 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:34515", ServerName: "127.0.0.1:34515", }. Err: connection error: desc = "error reading server preface: EOF"  (t=+154.133467ms)
        tlogger.go:116: INFO clientconn.go:1148 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "error reading server preface: EOF"  (t=+154.753376ms)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to TRANSIENT_FAILURE  (t=+155.00218ms)
        tlogger.go:116: INFO clientconn.go:1148 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to IDLE, last error: connection error: desc = "error reading server preface: EOF"  (t=+1.15557845s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to IDLE  (t=+1.156143158s)
        tlogger.go:116: INFO clientconn.go:1146 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to CONNECTING  (t=+1.156450862s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to CONNECTING  (t=+1.156[80](https://github.com/grpc/grpc-go/actions/runs/4988867731/jobs/8932073496#step:8:81)3967s)
        tlogger.go:116: INFO clientconn.go:1260 [core] [Channel #5 SubChannel #6] Subchannel picks a new address "127.0.0.1:34515" to connect  (t=+1.157490577s)
        tlogger.go:116: INFO server.go:932 [core] [Server #3] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "ServerHandshake(\"127.0.0.1:5[81](https://github.com/grpc/grpc-go/actions/runs/4988867731/jobs/8932073496#step:8:82)90\") failed: rpc error: code = Unknown desc = processNext failure: unexpected in bytes: got: [67 108 105 101 110 116 70 105 110 105 115 104 101 100 44 0 0 0 6 0 0 0 224 132 120 35 97 38 210 56 213 235 167 222 29 147 103 49 0 227 247 52 40 168 203 65 87 134 168 121 61 133 108 52 235 103 208 244 154 79 210 166 29 0 0 0 6 0 0 0 113 90 216 11 71 147 44 214 233 189 226 211 166 249 231 235 195 168 133 179 109 161 147 152 87], want: [67 108 105 101 110 116 70 105 110 105 115 104 101 100]"  (t=+1.163057155s)
        tlogger.go:116: WARNING clientconn.go:1321 [core] [Channel #5 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:34515", ServerName: "127.0.0.1:34515", }. Err: connection error: desc = "error reading server preface: EOF"  (t=+1.163527261s)
        tlogger.go:116: INFO clientconn.go:1148 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "error reading server preface: EOF"  (t=+1.163651663s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to TRANSIENT_FAILURE  (t=+1.163792865s)
        tlogger.go:116: INFO clientconn.go:1148 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to IDLE, last error: connection error: desc = "error reading server preface: EOF"  (t=+2.745313905s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to IDLE  (t=+2.745486907s)
        tlogger.go:116: INFO clientconn.go:1146 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to CONNECTING  (t=+2.750087372s)
        tlogger.go:116: INFO clientconn.go:1260 [core] [Channel #5 SubChannel #6] Subchannel picks a new address "127.0.0.1:34515" to connect  (t=+2.750340575s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to CONNECTING  (t=+2.750[82](https://github.com/grpc/grpc-go/actions/runs/4988867731/jobs/8932073496#step:8:83)1882s)
        tlogger.go:116: INFO server.go:932 [core] [Server #3] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "ServerHandshake(\"127.0.0.1:58206\") failed: rpc error: code = Unknown desc = processNext failure: unexpected in bytes: got: [67 108 105 101 110 116 70 105 110 105 115 104 101 100 44 0 0 0 6 0 0 0 224 132 120 35 97 38 210 56 213 235 167 222 29 147 103 49 0 227 247 52 40 168 203 65 87 134 168 121 61 133 108 52 235 103 208 244 154 79 210 166 29 0 0 0 6 0 0 0 113 90 216 11 71 147 44 214 233 189 226 211 166 249 231 235 195 168 133 179 109 161 147 152 87], want: [67 108 105 101 110 116 70 105 110 105 115 104 101 100]"  (t=+2.756089656s)
        tlogger.go:116: WARNING clientconn.go:1321 [core] [Channel #5 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:34515", ServerName: "127.0.0.1:34515", }. Err: connection error: desc = "error reading server preface: EOF"  (t=+2.756556863s)
        tlogger.go:116: INFO clientconn.go:1148 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "error reading server preface: EOF"  (t=+2.756684965s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to TRANSIENT_FAILURE  (t=+2.756800666s)
        tlogger.go:116: INFO clientconn.go:1148 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to IDLE, last error: connection error: desc = "error reading server preface: EOF"  (t=+5.630322174s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to IDLE  (t=+5.630603778s)
        tlogger.go:116: INFO clientconn.go:1146 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to CONNECTING  (t=+5.63860729s)
        tlogger.go:116: INFO clientconn.go:1260 [core] [Channel #5 SubChannel #6] Subchannel picks a new address "127.0.0.1:34515" to connect  (t=+5.638802793s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to CONNECTING  (t=+5.638978996s)
        tlogger.go:116: INFO server.go:932 [core] [Server #3] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "ServerHandshake(\"127.0.0.1:58220\") failed: rpc error: code = Unknown desc = processNext failure: unexpected in bytes: got: [67 108 105 101 110 116 70 105 110 105 115 104 101 100 44 0 0 0 6 0 0 0 224 132 120 35 97 38 210 56 213 235 167 222 29 147 103 49 0 227 247 52 40 168 203 65 87 134 168 121 61 133 108 52 235 103 208 244 154 79 210 166 29 0 0 0 6 0 0 0 113 90 216 11 71 147 44 214 233 189 226 211 166 249 231 235 195 168 133 179 109 161 147 152 87], want: [67 108 105 101 110 116 70 105 110 105 115 104 101 100]"  (t=+5.643249056s)
        tlogger.go:116: WARNING clientconn.go:1321 [core] [Channel #5 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:34515", ServerName: "127.0.0.1:34515", }. Err: connection error: desc = "error reading server preface: EOF"  (t=+5.643613961s)
        tlogger.go:116: INFO clientconn.go:1148 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "error reading server preface: EOF"  (t=+5.643856564s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to TRANSIENT_FAILURE  (t=+5.643987266s)
        tlogger.go:116: INFO clientconn.go:1148 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to IDLE, last error: connection error: desc = "error reading server preface: EOF"  (t=+10.047904767s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to IDLE  (t=+10.04809967s)
        tlogger.go:116: INFO clientconn.go:1146 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to CONNECTING  (t=+10.053256342s)
        tlogger.go:116: INFO clientconn.go:1260 [core] [Channel #5 SubChannel #6] Subchannel picks a new address "127.0.0.1:34515" to connect  (t=+10.053455945s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to CONNECTING  (t=+10.053603947s)
        tlogger.go:116: INFO clientconn.go:1146 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to READY  (t=+10.060843349s)
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to READY  (t=+10.061084552s)
        alts_test.go:356: c.UnaryCall() failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
        tlogger.go:116: INFO clientconn.go:451 [core] [Channel #5] Channel Connectivity change to SHUTDOWN  (t=+10.062399471s)
        tlogger.go:116: INFO clientconn.go:1146 [core] [Channel #5 SubChannel #6] Subchannel Connectivity change to SHUTDOWN  (t=+10.064078995s)
        tlogger.go:116: INFO clientconn.go:1496 [core] [Channel #5 SubChannel #6] Subchannel deleted  (t=+10.064369099s)
        tlogger.go:116: INFO clientconn.go:1096 [core] [Channel #5] Channel deleted  (t=+10.064527001s)
        tlogger.go:116: INFO server.go:768 [core] [Server #3 ListenSocket #4] ListenSocket deleted  (t=+10.066885534s)
        tlogger.go:116: INFO server.go:768 [core] [Server #1 ListenSocket #2] ListenSocket deleted  (t=+10.067759146s)
        tlogger.go:116: INFO clientconn.go:1146 [core] [Channel #7 SubChannel #8] Subchannel Connectivity change to IDLE  (t=+10.06[83](https://github.com/grpc/grpc-go/actions/runs/4988867731/jobs/8932073496#step:8:84)[90](https://github.com/grpc/grpc-go/actions/runs/4988867731/jobs/8932073496#step:8:91)655s)
        tlogger.go:[116](https://github.com/grpc/grpc-go/actions/runs/4988867731/jobs/8932073496#step:8:117): INFO clientconn.go:451 [core] [Channel #7] Channel Connectivity change to IDLE  (t=+10.074070535s)
        grpctest.go:39: Leaked goroutine: goroutine 21 [select]:
            google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run(0x400008da60, {0x602e58, 0x40007c8050})
            	/home/runner/work/grpc-go/grpc-go/internal/grpcsync/callback_serializer.go:67 +0xdc
            created by google.golang.org/grpc/internal/grpcsync.NewCallbackSerializer
            	/home/runner/work/grpc-go/grpc-go/internal/grpcsync/callback_serializer.go:52 +0x13c
        grpctest.go:39: Leaked goroutine: goroutine 22 [select]:
            google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run(0x400008dab0, {0x602e58, 0x40007c80a0})
            	/home/runner/work/grpc-go/grpc-go/internal/grpcsync/callback_serializer.go:67 +0xdc
            created by google.golang.org/grpc/internal/grpcsync.NewCallbackSerializer
            	/home/runner/work/grpc-go/grpc-go/internal/grpcsync/callback_serializer.go:52 +0x13c
        grpctest.go:60: Leak check disabled for future tests
FAIL

run: https://github.com/grpc/grpc-go/actions/runs/4988867731/jobs/8932073496

arvindbr8 avatar May 17 '23 20:05 arvindbr8

This looks wrong:

https://github.com/grpc/grpc-go/blob/417d4b6895679bd9378cb37c2afecf6a292eb267/credentials/alts/alts_test.go#L324-L327

Can you take a look, please @matthewstevenson88? This test is flaking which can delay our development cycles.

dfawley avatar May 17 '23 22:05 dfawley

Thanks for flagging this, I will take a look today.

matthewstevenson88 avatar May 18 '23 14:05 matthewstevenson88

I've sent #6300 to fix this issue. Thanks again for finding the flake!

matthewstevenson88 avatar May 18 '23 18:05 matthewstevenson88

I see a similar/same leak in an upgrade deps PR in our project: https://github.com/googleapis/gapic-generator-go/pull/1362

--- FAIL: TestComplianceSuite (5.07s)
    leakcheck.go:111: Leaked goroutine: goroutine 4 [select]:
        google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run(0xc000080480, {0xdc0458, 0xc000037540})
        	/home/runner/go/pkg/mod/google.golang.org/[email protected]/internal/grpcsync/callback_serializer.go:83 +0x12a
        created by google.golang.org/grpc/internal/grpcsync.NewCallbackSerializer
        	/home/runner/go/pkg/mod/google.golang.org/[email protected]/internal/grpcsync/callback_serializer.go:55 +0x138
    leakcheck.go:111: Leaked goroutine: goroutine 5 [select]:
        google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run(0xc0000804e0, {0xdc0458, 0xc000037580})
        	/home/runner/go/pkg/mod/google.golang.org/[email protected]/internal/grpcsync/callback_serializer.go:83 +0x12a
        created by google.golang.org/grpc/internal/grpcsync.NewCallbackSerializer
        	/home/runner/go/pkg/mod/google.golang.org/[email protected]/internal/grpcsync/callback_serializer.go:55 +0x138

codyoss avatar Jun 23 '23 20:06 codyoss

Sorry this issue should have probably been closed when #6300 got merged.

@codyoss Is there anything in your failing test that indicates that the problem is related to ALTS? Or is the similarity just that the failure logs look similar?

matthewstevenson88 avatar Jun 26 '23 17:06 matthewstevenson88

Is there anything in your failing test that indicates that the problem is related to ALTS?

Not that I know of, but I also don't know what that stands for.

Or is the similarity just that the failure logs look similar?

I had just noticed the leak referenced NewCallbackSerializer which was also in the call stack of this issue. I believe you should be able to view our test logs in the link, but I posted the full context we had from our leak checker, at least for this specific test(many failed do to what I assume is one issue.)

codyoss avatar Jun 26 '23 17:06 codyoss

Thanks @codyoss. Could you open a separate issue for this failure to better get the attention of the gRPC-Go team? I don't want the leak you noticed to get buried in this flaky test issue, which I think is unrelated.

matthewstevenson88 avatar Jun 26 '23 20:06 matthewstevenson88

@matthewstevenson88 done.

codyoss avatar Jun 26 '23 20:06 codyoss

Closing this issue since #6300 is merged.

matthewstevenson88 avatar Jun 26 '23 22:06 matthewstevenson88