grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

grpc v1.46+ Dial failure when connecting over an existing connection

Open noamApps opened this issue 3 years ago • 0 comments

Please see the FAQ in our main README.md, then answer the questions below before submitting your issue.

What version of gRPC are you using?

1.46+

What version of Go are you using (go version)?

1.17

What operating system (Linux, Windows, …) and version?

linux

What did you do?

We have a solution that uses an existing websocket connection to initiate a grpc connection that works in the following way:

  • server A dials to Server B, establishing a websocket connection
  • server B uses this connection to do a grpc dial to Server A

The code roughly looks like this:

var wsConn net.Conn
// initialize wsConn...
grpcConn, err := grpc.DialContext(ctx, "", []grpc.DialOption{
    grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
        return wsConn, nil
    }),
}...,)

up until (but not including) v1.46 this worked perfectly, starting v1.46, this worked perfectly, starting with version v1.46, the grpc connection fails with TRANSIENT_FAILURE message, that seems to be happening after a timeout

What did you expect to see?

here are logs of a working connection in version 1.45

2022/08/08 11:23:46 INFO: [core] original dial target is: ""
2022/08/08 11:23:46 INFO: [core] parsed dial target is: {Scheme: Authority: Endpoint: URL:{Scheme: Opaque: User: Host: Path: RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022/08/08 11:23:46 INFO: [core] fallback to scheme "passthrough"
2022/08/08 11:23:46 INFO: [core] parsed dial target is: {Scheme:passthrough Authority: Endpoint: URL:{Scheme:passthrough Opaque: User: Host: Path:/ RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022/08/08 11:23:46 INFO: [core] Channel authority set to ""
2022/08/08 11:23:46 INFO: [core] ccResolverWrapper: sending update to cc: {[{  <nil> <nil> 0 <nil>}] <nil> <nil>}
2022/08/08 11:23:46 INFO: [core] ClientConn switching balancer to "pick_first"
2022/08/08 11:23:46 INFO: [core] Channel switches to new LB policy "pick_first"
2022/08/08 11:23:46 INFO: [core] Subchannel Connectivity change to CONNECTING
2022/08/08 11:23:46 INFO: [core] Subchannel picks a new address "" to connect
2022/08/08 11:23:46 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc000535740, {CONNECTING <nil>}
2022/08/08 11:23:46 INFO: [core] Channel Connectivity change to CONNECTING
2022/08/08 11:23:51 INFO: [core] Subchannel Connectivity change to READY
2022/08/08 11:23:51 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc000535740, {READY <nil>}
2022/08/08 11:23:51 INFO: [core] Channel Connectivity change to READY

What did you see instead?

here's what happens in v1.46

2022/08/08 11:26:31 INFO: [core] [Channel #2] Channel created
2022/08/08 11:26:31 INFO: [core] [Channel #2] original dial target is: ""
2022/08/08 11:26:31 INFO: [core] [Channel #2] parsed dial target is: {Scheme: Authority: Endpoint: URL:{Scheme: Opaque: User: Host: Path: RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022/08/08 11:26:31 INFO: [core] [Channel #2] fallback to scheme "passthrough"
2022/08/08 11:26:31 INFO: [core] [Channel #2] parsed dial target is: {Scheme:passthrough Authority: Endpoint: URL:{Scheme:passthrough Opaque: User: Host: Path:/ RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022/08/08 11:26:31 INFO: [core] [Channel #2] Channel authority set to ""
2022/08/08 11:26:31 INFO: [core] [Channel #2] Resolver state updated: {
  "Addresses": [
    {
      "Addr": "",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Type": 0,
      "Metadata": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
2022/08/08 11:26:31 INFO: [core] [Channel #2] Channel switches to new LB policy "pick_first"
2022/08/08 11:26:31 INFO: [core] [Channel #2 SubChannel #3] Subchannel created
2022/08/08 11:26:31 INFO: [core] [Channel #2 SubChannel #3] Subchannel Connectivity change to CONNECTING
2022/08/08 11:26:31 INFO: [core] [Channel #2 SubChannel #3] Subchannel picks a new address "" to connect
2022/08/08 11:26:31 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc0002d8de0, {CONNECTING <nil>}
2022/08/08 11:26:31 INFO: [core] [Channel #2] Channel Connectivity change to CONNECTING
2022/08/08 11:26:51 WARNING: [core] [Channel #2 SubChannel #3] grpc: addrConn.createTransport failed to connect to {
  "Addr": "",
  "ServerName": "",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}: failed to receive server preface within timeout
2022/08/08 11:26:51 INFO: [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022/08/08 11:26:51 INFO: [core] [Channel #2 SubChannel #3] Subchannel Connectivity change to TRANSIENT_FAILURE

noamApps avatar Aug 08 '22 11:08 noamApps

failed to receive server preface within timeout

This means the server (A, in this case) isn't sending an HTTP/2 settings frame at the start of the connection for some reason. From looking at commits, I can't see anything in the 1.46 release that could have caused this kind of regression. Do you have a reproducible test case that I could run?

dfawley avatar Aug 16 '22 20:08 dfawley

No problem to create one, I'll do it and post it here, thank you!

noamApps avatar Aug 17 '22 12:08 noamApps

Hey @dfawley, Here (https://gist.github.com/noamApps/00946dd8a382d8e7b3778d8a407b7e1a) is a gist containing a setup that demonstrate the problem by printing the grpc connection state, added instructions as well, it is a simplified version of this demo: https://github.com/glerchundi/grpc-boomerang

noamApps avatar Aug 17 '22 19:08 noamApps

Thanks for the repro case.

I have identified your bug: your singleListener's Addr() method never returns, because conn is uninitialized when called before the first Accept call (which is what we do in gRPC-Go as of 1.46). If this returns a valid net.Addr instead, everything works. E.g. by changing the following to this:

	l := &singleListener{
		conn: <-sockets,  /* was: newConn: sockets */
	}
	grpcServer := grpc.NewServer()
	grpcServer.Serve(l)
}


func (s *singleListener) Accept() (net.Conn, error) {
	// removed: s.conn = <-s.newConn
	return s.conn, nil
}

Let me know if you have any questions.

dfawley avatar Aug 17 '22 20:08 dfawley

Hey @dfawley Thank you for looking into it and indeed the connection state is now ready, however, the moment I try to use the established grpc connection with a simple Hello Service, I get the following error (in the grpc server component)

WARNING: [core] [Server #1] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"\\x00\\x00\\x00\\x04\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x04\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x04\\x01\\x00\""
WARNING: [transport] transport: http2Server.HandleStreams failed to read frame: io: read/write on closed pipe

followed by

WARNING: [core] grpc: Server.Serve failed to create ServerTransport:  connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: io: read/write on closed pipe"
(endlessly)

The weird thing is that it is happening with grpc 1.45, so I think there is some sort of side-effect to the socket change suggested that I can't quite understand (yet), any ideas? I can send and updated gist if needed that adds the Hello functionality.

Again, thanks for the help so far.

noamApps avatar Aug 21 '22 14:08 noamApps