grpc-go
grpc v1.46+ Dial failure when connecting over an existing connection
Please see the FAQ in our main README.md, then answer the questions below before submitting your issue.
What version of gRPC are you using?
1.46+
What version of Go are you using (go version)?
1.17
What operating system (Linux, Windows, …) and version?
linux
What did you do?
We have a solution that uses an existing websocket connection to initiate a grpc connection that works in the following way:
- server A dials to Server B, establishing a websocket connection
- server B uses this connection to do a grpc dial to Server A
The code roughly looks like this:
var wsConn net.Conn
// initialize wsConn...
grpcConn, err := grpc.DialContext(ctx, "", []grpc.DialOption{
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
		return wsConn, nil
	}),
}...)
Up until (but not including) v1.46 this worked perfectly. Starting with v1.46, the gRPC connection enters TRANSIENT_FAILURE, which seems to happen after a timeout.
What did you expect to see?
Here are logs of a working connection in version 1.45:
2022/08/08 11:23:46 INFO: [core] original dial target is: ""
2022/08/08 11:23:46 INFO: [core] parsed dial target is: {Scheme: Authority: Endpoint: URL:{Scheme: Opaque: User: Host: Path: RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022/08/08 11:23:46 INFO: [core] fallback to scheme "passthrough"
2022/08/08 11:23:46 INFO: [core] parsed dial target is: {Scheme:passthrough Authority: Endpoint: URL:{Scheme:passthrough Opaque: User: Host: Path:/ RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022/08/08 11:23:46 INFO: [core] Channel authority set to ""
2022/08/08 11:23:46 INFO: [core] ccResolverWrapper: sending update to cc: {[{ <nil> <nil> 0 <nil>}] <nil> <nil>}
2022/08/08 11:23:46 INFO: [core] ClientConn switching balancer to "pick_first"
2022/08/08 11:23:46 INFO: [core] Channel switches to new LB policy "pick_first"
2022/08/08 11:23:46 INFO: [core] Subchannel Connectivity change to CONNECTING
2022/08/08 11:23:46 INFO: [core] Subchannel picks a new address "" to connect
2022/08/08 11:23:46 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc000535740, {CONNECTING <nil>}
2022/08/08 11:23:46 INFO: [core] Channel Connectivity change to CONNECTING
2022/08/08 11:23:51 INFO: [core] Subchannel Connectivity change to READY
2022/08/08 11:23:51 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc000535740, {READY <nil>}
2022/08/08 11:23:51 INFO: [core] Channel Connectivity change to READY
What did you see instead?
Here's what happens in v1.46:
2022/08/08 11:26:31 INFO: [core] [Channel #2] Channel created
2022/08/08 11:26:31 INFO: [core] [Channel #2] original dial target is: ""
2022/08/08 11:26:31 INFO: [core] [Channel #2] parsed dial target is: {Scheme: Authority: Endpoint: URL:{Scheme: Opaque: User: Host: Path: RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022/08/08 11:26:31 INFO: [core] [Channel #2] fallback to scheme "passthrough"
2022/08/08 11:26:31 INFO: [core] [Channel #2] parsed dial target is: {Scheme:passthrough Authority: Endpoint: URL:{Scheme:passthrough Opaque: User: Host: Path:/ RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022/08/08 11:26:31 INFO: [core] [Channel #2] Channel authority set to ""
2022/08/08 11:26:31 INFO: [core] [Channel #2] Resolver state updated: {
"Addresses": [
{
"Addr": "",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}
],
"ServiceConfig": null,
"Attributes": null
} (resolver returned new addresses)
2022/08/08 11:26:31 INFO: [core] [Channel #2] Channel switches to new LB policy "pick_first"
2022/08/08 11:26:31 INFO: [core] [Channel #2 SubChannel #3] Subchannel created
2022/08/08 11:26:31 INFO: [core] [Channel #2 SubChannel #3] Subchannel Connectivity change to CONNECTING
2022/08/08 11:26:31 INFO: [core] [Channel #2 SubChannel #3] Subchannel picks a new address "" to connect
2022/08/08 11:26:31 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc0002d8de0, {CONNECTING <nil>}
2022/08/08 11:26:31 INFO: [core] [Channel #2] Channel Connectivity change to CONNECTING
2022/08/08 11:26:51 WARNING: [core] [Channel #2 SubChannel #3] grpc: addrConn.createTransport failed to connect to {
"Addr": "",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}: failed to receive server preface within timeout
2022/08/08 11:26:51 INFO: [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022/08/08 11:26:51 INFO: [core] [Channel #2 SubChannel #3] Subchannel Connectivity change to TRANSIENT_FAILURE
failed to receive server preface within timeout
This means the server (A, in this case) isn't sending an HTTP/2 settings frame at the start of the connection for some reason. From looking at commits, I can't see anything in the 1.46 release that could have caused this kind of regression. Do you have a reproducible test case that I could run?
No problem, I'll create one and post it here, thank you!
Hey @dfawley, here (https://gist.github.com/noamApps/00946dd8a382d8e7b3778d8a407b7e1a) is a gist containing a setup that demonstrates the problem by printing the gRPC connection state; I added instructions as well. It is a simplified version of this demo: https://github.com/glerchundi/grpc-boomerang
Thanks for the repro case.
I have identified your bug: your singleListener's Addr() method never returns, because conn is uninitialized when Addr() is called before the first Accept call (which is what gRPC-Go does as of 1.46). If Addr() returns a valid net.Addr instead, everything works, e.g. by changing the code as follows:
	l := &singleListener{
		conn: <-sockets, /* was: newConn: sockets */
	}
	grpcServer := grpc.NewServer()
	grpcServer.Serve(l)
}

func (s *singleListener) Accept() (net.Conn, error) {
	// removed: s.conn = <-s.newConn
	return s.conn, nil
}
Let me know if you have any questions.
Hey @dfawley, thank you for looking into it. Indeed, the connection state is now READY. However, the moment I try to use the established gRPC connection with a simple Hello service, I get the following error (in the gRPC server component):
WARNING: [core] [Server #1] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"\\x00\\x00\\x00\\x04\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x04\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x04\\x01\\x00\""
WARNING: [transport] transport: http2Server.HandleStreams failed to read frame: io: read/write on closed pipe
followed by
WARNING: [core] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: io: read/write on closed pipe"
(endlessly)
The weird thing is that this also happens with grpc 1.45, so I think there is some side effect of the suggested socket change that I can't quite understand (yet). Any ideas? I can send an updated gist if needed that adds the Hello functionality.
Again, thanks for the help so far.