
Deadlock with two communicating endpoints


I'm trying to use gRPC to implement endpoints that can both receive and send streams of ByteStrings (i.e. Streams from the streaming library). Each endpoint is a gRPC Server and creates a new Client whenever it needs to send a stream of messages.
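For concreteness, here is a minimal sketch of the kind of adapters involved (sendAll, receiveAll and the per-message send/receive actions are hypothetical names; only Stream (Of ByteString) IO () and Streaming.Prelude come from the streaming library):

import Data.ByteString (ByteString)
import Streaming (Of, Stream)
import qualified Streaming.Prelude as S

-- Hypothetical adapters between per-message IO actions (as a gRPC binding
-- would provide) and the streaming library's Stream type.
sendAll :: (ByteString -> IO ()) -> Stream (Of ByteString) IO () -> IO ()
sendAll send = S.mapM_ send

-- Assumes recv returns Nothing once the peer has finished sending.
receiveAll :: IO (Maybe ByteString) -> Stream (Of ByteString) IO ()
receiveAll recv = S.untilRight (maybe (Right ()) Left <$> recv)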

As I couldn't fit your high-level API to my needs, I used the lower-level bindings. However, I get a strange deadlock when trying to make two endpoints communicate. The "server" blocks on serverRequestCall while the "client" blocks on recvInitialMetadata.

I am not sure I used your bindings exactly right. Here is a minimal example that reproduces my bug: https://gist.github.com/OlivierNicole/bccf2d7c4306a93a869bf0521e4c952b.

The output with debug activated is:

[ThreadId 5]: startServer: server CQ: CompletionQueue 0x00007f05280018c0
[ThreadId 6]: startServer: server CQ: CompletionQueue 0x00007f0520001b40
server started.
server started.
receive: called.
[ThreadId 6]: serverRequestCall(R): got pluck permission, registering call for tag=Tag {unTag = 0x8000000000000000}
client created.
[ThreadId 5]: channelCreateCall: call with Channel 0x00007f0520002480 Call 0x0000000000000000 PropagationMask {unPropagationMask = 65535} CompletionQueue 0x00007f0520002c70 CallHandle {unCallHandle = 0x00007f0520000bf0} 0x00007f0520000bd0
[ThreadId 6]: pluck: blocking on grpc_completion_queue_pluck for tag=Tag {unTag = 0x8000000000000000}
[ThreadId 5]: runOps: allocated op contexts: [OpSendInitialMetadataContext (MetadataKeyValPtr 0x00007f0520000bd0) 0]
[ThreadId 5]: runOps: tag: Tag {unTag = 0x8000000000000000}
[ThreadId 5]: startBatch: calling grpc_call_start_batch with pointers: Call 0x00007f0520002eb0 OpArray 0x00007f0520002330
[ThreadId 5]: startBatch: grpc_call_start_batch call returned.
[ThreadId 5]: runOps: called start_batch. callError: Right ()
[ThreadId 5]: pluck: called with tag=Tag {unTag = 0x8000000000000000},mwait=Nothing
[ThreadId 5]: pluck: blocking on grpc_completion_queue_pluck for tag=Tag {unTag = 0x8000000000000000}
[ThreadId 5]: pluck finished: Event {eventCompletionType = OpComplete, eventSuccess = True, eventTag = Tag {unTag = 0x8000000000000000}}
[ThreadId 5]: runOps: pluck returned Right ()
[ThreadId 5]: runOps: got good op; starting.
[ThreadId 5]: resultFromOpContext: saw non-result op type.
client sent initial metadadata.
send: receiveLock taken.
[ThreadId 5]: runOps: allocated op contexts: [OpRecvInitialMetadataContext 0x00007f0520000bd0]
[ThreadId 5]: runOps: tag: Tag {unTag = 0x8000000000000001}
[ThreadId 5]: startBatch: calling grpc_call_start_batch with pointers: Call 0x00007f0520002eb0 OpArray 0x00007f0520002330
[ThreadId 5]: startBatch: grpc_call_start_batch call returned.
[ThreadId 5]: runOps: called start_batch. callError: Right ()
[ThreadId 5]: pluck: called with tag=Tag {unTag = 0x8000000000000001},mwait=Nothing
[ThreadId 5]: pluck: blocking on grpc_completion_queue_pluck for tag=Tag {unTag = 0x8000000000000001}

[[ hangs forever... ]]

Is there an obvious mistake that I have made? (I am particularly unsure about the way I do the metadata exchange.)

OlivierNicole avatar Jul 07 '17 10:07 OlivierNicole

I would strongly discourage using the low-level interface directly. It exposes implementation details of the core gRPC library that you don't get access to in other languages' bindings (so you shouldn't need access to those details). If your design can't be expressed with our high-level bindings, that probably means we need to fix our high-level bindings to be more expressive. Could you go into more detail about the difficulties you encountered with the high-level modules?

If you do want to use the low-level bindings, definitely try to use the clientRW and serverRW functions in Network.GRPC.LowLevel.Client and Network.GRPC.LowLevel.Server rather than sending ops manually. These functions handle the op ordering for you. There are also similar functions for all the streaming/non-streaming combinations. I wouldn't recommend even importing the LowLevel.Op module. The op system in gRPC looks flexible, as if it would let you compose your own calls, but in practice only a handful of op combinations don't lead to deadlock. It's very brittle. I know there were some issues in the core gRPC repo about fixing this, but I'm not sure if anything has been done yet.

If you really want to order ops yourself, you can look at the gRPC core tests for examples of orderings that work. It can be very time-consuming to debug these problems, though!

crclark avatar Jul 07 '17 15:07 crclark

Thank you for your answer. Using the low-level operations was indeed time-consuming.

Initially, I was trying to build a notion of "session" between two endpoints, and I was implementing it on top of bidirectionally streaming requests. This meant handling the different phases of a request's life cycle (creation, communication, destruction) in different places in my code, which the high-level bindings do not permit.

However, I have simplified my specifications since then and found a way to make it work with the higher-level API.

OlivierNicole avatar Jul 11 '17 08:07 OlivierNicole

Actually, I did have to use withServer and withClient from Network.GRPC.LowLevel, since the higher-level functions (serverLoop and withGRPCClient, respectively) each assume gRPC has not been started yet and initialize it themselves, which makes it impossible to run several clients and servers in the same program. Once gRPC has been initialized, subsequent calls to grpc_init probably have no effect, but to be safe I had to run in a reader monad and use the "lower-level" functions.
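A minimal sketch of that structure, assuming Network.GRPC.LowLevel exports withGRPC, withServer and withClient as brackets with the shapes used below, and leaving the configurations and handlers as parameters (runEndpoint is a hypothetical name; concurrently_ comes from the async package):

import Control.Concurrent.Async (concurrently_)
import Network.GRPC.LowLevel

-- Initialize gRPC once and share the handle, so a server and a client can
-- run side by side in the same process.
runEndpoint :: ServerConfig -> ClientConfig
            -> (Server -> IO ()) -> (Client -> IO ())
            -> IO ()
runEndpoint serverConf clientConf serve talk =
  withGRPC $ \grpc ->
    concurrently_
      (withServer grpc serverConf serve)
      (withClient grpc clientConf talk)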

Edit: clarification.

OlivierNicole avatar Jul 11 '17 15:07 OlivierNicole