grpcbox icon indicating copy to clipboard operation
grpcbox copied to clipboard

Race condition when closing a stream via `grpcbox_client:close_and_recv/1`

Open davisp opened this issue 4 years ago • 3 comments

I've got a bit of a conundrum trying to chase down a spurious error in a test suite. I'm occasionally getting timeout back from grpcbox_client:close_and_recv/1 instead of the expected stream_finished.

The code between grpcbox and chatterbox is a bit hard to follow. But near as I can tell, what's happening is that we're receiving the eos message at [1] and then there's a race to get to [2] where most of the time is_process_alive/1 returns false although occasionally it can return true which leads to us getting a timeout.

At this point I think I'm just going to remove my assertion on the return of grpcbox_client:close_and_recv/1 because I can't figure out an appropriate patch given how we demonitor when receiving the eos message. The only thing I can think of is to have a new grpcbox_client_stream:recv_end/2 call that ends up returning any messages in the mailbox until the process is dead or something? But that seemed not quite right so I figured I'd just open the issue to see if anyone else had any better ideas.

[1] https://github.com/tsloughter/grpcbox/blob/v0.9.1/src/grpcbox_client.erl#L197-L199 [2] https://github.com/tsloughter/grpcbox/blob/master/src/grpcbox_client_stream.erl#L117-L122

davisp avatar Aug 20 '19 16:08 davisp

Hey, I'll dig into this soon, I've been wanting/needing to work on the client some more. I'll get back to you on this specific issue first. The client is quirky, I don't like how it is sending messages to the process that made the request, it should maybe be an optional setting if it is really desired to receive messages this way.

tsloughter avatar Aug 20 '19 17:08 tsloughter

I think there is likely a hack fix to this that I can do before I full refactoring of the client.

Since the stream has ended it is assumed the process will be gone by the time it checks if the process is alive. I can change that recv with a 0 timeout to be either a different function or include an attribute to signal it is after the stream has ended.

tsloughter avatar Sep 01 '19 14:09 tsloughter

Yeah, my only thought was to have a separate function there that would add a timeout to the recv since we're expecting it to be dead. I just don't know enough of the details to know how much of recv_msg would be normally be expected to possibly execute in those circumstances and what would more likely constitute an error.

davisp avatar Sep 04 '19 15:09 davisp