grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

Document that `ClientStream.SendMsg()` returns `nil` on success for unary/server-streaming RPCs

Open ash2k opened this issue 1 year ago • 14 comments

Use case(s) - what problem will this feature solve?

When a unary RPC is called, this code gets executed:

https://github.com/grpc/grpc-go/blob/59954c80165855ecd478a79e75bcc7d64f629ded/call.go#L65-L74

SendMsg() docs say this:

https://github.com/grpc/grpc-go/blob/59954c80165855ecd478a79e75bcc7d64f629ded/stream.go#L112-L115

I have a test for something unrelated where the server immediately returns an error. Very rarely, depending on the timing my test gets EOF instead of the error that the server returns. This is because server is occasionally faster than the client. It occurred to me that this is an issue for all unary RPCs - there is no way to get the actual error as the stream is not exposed to the user.

Proposed Solution

Check for io.EOF from SendMsg() and call RecvMsg() to get the actual error, return it.

Alternatives Considered

Leave it as is. This might be a requirement to be consistent with other implementations (I have not checked). Perhaps gRPC spec needs to be adjusted/clarified to change the behavior in this case.

Additional Context

gRPC v1.63.2. Go 1.22.3

ash2k avatar May 13 '24 11:05 ash2k

Yes, I think this is just an oversight in our code here. If SendMsg returns io.EOF, we should call RecvMsg and use the error there. I believe it's possible for the RPC to end successfully even if the message can't be sent, so if RecvMsg returns nil, then the RPC should end with a success. There should definitely be a test for this. The server handler needs to be a streaming RPC handler that immediately ends the RPC (both success & failure should be tested). I'm not sure if we can inject something into the connection to make sends from the client to the server slower in order to stimulate the race more reliably, but that's one possible option.

dfawley avatar May 13 '24 15:05 dfawley

To be clear, I don't have bandwidth to work on the fix myself.

ash2k avatar May 17 '24 00:05 ash2k

No problem! This is rare but probably important enough that we should prioritize this internally. Also the fix is probably quite easy.

dfawley avatar May 17 '24 01:05 dfawley

@dfawley Hi, is it possible for me pick this one up? Looks like we need to get the following done (please correct me if I'm wrong):

  1. create a test that can reliably reproduce the scenario where SendMsg returns io.EOF (server closes connection before client actually sends?) and the client doesn't get the expected actual error

  2. implement the proposed solution and make sure it make the test pass

Kailun2047 avatar May 17 '24 01:05 Kailun2047

@Kailun2047 Sure, that sounds good to me, as long as you think you can work on it pretty soon.

For the repro, I wonder if we can reproduce it by making a large request, and configure the server with a very small initial window size. That way the client won't be able to send the full request as long as the server handler treats the method as streaming. That should make it 100% reproducible, and be fairly simple to do.

dfawley avatar May 17 '24 01:05 dfawley

@dfawley Got it. Will try to delve in and make an update here within a day or two.

Kailun2047 avatar May 17 '24 01:05 Kailun2047

@dfawley Hi, I tried creating a test per earlier discussion to reproduce but had no luck :sweat_smile: On StubServer, SendMsg always succeeds even when the request payload size is increased to the max allowable (4194304).

But upon some initial inspection, looks like the EOF reported might not originate from SendMsg. Concretely, for invocation of an unary RPC,

Instead, it appears to me that the source of the EOF might be a different issue in RecvMsg. Concretely,

  • recvMsg might return EOF to indicate end of stream

  • the upper level RecvMsg gets the EOF returned and calls finish with it. In finish we convert EOF to nil, but the conversion isn't likely to be reflected in Recv because the original error is passed by value instead of by reference. Therefore it's possible for RecvMsg to return EOF.

I'm definitely willing to put more investigation into this, but let me know if you are concerned about the timing and need to prioritize internally :smile:

Kailun2047 avatar May 20 '24 06:05 Kailun2047

Oh interesting. I thought it was something simpler, but you're right - some of the streaming code actually checks the type of the RPC and adjusts its behavior accordingly. If you're able to get to the bottom of this, we would definitely appreciate it. No worries on the timing at all - I don't think it's as urgent if it's a pretty rare occurrence.

dfawley avatar May 21 '24 15:05 dfawley

@ash2k Hi - since we now might have a bigger space to search for the potential source of the EOF, is it possible to share more info about the test mentioned in OP? Specifically, are there any ServerOption/DialOption/CallOption being used there? Any other info will also be appreciated. Thanks!

The existing gRPC tests around Invoke (in which server also return error immediately) don't seem to help reproduce so I figure there could be something missing.

Kailun2047 avatar May 30 '24 01:05 Kailun2047

Without going into details, my code emulates a unary RPC over a streaming RPC. That's probably why you cannot reproduce the issue.

when sending message with the client stream, the stream will not return EOF for non-client-streaming RPCs

This means code in invoke() relies on an undocumented behavior that EOF is not returned from SendMsg() if it's a unary RPC. I see two options:

  • This behavior should be documented as part of the SendMsg() doc.
  • Special behavior that you linked to can be removed and code in invoke() should handle the general documented behavior of SendMsg(). It's a matter of moving the if.

ash2k avatar May 30 '24 02:05 ash2k

@dfawley Whoops, turned out initially I had some misunderstanding on the issue :sweat_smile: But I think I get it now (thanks to @ash2k 's response):

  1. emulating the unary RPC client's SendMsg & RecvMsg sequence on a streaming RPC client doesn't quite work (repro), because an additional check for EOF from SendMsg is needed
  2. the SendMsg & RecvMsg sequence does work for unary RPC (i.e. invoke) though, because of the behavior SendMsg has on non-streaming RPC type
  3. we might want to adjust documentation/implementation of SendMsg per suggested by @ash2k to avoid potential confusion (in our case it makes user suspect that invoke doesn't handle return value of SendMsg properly)

Sounds like documentation update is an easier option, but let me know if there's more to considerate :smiley:

Kailun2047 avatar Jun 03 '24 14:06 Kailun2047

I'm leaning towards getting rid of the special handling of non-streaming RPCs in sendMsg and handling the io.EOF in invoke. This would unify the behaviour for both kinds of RPCs and make the function easier to reason about.

The special handling in sendMsg has been there for for quite some time and the comment was changed 6 years ago in https://github.com/grpc/grpc-go/pull/1854

So I'm concerned about accidentally changing any user facing behaviour. Let me discuss with the team and share an update.

arjan-bal avatar Jun 05 '24 09:06 arjan-bal

Assigning to @dfawley to comment on the preferred fix. Since he's presently on vacation, expect a reply by end of the week.

arjan-bal avatar Jun 18 '24 18:06 arjan-bal

Returning io.EOF here would break existing code quite badly - our generated code relies upon SendMsg not returning an error on success. So we should update the docs to say something like: for ClientStreams=false streams, nil is returned unconditionally.

dfawley avatar Jun 27 '24 17:06 dfawley