Client returns `context.Cancelled` and later on sends the request
What version of gRPC are you using?
https://github.com/grpc/grpc-go/commit/761c084e5ac8669104533efa7c614aa1f881c96a
What version of Go are you using (`go version`)?
go version go1.19.3 darwin/amd64
What operating system (Linux, Windows, …) and version?
MacOS
What did you do?
We noticed that sometimes the go-grpc client returns an error (context.Cancelled) but later on still puts the request on the network. It seems this was discussed in https://github.com/grpc/grpc-go/issues/4371, but that issue only considers the following scenarios:
- The cancellation happens before the message is put on the network => Server does not receive the request
- The cancellation happens after the message is put on the network => Server receives and processes the request
Those scenarios make sense, as the request can be cancelled before or after it was already sent (put on the wire), and there is no way to guarantee that a request cancelled on the client side will not be processed on the server side.
But there is a third scenario not discussed in the original issue:
- The cancellation happens before the message is put on the network, and the client puts the message on the network anyway AFTER returning an error to the caller.
This seems to be very unexpected/weird behavior. I would expect that if the client returns an error before the request is sent over the network, the request should not be sent later on.
We can see this happening in the following unit test:
```go
package test

import (
	"context"
	"testing"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/encoding"
	"google.golang.org/grpc/internal/stubserver"

	// In older grpc-go versions this package lived at
	// google.golang.org/grpc/test/grpc_testing.
	testpb "google.golang.org/grpc/interop/grpc_testing"
)

// CustomCodec forwards to the proto codec, except for SimpleRequest, where it
// returns the payload body directly (i.e. a slice aliasing the request).
type CustomCodec struct {
	protoCodec encoding.Codec
}

func (c *CustomCodec) Marshal(v interface{}) ([]byte, error) {
	if r, ok := v.(*testpb.SimpleRequest); ok {
		time.Sleep(10 * time.Millisecond) // slow marshaling so the deadline can fire first
		return r.Payload.Body, nil
	}
	return c.protoCodec.Marshal(v)
}

func (c *CustomCodec) Unmarshal(data []byte, v interface{}) error {
	if r, ok := v.(*testpb.SimpleRequest); ok {
		r.Payload = &testpb.Payload{
			Body: data,
		}
		return nil
	}
	return c.protoCodec.Unmarshal(data, v)
}

func (c *CustomCodec) Name() string {
	return "mycodec"
}

func Test_ContextCancelled(t *testing.T) {
	encoding.RegisterCodec(&CustomCodec{protoCodec: encoding.GetCodec("proto")})
	ss := &stubserver.StubServer{
		UnaryCallF: func(ctx context.Context, in *testpb.SimpleRequest) (*testpb.SimpleResponse, error) {
			if string(in.Payload.Body) != "right" {
				t.Errorf("Received wrong body %v\n", string(in.Payload.Body))
			}
			return &testpb.SimpleResponse{Payload: &testpb.Payload{Body: in.Payload.Body}}, nil
		},
	}
	err := ss.Start([]grpc.ServerOption{})
	if err != nil {
		t.Fatalf("Error starting endpoint server: %v", err)
	}
	defer ss.Stop()
	for i := 0; i < 100; i++ {
		req := &testpb.SimpleRequest{
			Payload: &testpb.Payload{
				Type: testpb.PayloadType_COMPRESSABLE,
				Body: []byte("right"),
			},
		}
		ctx, _ := context.WithTimeout(context.Background(), 10*time.Millisecond)
		ss.Client.UnaryCall(ctx, req, grpc.CallContentSubtype("mycodec"))
		// Reuse the request after the call returned with an error; if the
		// message is still written later, the server sees "wrong".
		copy(req.Payload.Body, "wrong")
	}
}
```
What did you expect to see?
If the client returns an error before sending the request over the network, it should not send it later on (this is different from receiving a timeout on the client side AFTER the request was sent over the network, where the server may still process the request).
What did you see instead?
The client returns an error and later on sends the request (we can see that in the test: we modify the request after `ss.Client.UnaryCall` returns, and we receive the modified request on the server side).
I don't understand the significance of what you're talking about. Context cancelation or deadline expiration are asynchronous events from the write to the network. So it's always possible that the two will race, and your system needs to be designed to be resilient to this possibility. Maybe I'm missing something, though.
Yes, these are two asynchronous events: context cancelation/deadline expiration on one side, and writing data to the network on the other. Thus, there is no way to prevent scenario #3 from occurring, even if preventing it might be beneficial to the system.
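To make the sequencing concrete, here is a minimal, self-contained sketch (not grpc-go code; all names are invented for illustration) of the shape of the race: the call path gives up as soon as the context is done, while a separate writer goroutine drains a queue and may still put the already-queued, possibly reused, buffer on the "wire" afterwards.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// call hands the message to the writer queue and then waits on the context,
// much like a unary call waiting for its response.
func call(ctx context.Context, queue chan<- []byte, msg []byte) error {
	queue <- msg // the message is queued before the deadline fires
	select {
	case <-ctx.Done():
		return ctx.Err() // the caller sees a context error...
	case <-time.After(time.Hour): // stand-in for "wait for the response"
		return nil
	}
}

func main() {
	queue := make(chan []byte, 1)
	wire := make(chan []byte, 1)

	// Writer goroutine: drains the queue onto the "wire", slightly later.
	go func() {
		for msg := range queue {
			time.Sleep(5 * time.Millisecond) // the write happens "later on"
			wire <- msg
		}
	}()

	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()

	msg := []byte("right")
	err := call(ctx, queue, msg)
	fmt.Println("call returned:", err) // ...but the message is already queued

	copy(msg, "wrong")                          // the caller reuses the buffer after the error
	fmt.Println("on the wire:", string(<-wire)) // typically prints "wrong"
}
```

Typically this prints a context error from `call` and then "wrong" on the wire, which is essentially scenario #3 above.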
Hi, thanks all for looking into this.
This is indeed not a problem in most circumstances, but in certain scenarios some implementations may decide to reuse the request/slice objects after the client returns an error/success, and the observed behaviour can be problematic.
We already check whether the context is cancelled in many places along the code path (e.g. here), but it's not checked again after "dequeueing" the writes in the http2Client. Looking at the code, it seems we could check again whether the context is cancelled after copying the data to a local buffer here and prevent this behavior. I can try to open a PR accordingly.
So it's always possible that the two will race, and your system needs to be designed to be resilient to this possibility
If this behavior is by design and we decide not to change it, I would say it should be documented more precisely, as assuming the request can be reused after the client returns an error seems to be a fair assumption. Looking at the codec documentation, for instance, it only mentions that the implementation should be thread safe, but it does not mention that the returned []byte should be a deep copy / single use. If we are using the "legacy" marshaling this seems even easier to hit, since we can define the `Marshal` method on the object itself, e.g.:

```go
func (m *SimpleRequest) Marshal() (dAtA []byte, err error) {
	....
}
```
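As an application-side workaround for codecs that return buffers aliasing the request (like the CustomCodec in the test above), a wrapper can hand the transport its own copy of the marshaled bytes, so even a late write goes out with the right content. A minimal sketch, using only the public encoding.Codec / RegisterCodec / GetCodec APIs; the copyCodec type and the "proto-copy" name are invented here:

```go
package copycodec

import (
	"google.golang.org/grpc/encoding"
	_ "google.golang.org/grpc/encoding/proto" // registers the default "proto" codec
)

// copyCodec wraps another codec and returns a private copy of the marshaled
// bytes, so reusing or mutating the request after the RPC returns cannot
// change what is eventually written.
type copyCodec struct {
	inner encoding.Codec
}

func (c copyCodec) Marshal(v interface{}) ([]byte, error) {
	b, err := c.inner.Marshal(v)
	if err != nil {
		return nil, err
	}
	out := make([]byte, len(b))
	copy(out, b) // snapshot the bytes; the caller keeps ownership of its buffers
	return out, nil
}

func (c copyCodec) Unmarshal(data []byte, v interface{}) error {
	return c.inner.Unmarshal(data, v)
}

func (c copyCodec) Name() string { return "proto-copy" }

func init() {
	encoding.RegisterCodec(copyCodec{inner: encoding.GetCodec("proto")})
}
```

A call would then opt in with `grpc.CallContentSubtype("proto-copy")`, with the same codec registered on the server side.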
@dfawley Any thoughts on this? Do you think it makes sense to re-check the context when "dequeueing" the write on the HTTP/2 client side?
This is WAI. The reader and writer goroutines of the controlbuf communicate with each other, and https://github.com/grpc/grpc-go/blob/6578ef72240d14f83140602042a3ed9b4faa9b2f/internal/transport/controlbuf.go#L537 shows the error handling needed in the writer and reader. This is all part of the guarantee of reads and writes on the stream. Why is it a problem to reuse the memory?
// Upon exiting, if the error causing the exit is not an I/O error, run() flushes and closes the underlying connection. Otherwise, the connection is left open to allow the I/O error to be encountered by the reader instead.
I don't see how this comment is related to the problem. The "unexpected" behavior is not triggered by an I/O error per se but by a timeout (context.Cancelled), which may or may not be related to an I/O error.
Why is it a problem to reuse the memory?
The problem is that the gRPC client can still use the memory asynchronously after returning the error to the caller.
Right, but what I was getting at is that you queue the error on the buffer, and thus can't return early in some cases where you want the reader/writer to hit, say, io.EOF. What technical problems arise from using memory asynchronously after returning an error? The component is not closed, it has simply returned an error. This behavior sounds like something that happens in many components (you call a function on a component and that function returns an error, and that component still continues to use memory...it's not destructed/cleaned up yet).
The component is not closed, it has simply returned an error. This behavior sounds like something that happens in many components (you call a function on a component and that function returns an error, and that component still continues to use memory...it's not destructed/cleaned up yet).
I don't know... I think it is a fair assumption that the memory will not be used anymore once the method has returned. I can even see a similar thing in this repository in this recent PR, where it is also "assumed" that the buf can be reused after recvAndDecompress returns. In that case it is indeed only reused when there is no error, but I don't know if that is obvious; it seems more like a very implicit contract.
@dfawley Any thoughts on this? Do you think it makes sense to re-check the context when "dequeueing" the write on the HTTP/2 client side?
No, because you can't get around a race that way unless there's a lock involved. Even if you check it immediately before using it, if the context becomes canceled right after the check, then you would lose.
I'm in favor of documenting the APIs where it isn't clear who has ownership of the data. Would you be able to send a PR to fix up the places where you're running into problems and we're violating expectations?
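For illustration, here is a rough sketch (not grpc-go code; all names invented) of what "a lock involved" would mean: the cancel path and the write path take the same mutex, so observing cancellation and putting bytes on the wire can no longer interleave.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type guardedWriter struct {
	mu       sync.Mutex
	canceled bool
	wire     [][]byte // stand-in for the network
}

// cancel marks the writer as canceled; writes that have not yet taken the
// lock will be dropped.
func (w *guardedWriter) cancel() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.canceled = true
}

// write puts buf on the "wire" only if cancellation has not been observed.
// The check and the write happen under the same lock, so cancel() cannot
// slip in between them.
func (w *guardedWriter) write(buf []byte) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.canceled {
		return context.Canceled
	}
	w.wire = append(w.wire, buf)
	return nil
}

func main() {
	w := &guardedWriter{}
	fmt.Println(w.write([]byte("right"))) // <nil>: written before cancellation
	w.cancel()
	fmt.Println(w.write([]byte("late")))     // context canceled: dropped
	fmt.Println(len(w.wire), "message(s) on the wire")
}
```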
No, because you can't get around a race that way unless there's a lock involved. Even if you check it immediately before using it, if the context becomes canceled right after the check, then you would lose.
That makes sense... but at least in some cases we do copy the bytes to a local buffer, so checking after copying would be safe; yes, the request would still go out, but at least with the right content. I don't know if it would be a huge overhead to always copy this data to a local buffer.
https://github.com/grpc/grpc-go/blob/6578ef72240d14f83140602042a3ed9b4faa9b2f/internal/transport/controlbuf.go#L953-L956
I don't know if it would be a huge overhead to always copy this data to a local buffer.
My guess is it would be, because it would escape the stack, and heap allocations are expensive because of garbage collection.
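A quick way to see that cost is a micro-benchmark comparing handing the buffer off directly versus making a defensive copy per message; run with -benchmem, the copy variant shows one heap allocation per operation. This is a rough sketch with an arbitrary payload size, not a measurement of grpc-go itself.

```go
package buffercopy

import "testing"

const payloadSize = 16 * 1024 // arbitrary message size

var sink []byte // package-level sink so the compiler cannot elide the work

// Baseline: hand the caller's buffer straight to the "transport".
func BenchmarkNoCopy(b *testing.B) {
	buf := make([]byte, payloadSize)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = buf
	}
}

// Defensive copy: allocate and copy before handing the buffer off.
func BenchmarkCopy(b *testing.B) {
	buf := make([]byte, payloadSize)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		cp := make([]byte, len(buf))
		copy(cp, buf)
		sink = cp
	}
}
```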