Client returns `context.Cancelled` and later on sends the request
What version of gRPC are you using?
https://github.com/grpc/grpc-go/commit/761c084e5ac8669104533efa7c614aa1f881c96a
What version of Go are you using (`go version`)?
go version go1.19.3 darwin/amd64
What operating system (Linux, Windows, …) and version?
MacOS
What did you do?
We noticed that sometimes the go-grpc client returns an error (context.Cancelled) but later on still puts the request on the network. It seems this was discussed in https://github.com/grpc/grpc-go/issues/4371, but that issue only considers the following scenarios:
- The cancellation happens before the message is put on the network => Server does not receive the request
- The cancellation happens after the message is put on the network => Server receives and processes the request
Those scenarios make sense, as the request can be cancelled before or after it was already sent (put on the wire), and there is no way to guarantee that a request cancelled on the client side will not be processed on the server side.
But there is a third scenario not discussed in the original issue:
- The cancellation happens before the message is put on the network, and the client puts the message on the network anyway AFTER returning an error to the caller.
This seems to be very unexpected/weird behavior. I would expect that if the client returns an error before the request is sent over the network, the request should not be sent later on.
We can see this happening in the following unit test:
```go
package test

import (
	"context"
	"testing"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/encoding"
	"google.golang.org/grpc/internal/stubserver"

	// In older grpc-go versions this package lived at
	// google.golang.org/grpc/test/grpc_testing.
	testpb "google.golang.org/grpc/interop/grpc_testing"
)

// CustomCodec forwards to the proto codec, except for SimpleRequest, where it
// returns the payload body directly (i.e. a slice aliasing the request).
type CustomCodec struct {
	protoCodec encoding.Codec
}

func (c *CustomCodec) Marshal(v interface{}) ([]byte, error) {
	if r, ok := v.(*testpb.SimpleRequest); ok {
		time.Sleep(10 * time.Millisecond) // slow marshaling so the deadline can fire first
		return r.Payload.Body, nil
	}
	return c.protoCodec.Marshal(v)
}

func (c *CustomCodec) Unmarshal(data []byte, v interface{}) error {
	if r, ok := v.(*testpb.SimpleRequest); ok {
		r.Payload = &testpb.Payload{
			Body: data,
		}
		return nil
	}
	return c.protoCodec.Unmarshal(data, v)
}

func (c *CustomCodec) Name() string {
	return "mycodec"
}

func Test_ContextCancelled(t *testing.T) {
	encoding.RegisterCodec(&CustomCodec{protoCodec: encoding.GetCodec("proto")})
	ss := &stubserver.StubServer{
		UnaryCallF: func(ctx context.Context, in *testpb.SimpleRequest) (*testpb.SimpleResponse, error) {
			if string(in.Payload.Body) != "right" {
				t.Errorf("Received wrong body %v\n", string(in.Payload.Body))
			}
			return &testpb.SimpleResponse{Payload: &testpb.Payload{Body: in.Payload.Body}}, nil
		},
	}
	err := ss.Start([]grpc.ServerOption{})
	if err != nil {
		t.Fatalf("Error starting endpoint server: %v", err)
	}
	defer ss.Stop()
	for i := 0; i < 100; i++ {
		req := &testpb.SimpleRequest{
			Payload: &testpb.Payload{
				Type: testpb.PayloadType_COMPRESSABLE,
				Body: []byte("right"),
			},
		}
		ctx, _ := context.WithTimeout(context.Background(), 10*time.Millisecond)
		ss.Client.UnaryCall(ctx, req, grpc.CallContentSubtype("mycodec"))
		// Reuse the request after the call returned with an error; if the
		// message is still written later, the server sees "wrong".
		copy(req.Payload.Body, "wrong")
	}
}
```
What did you expect to see?
If the client returns an error before sending the request over the network, it should not send it later on (this is different from receiving a timeout on the client side AFTER the request was sent over the network, where the server may still process the request).
What did you see instead?
The client returns an error and later on sends the request (we can see that in the test: we modify the request after `ss.Client.UnaryCall` returns, and we receive the modified request on the server side).
I don't understand the significance of what you're talking about. Context cancelation or deadline expiration are asynchronous events from the write to the network. So it's always possible that the two will race, and your system needs to be designed to be resilient to this possibility. Maybe I'm missing something, though.
Yes, these are two asynchronous events: context cancelation/deadline expiration on one side, and writing data to the network on the other. Thus, there is no way to prevent scenario #3 from occurring, even if preventing it might be beneficial to the system.
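To make the sequencing concrete, here is a minimal, self-contained sketch (not grpc-go code; all names are invented for illustration) of the shape of the race: the call path gives up as soon as the context is done, while a separate writer goroutine drains a queue and may still put the already-queued, possibly reused, buffer on the "wire" afterwards.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// call hands the message to the writer queue and then waits on the context,
// much like a unary call waiting for its response.
func call(ctx context.Context, queue chan<- []byte, msg []byte) error {
	queue <- msg // the message is queued before the deadline fires
	select {
	case <-ctx.Done():
		return ctx.Err() // the caller sees a context error...
	case <-time.After(time.Hour): // stand-in for "wait for the response"
		return nil
	}
}

func main() {
	queue := make(chan []byte, 1)
	wire := make(chan []byte, 1)

	// Writer goroutine: drains the queue onto the "wire", slightly later.
	go func() {
		for msg := range queue {
			time.Sleep(5 * time.Millisecond) // the write happens "later on"
			wire <- msg
		}
	}()

	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()

	msg := []byte("right")
	err := call(ctx, queue, msg)
	fmt.Println("call returned:", err) // ...but the message is already queued

	copy(msg, "wrong")                          // the caller reuses the buffer after the error
	fmt.Println("on the wire:", string(<-wire)) // typically prints "wrong"
}
```

Typically this prints a context error from `call` and then "wrong" on the wire, which is essentially scenario #3 above.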
Hi, thanks all for looking into this.
This is indeed not a problem in most circumstances, but in certain scenarios some implementations may decide to reuse the request/slice objects after the client returns an error/success, and the observed behaviour can be problematic.
We already check whether the context is cancelled in many places along the code path (e.g. here), but it's not checked again after "dequeueing" the writes in the http2Client. Looking at the code, it seems we could check again whether the context is cancelled after copying the data to a local buffer here and prevent this behavior. I can try to open a PR accordingly.
So it's always possible that the two will race, and your system needs to be designed to be resilient to this possibility
If this behavior is by design and we decide not to change it, I would say it should be documented more precisely, as assuming the request can be reused after the client returns an error seems to be a fair assumption. Looking at the codec documentation, for instance, it only mentions that the implementation should be thread safe, but it does not mention that the returned []byte should be a deep copy / single use. If we are using the "legacy" marshaling this seems even easier to hit, since we can define the `Marshal` method on the object itself, e.g.:

```go
func (m *SimpleRequest) Marshal() (dAtA []byte, err error) {
	....
}
```
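As an application-side workaround for codecs that return buffers aliasing the request (like the CustomCodec in the test above), a wrapper can hand the transport its own copy of the marshaled bytes, so even a late write goes out with the right content. A minimal sketch, using only the public encoding.Codec / RegisterCodec / GetCodec APIs; the copyCodec type and the "proto-copy" name are invented here:

```go
package copycodec

import (
	"google.golang.org/grpc/encoding"
	_ "google.golang.org/grpc/encoding/proto" // registers the default "proto" codec
)

// copyCodec wraps another codec and returns a private copy of the marshaled
// bytes, so reusing or mutating the request after the RPC returns cannot
// change what is eventually written.
type copyCodec struct {
	inner encoding.Codec
}

func (c copyCodec) Marshal(v interface{}) ([]byte, error) {
	b, err := c.inner.Marshal(v)
	if err != nil {
		return nil, err
	}
	out := make([]byte, len(b))
	copy(out, b) // snapshot the bytes; the caller keeps ownership of its buffers
	return out, nil
}

func (c copyCodec) Unmarshal(data []byte, v interface{}) error {
	return c.inner.Unmarshal(data, v)
}

func (c copyCodec) Name() string { return "proto-copy" }

func init() {
	encoding.RegisterCodec(copyCodec{inner: encoding.GetCodec("proto")})
}
```

A call would then opt in with `grpc.CallContentSubtype("proto-copy")`, with the same codec registered on the server side.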
@dfawley Any thoughts on this? Do you think it makes sense to re-check the context when "dequeueing" the write on the HTTP/2 client side?
This is WAI. The reader and writer goroutines of the controlbuf communicate with each other, and https://github.com/grpc/grpc-go/blob/6578ef72240d14f83140602042a3ed9b4faa9b2f/internal/transport/controlbuf.go#L537 shows the error handling needed in the writer and reader. This is all part of the guarantee of reads and writes on the stream. Why is it a problem to reuse the memory?
// Upon exiting, if the error causing the exit is not an I/O error, run() flushes and closes the underlying connection. Otherwise, the connection is left open to allow the I/O error to be encountered by the reader instead.
I don't see how this comment is related to the problem. The "unexpected" behavior is not triggered by an I/O error per se but by a timeout (context.Cancelled), which may or may not be related to an I/O error.
Why is it a problem to reuse the memory?
The problem is that the gRPC client can still use the memory asynchronously after returning the error to the caller.
Right, but what I was getting at is that you queue the error on the buffer, and thus can't return early in some cases where you want the reader/writer to hit, say, io.EOF. What technical problems arise from using memory asynchronously after returning an error? The component is not closed, it has simply returned an error. This behavior sounds like something that happens in many components (you call a function on a component and that function returns an error, and that component still continues to use memory...it's not destructed/cleaned up yet).
The component is not closed, it has simply returned an error. This behavior sounds like something that happens in many components (you call a function on a component and that function returns an error, and that component still continues to use memory...it's not destructed/cleaned up yet).
I don't know... I think it is a fair assumption that the memory will not be used anymore once the method has returned. I can even see a similar thing in this repository in this recent PR, where it is also "assumed" that the buf can be reused after recvAndDecompress returns. In that case it is indeed only reused when there is no error, but I don't know if that is obvious; it seems more like a very implicit contract.
@dfawley Any thoughts on this? Do you think it makes sense to re-check the context when "dequeueing" the write on the HTTP/2 client side?
No, because you can't get around a race that way unless there's a lock involved. Even if you check it immediately before using it, if the context becomes canceled right after the check, then you would lose.
I'm in favor of documenting the APIs where it isn't clear who has ownership of the data. Would you be able to send a PR to fix up the places where you're running into problems and we're violating expectations?
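For illustration, here is a rough sketch (not grpc-go code; all names invented) of what "a lock involved" would mean: the cancel path and the write path take the same mutex, so observing cancellation and putting bytes on the wire can no longer interleave.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type guardedWriter struct {
	mu       sync.Mutex
	canceled bool
	wire     [][]byte // stand-in for the network
}

// cancel marks the writer as canceled; writes that have not yet taken the
// lock will be dropped.
func (w *guardedWriter) cancel() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.canceled = true
}

// write puts buf on the "wire" only if cancellation has not been observed.
// The check and the write happen under the same lock, so cancel() cannot
// slip in between them.
func (w *guardedWriter) write(buf []byte) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.canceled {
		return context.Canceled
	}
	w.wire = append(w.wire, buf)
	return nil
}

func main() {
	w := &guardedWriter{}
	fmt.Println(w.write([]byte("right"))) // <nil>: written before cancellation
	w.cancel()
	fmt.Println(w.write([]byte("late")))     // context canceled: dropped
	fmt.Println(len(w.wire), "message(s) on the wire")
}
```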
No, because you can't get around a race that way unless there's a lock involved. Even if you check it immediately before using it, if the context becomes canceled right after the check, then you would lose.
That makes sense... but at least in some cases we do copy the bytes to a local buffer, so checking after copying would be safe; yes, the request would still go out, but at least with the right content. I don't know if it would be a huge overhead to always copy this data to a local buffer.
https://github.com/grpc/grpc-go/blob/6578ef72240d14f83140602042a3ed9b4faa9b2f/internal/transport/controlbuf.go#L953-L956
I don't know if it would be a huge overhead to always copy this data to a local buffer.
My guess is it would be, because it would escape the stack, and heap allocations are expensive because of garbage collection.
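A quick way to see that cost is a micro-benchmark comparing handing the buffer off directly versus making a defensive copy per message; run with -benchmem, the copy variant shows one heap allocation per operation. This is a rough sketch with an arbitrary payload size, not a measurement of grpc-go itself.

```go
package buffercopy

import "testing"

const payloadSize = 16 * 1024 // arbitrary message size

var sink []byte // package-level sink so the compiler cannot elide the work

// Baseline: hand the caller's buffer straight to the "transport".
func BenchmarkNoCopy(b *testing.B) {
	buf := make([]byte, payloadSize)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = buf
	}
}

// Defensive copy: allocate and copy before handing the buffer off.
func BenchmarkCopy(b *testing.B) {
	buf := make([]byte, payloadSize)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		cp := make([]byte, len(buf))
		copy(cp, buf)
		sink = cp
	}
}
```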