Reproduction of #1825 (use of closed network connection) with trace
Acknowledgements
- [X] I have searched [existing issues](https://github.com/aws/aws-sdk/issues?q=is%3Aissue) for past instances of this issue
- [X] I have verified all of my SDK modules are up-to-date (you can perform a bulk update with `go get -u github.com/aws/aws-sdk-go-v2/...`)
Describe the bug
This is a reproduction of #1825 with the tracing suggested by @lucix-aws
```
2024-08-07T03:26:57.822Z !!! READ ON CLIENT-CLOSED CONN !!!
2024-08-07T03:26:57.822Z goroutine 191 [running]:
2024-08-07T03:26:57.822Z runtime/debug.Stack()
2024-08-07T03:26:57.822Z /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
2024-08-07T03:26:57.822Z github.com/---OurCode---/dynclient.(*tracedConn).Close(0xc0008bbbc0)
2024-08-07T03:26:57.822Z /go/pkg/mod/github.com/---OurCode---/client.go:53 +0x18
2024-08-07T03:26:57.822Z crypto/tls.(*Conn).Close(0xc00027d180?)
2024-08-07T03:26:57.822Z /usr/local/go/src/crypto/tls/conn.go:1428 +0xc8
2024-08-07T03:26:57.822Z net/http.(*persistConn).closeLocked(0xc0009c6000, {0x1d177e0, 0x2707f80})
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:2746 +0x13a
2024-08-07T03:26:57.822Z net/http.(*persistConn).close(0xc00016d900?, {0x1d177e0?, 0x2707f80?})
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:2731 +0xa8
2024-08-07T03:26:57.822Z net/http.(*persistConn).writeLoop(0xc0009c6000)
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:2447 +0x29b
2024-08-07T03:26:57.822Z created by net/http.(*Transport).dialConn in goroutine 198
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:1777 +0x16f1
2024-08-07T03:26:57.822Z !!! -(end trace)-------------- !!!
```
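(For context, the actual `tracedConn`/`client.go` source isn't included in this issue. A minimal sketch of this kind of tracing wrapper, with hypothetical names and wiring, could look like the following: it wraps each dialed `net.Conn` so that `Close()` logs a stack trace of whoever tears the connection down.)

```go
// Hypothetical sketch of a connection-tracing wrapper (not the actual
// tracedConn/client.go referenced in the trace above).
package dynclient

import (
	"context"
	"log"
	"net"
	"net/http"
	"runtime/debug"
)

// tracedConn wraps a net.Conn and logs a stack trace whenever it is closed,
// so we can see which code path closes the connection.
type tracedConn struct {
	net.Conn
}

func (c *tracedConn) Close() error {
	log.Printf("conn closed by:\n%s", debug.Stack())
	return c.Conn.Close()
}

// NewTracedTransport clones the default transport and wraps every dialed
// connection in tracedConn. Hand it to the SDK via a custom *http.Client.
func NewTracedTransport() *http.Transport {
	t := http.DefaultTransport.(*http.Transport).Clone()
	d := &net.Dialer{}
	t.DialContext = func(ctx context.Context, network, addr string) (net.Conn, error) {
		conn, err := d.DialContext(ctx, network, addr)
		if err != nil {
			return nil, err
		}
		return &tracedConn{Conn: conn}, nil
	}
	return t
}
```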
Expected Behavior
dynamodb Client.Query to return without error
Current Behavior
dynamodb Client.Query returns error:
```
operation error DynamoDB: Query, https response error StatusCode: 200, RequestID: GUR48MP1C12O956O84J2H67F7NVV4KQNSO5AEMVJF66Q9ASUAAJG, deserialization failed, failed to decode response body, read tcp 169.254.76.1:56796->52.94.0.72:443: use of closed network connection
```
Reproduction Steps
```go
result, err := db.Query(
	ctx,
	&dynamodb.QueryInput{
		TableName:                 aws.String(name),
		ConsistentRead:            aws.Bool(true),
		Limit:                     aws.Int32(count),
		KeyConditionExpression:    aws.String(keyConditionExpression),
		ExpressionAttributeValues: expressionAttributeValues,
		ExclusiveStartKey:         exclusiveStartKey,
	},
)
```
Possible Solution
No response
Additional Information/Context
No response
AWS Go SDK V2 Module Versions Used
```
github.com/aws/aws-lambda-go v1.47.0
github.com/aws/aws-sdk-go-v2 v1.27.1
github.com/aws/aws-sdk-go-v2/service/s3 v1.54.4
github.com/aws/aws-sdk-go-v2/service/sns v1.29.9
github.com/aws/smithy-go v1.20.2
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.6.2 // indirect
github.com/aws/aws-sdk-go-v2/config v1.27.17 // indirect
github.com/aws/aws-sdk-go-v2/credentials v1.17.17 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.16.4 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.8 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.8 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.8.0 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.3.8 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.32.7 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.11.2 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.3.10 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.9.9 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.11.10 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.17.8 // indirect
github.com/aws/aws-sdk-go-v2/service/kms v1.32.2 // indirect
github.com/aws/aws-sdk-go-v2/service/sqs v1.32.4 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.20.10 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.24.4 // indirect
github.com/aws/aws-sdk-go-v2/service/sts v1.28.11 // indirect
```
Compiler and Version used
go version go1.21.3 linux/amd64
Operating System and version
lambda runtime provided.al2
Same here:

```
goroutine 893760 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
parser/cmd/parser/storage.(*tracedConn).Close(0xc00070e840)
	/go/src/cmd/parser/storage/traceconn.go:40 +0x18
crypto/tls.(*Conn).Close(0xc0000e8dd0?)
	/usr/local/go/src/crypto/tls/conn.go:1429 +0xc8
net/http.(*persistConn).closeLocked(0xc000745200, {0xce89a0, 0x1101eb0})
	/usr/local/go/src/net/http/transport.go:2783 +0x13a
net/http.(*persistConn).close(0xc0002e08c0?, {0xce89a0?, 0x1101eb0?})
	/usr/local/go/src/net/http/transport.go:2768 +0xa8
net/http.(*persistConn).closeConnIfStillIdle(0xc000745200)
	/usr/local/go/src/net/http/transport.go:2040 +0xbe
created by time.goFunc
	/usr/local/go/src/time/sleep.go:177 +0x2d
```
For what it's worth, I actually get two different stack traces when I run with the tracing code. The one I provided in the description matches the time I saw the closed network connection error, and it happens very infrequently. However, I frequently get this trace:
```
2024-08-07T03:26:22.589Z !!! READ ON CLIENT-CLOSED CONN !!!
2024-08-07T03:26:22.589Z goroutine 181 [running]:
2024-08-07T03:26:22.589Z runtime/debug.Stack()
2024-08-07T03:26:22.589Z /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
2024-08-07T03:26:22.589Z github.com/---OurCode---/dynclient.(*tracedConn).Close(0xc00021a040)
2024-08-07T03:26:22.589Z /go/pkg/mod/github.com/---OurCode---/client.go:53 +0x18
2024-08-07T03:26:22.589Z crypto/tls.(*Conn).Close(0xc00027d180?)
2024-08-07T03:26:22.589Z /usr/local/go/src/crypto/tls/conn.go:1428 +0xc8
2024-08-07T03:26:22.589Z net/http.(*persistConn).closeLocked(0xc000672240, {0x1d177e0, 0x2708ba0})
2024-08-07T03:26:22.589Z /usr/local/go/src/net/http/transport.go:2746 +0x13a
2024-08-07T03:26:22.589Z net/http.(*persistConn).close(0xc00027d180?, {0x1d177e0?, 0x2708ba0?})
2024-08-07T03:26:22.589Z /usr/local/go/src/net/http/transport.go:2731 +0xa8
2024-08-07T03:26:22.589Z net/http.(*persistConn).closeConnIfStillIdle(0xc000672240)
2024-08-07T03:26:22.589Z /usr/local/go/src/net/http/transport.go:2017 +0xbe
2024-08-07T03:26:22.589Z created by time.goFunc
2024-08-07T03:26:22.589Z /usr/local/go/src/time/sleep.go:176 +0x2d
2024-08-07T03:26:22.589Z !!! -(end trace)-------------- !!!
```
but it doesn't seem to correspond to an error in my client code.
TYVM for taking the time to repro. I'll try to deep dive this week or early next.
These traces certainly appear to demonstrate that it's net/http that's ultimately closing the conns out from under us. Whether we indirectly caused said close remains to be seen.
@lucix-aws Hi, I have a theory, but I don't know how to prove it:
I think the real issue is hidden by ignoring the error from closing the request body here: https://github.com/aws/smithy-go/blob/bed421c3d7515856512061d107cf2ae3254cf3b8/transport/http/client.go#L116
I also saw a similar issue in aws-sdk-go v1, but the error there was "connection reset by peer"; maybe they are related to this one? https://github.com/search?q=repo%3Aaws%2Faws-sdk-go+connection+reset+by+peer&type=issues
Copying this comment I left on the original issue:
You can make the underlying error net.ErrClosed retryable in your client with the following:
```go
// Requires "errors", "net", "github.com/aws/aws-sdk-go-v2/aws", and the
// service client package (s3 here) to be imported.
type retryErrClosed struct {
	aws.Retryer
}

func (r *retryErrClosed) IsErrorRetryable(err error) bool {
	return errors.Is(err, net.ErrClosed) || r.Retryer.IsErrorRetryable(err)
}

svc := s3.NewFromConfig(cfg, func(o *s3.Options) {
	o.Retryer = &retryErrClosed{o.Retryer}
})
```
This should be safe to do for get-type operations such as dynamodb.Query.
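A minimal sketch of the same idea applied to the DynamoDB client (assuming the `retryErrClosed` type above and an already-loaded `aws.Config` named `cfg`):

```go
// Sketch: install the retryErrClosed wrapper on the DynamoDB client.
db := dynamodb.NewFromConfig(cfg, func(o *dynamodb.Options) {
	o.Retryer = &retryErrClosed{o.Retryer}
})
```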
There's been a lot of back-and-forth about the premature request body close we're doing here. Some commenters claim that removing it in a fork fixes the issue for them; at least one reported that removing it only helped temporarily, with the error reappearing once throughput increased.
Unless I've missed something... this is a total red herring. We need to take a closer look at what this code is actually doing in practice. Let's look at dynamodb.Query, an oft-mentioned operation around this issue, as an example.
Some background: dynamodb.Query is a request-response style operation - we serialize the input structure to the wire and deserialize the HTTP response into the output structure. All of the request/response body Close()ing and such are opaque to the caller. This is different from a streaming-blob style operation like S3 GetObject - where you as the caller literally pass in a stream that we use as the http request body directly.
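To illustrate the distinction (a hedged sketch, assuming `db` and `s3c` are already-constructed DynamoDB and S3 clients; this is caller-facing usage, not SDK internals):

```go
// Request-response style: the SDK serializes the input and fully deserializes
// the response. The caller never sees or closes any body stream.
out, err := db.Query(ctx, &dynamodb.QueryInput{TableName: aws.String("example-table")})
_ = out

// Streaming-blob style: the caller gets the raw response stream and owns it.
obj, err := s3c.GetObject(ctx, &s3.GetObjectInput{
	Bucket: aws.String("example-bucket"),
	Key:    aws.String("example-key"),
})
if err == nil {
	defer obj.Body.Close() // the caller manages this stream directly
	data, _ := io.ReadAll(obj.Body)
	_ = data
}
```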
So when you call Query:
- we turn `QueryInput` into a JSON payload to be sent over the wire
- in the `OperationSerializer` middleware, we pass that JSON payload to `smithyhttp.Request.SetStream()` (code-generated source)
- in the `ComputeContentLength` middleware, this check happens, which causes the `http.Request.ContentLength` field to be set, since the length of the underlying request stream is known here (it's just a `bytes.Reader` underneath)
- we call `smithyhttp.Request.Build()`, which wraps the plain `io.Reader` stream previously set by `SetStream` in an `io.NopCloser` (which does nothing and never returns an error) and an additional self-implemented `safeReadCloser` that makes `Close()` calls idempotent (source)
- we `Do()` the request using the configured HTTP client
- we do the controversial request body close, ignoring the error
So plainly speaking -- the closing behavior that people claim to be problematic is literally a no-op. There is no error or side effect here on the request body close whatsoever, because for request-response style operations, the input is always a byte buffer wrapped in an io.NopCloser. There's no chance for us here, then, to be interfering with the HTTP client's connection lifecycle or anything like that. From a static analysis perspective, there is simply absolutely zero indication that this request body close routine is contributing to the appearance of this networking error.
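A tiny standalone sketch of why that close is a no-op for request-response operations (the real smithy-go wrapping adds a `safeReadCloser` on top, which only makes `Close` idempotent; it is omitted here):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	// The serialized Query input is just an in-memory byte buffer...
	payload := bytes.NewReader([]byte(`{"TableName":"example"}`))
	// ...wrapped in io.NopCloser, whose Close always returns nil.
	body := io.NopCloser(payload)

	// The "controversial" close: no error, no side effect, no network state touched.
	fmt.Println(body.Close()) // <nil>

	// The payload is still readable afterwards; nothing was torn down.
	b, _ := io.ReadAll(body)
	fmt.Println(string(b)) // {"TableName":"example"}
}
```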
@lucix-aws Thanks for the detailed explanation. Based on my recent testing, these "use of closed network connection" errors are thrown when querying larger items, consuming around 5-6 RCUs, under high throughput. I also tried without setting an HTTP timeout, which is effectively no timeout at all. 5-6 RCUs corresponds to roughly 20-24 KB of data, which is small enough not to batch, but under high load this may be causing the connection to be closed prematurely before the complete response is deserialized. I also added filters to trim the data, so on the wire the size probably shrinks to around 10-12 KB. But I didn't see this issue with the forked version where the body close is commented out.
> But I didn't see this issue with the forked version where the body close is commented out.
You're experiencing a placebo effect here. Removing that logic does nothing, because the logic itself does nothing as explained above.
In all likelihood this is simply a race in the underlying net.Conn read/write loop logic. I'm not an SME here but I've been doing a deep dive on this part of the stdlib and it seems like connection resets occur in one of two ways:
- a client calls `Read()` on the response body, which triggers a synchronous data pull on the connection. The peer sends an RST and we get an error like "connection reset". We handle this today in retries.
- the conn read loop asynchronously pulls an RST, which causes the connection to be closed - which is what we see in the stack traces above. In this scenario we get the "use of closed network connection" error. We do not handle this in retries.
The fix is most likely to just add the latter to our set of retryables by default, as my snippet above shows, since these errors are effectively the same thing. Even if they're not -- a retry should get past this anyway.
@lucix-aws Yes, a retry should solve this, but there's no proper way to categorize the error type. I had to check for certain keywords, and whether the message contains "use of closed network connection", to recognize whether it's retryable or not.
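(Note: keyword matching shouldn't be needed here. The SDK preserves the wrapped error chain, and "use of closed network connection" is `net.ErrClosed` underneath, so `errors.Is` can classify it, which is exactly what the `retryErrClosed` snippet above relies on. A sketch, assuming `err` came back from `Client.Query`:)

```go
// Sketch: classify the failure without string matching.
if errors.Is(err, net.ErrClosed) {
	// The *net.OpError ("read tcp ...: use of closed network connection")
	// is reachable through the SDK's wrapped error chain, so this matches.
	// Safe to retry an idempotent operation like Query here.
}
```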
This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.