Reproduction of #1825 (use of closed network connection) with trace
Acknowledgements
- [X] I have searched [existing issues](https://github.com/aws/aws-sdk/issues?q=is%3Aissue) for past instances of this issue
- [X] I have verified all of my SDK modules are up-to-date (you can perform a bulk update with `go get -u github.com/aws/aws-sdk-go-v2/...`)
Describe the bug
This is a reproduction of #1825 with the tracing suggested by @lucix-aws
```
2024-08-07T03:26:57.822Z !!! READ ON CLIENT-CLOSED CONN !!!
2024-08-07T03:26:57.822Z goroutine 191 [running]:
2024-08-07T03:26:57.822Z runtime/debug.Stack()
2024-08-07T03:26:57.822Z /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
2024-08-07T03:26:57.822Z github.com/---OurCode---/dynclient.(*tracedConn).Close(0xc0008bbbc0)
2024-08-07T03:26:57.822Z /go/pkg/mod/github.com/---OurCode---/client.go:53 +0x18
2024-08-07T03:26:57.822Z crypto/tls.(*Conn).Close(0xc00027d180?)
2024-08-07T03:26:57.822Z /usr/local/go/src/crypto/tls/conn.go:1428 +0xc8
2024-08-07T03:26:57.822Z net/http.(*persistConn).closeLocked(0xc0009c6000, {0x1d177e0, 0x2707f80})
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:2746 +0x13a
2024-08-07T03:26:57.822Z net/http.(*persistConn).close(0xc00016d900?, {0x1d177e0?, 0x2707f80?})
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:2731 +0xa8
2024-08-07T03:26:57.822Z net/http.(*persistConn).writeLoop(0xc0009c6000)
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:2447 +0x29b
2024-08-07T03:26:57.822Z created by net/http.(*Transport).dialConn in goroutine 198
2024-08-07T03:26:57.822Z /usr/local/go/src/net/http/transport.go:1777 +0x16f1
2024-08-07T03:26:57.822Z !!! -(end trace)-------------- !!!
```
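(For context, the actual `tracedConn`/`client.go` source isn't included in this issue. A minimal sketch of this kind of tracing wrapper, with hypothetical names and wiring, could look like the following: it wraps each dialed `net.Conn` so that `Close()` logs a stack trace of whoever tears the connection down.)

```go
// Hypothetical sketch of a connection-tracing wrapper (not the actual
// tracedConn/client.go referenced in the trace above).
package dynclient

import (
	"context"
	"log"
	"net"
	"net/http"
	"runtime/debug"
)

// tracedConn wraps a net.Conn and logs a stack trace whenever it is closed,
// so we can see which code path closes the connection.
type tracedConn struct {
	net.Conn
}

func (c *tracedConn) Close() error {
	log.Printf("conn closed by:\n%s", debug.Stack())
	return c.Conn.Close()
}

// NewTracedTransport clones the default transport and wraps every dialed
// connection in tracedConn. Hand it to the SDK via a custom *http.Client.
func NewTracedTransport() *http.Transport {
	t := http.DefaultTransport.(*http.Transport).Clone()
	d := &net.Dialer{}
	t.DialContext = func(ctx context.Context, network, addr string) (net.Conn, error) {
		conn, err := d.DialContext(ctx, network, addr)
		if err != nil {
			return nil, err
		}
		return &tracedConn{Conn: conn}, nil
	}
	return t
}
```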
Expected Behavior
dynamodb Client.Query to return without error
Current Behavior
dynamodb Client.Query returns error:
```
operation error DynamoDB: Query, https response error StatusCode: 200, RequestID: GUR48MP1C12O956O84J2H67F7NVV4KQNSO5AEMVJF66Q9ASUAAJG, deserialization failed, failed to decode response body, read tcp 169.254.76.1:56796->52.94.0.72:443: use of closed network connection
```
Reproduction Steps
```go
result, err := db.Query(
	ctx,
	&dynamodb.QueryInput{
		TableName:                 aws.String(name),
		ConsistentRead:            aws.Bool(true),
		Limit:                     aws.Int32(count),
		KeyConditionExpression:    aws.String(keyConditionExpression),
		ExpressionAttributeValues: expressionAttributeValues,
		ExclusiveStartKey:         exclusiveStartKey,
	},
)
```
Possible Solution
No response
Additional Information/Context
No response
AWS Go SDK V2 Module Versions Used
```
github.com/aws/aws-lambda-go v1.47.0
github.com/aws/aws-sdk-go-v2 v1.27.1
github.com/aws/aws-sdk-go-v2/service/s3 v1.54.4
github.com/aws/aws-sdk-go-v2/service/sns v1.29.9
github.com/aws/smithy-go v1.20.2
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.6.2 // indirect
github.com/aws/aws-sdk-go-v2/config v1.27.17 // indirect
github.com/aws/aws-sdk-go-v2/credentials v1.17.17 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.16.4 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.8 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.8 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.8.0 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.3.8 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.32.7 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.11.2 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.3.10 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.9.9 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.11.10 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.17.8 // indirect
github.com/aws/aws-sdk-go-v2/service/kms v1.32.2 // indirect
github.com/aws/aws-sdk-go-v2/service/sqs v1.32.4 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.20.10 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.24.4 // indirect
github.com/aws/aws-sdk-go-v2/service/sts v1.28.11 // indirect
```
Compiler and Version used
go version go1.21.3 linux/amd64
Operating System and version
lambda runtime provided.al2
Same here:

```
goroutine 893760 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
parser/cmd/parser/storage.(*tracedConn).Close(0xc00070e840)
	/go/src/cmd/parser/storage/traceconn.go:40 +0x18
crypto/tls.(*Conn).Close(0xc0000e8dd0?)
	/usr/local/go/src/crypto/tls/conn.go:1429 +0xc8
net/http.(*persistConn).closeLocked(0xc000745200, {0xce89a0, 0x1101eb0})
	/usr/local/go/src/net/http/transport.go:2783 +0x13a
net/http.(*persistConn).close(0xc0002e08c0?, {0xce89a0?, 0x1101eb0?})
	/usr/local/go/src/net/http/transport.go:2768 +0xa8
net/http.(*persistConn).closeConnIfStillIdle(0xc000745200)
	/usr/local/go/src/net/http/transport.go:2040 +0xbe
created by time.goFunc
	/usr/local/go/src/time/sleep.go:177 +0x2d
```
For what it's worth, I actually get two different stack traces when I run with the tracing code. The one I provided in the description matches the time I saw the closed network connection error, and it happens very infrequently. However, I frequently get this trace:
```
2024-08-07T03:26:22.589Z !!! READ ON CLIENT-CLOSED CONN !!!
2024-08-07T03:26:22.589Z goroutine 181 [running]:
2024-08-07T03:26:22.589Z runtime/debug.Stack()
2024-08-07T03:26:22.589Z /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
2024-08-07T03:26:22.589Z github.com/---OurCode---/dynclient.(*tracedConn).Close(0xc00021a040)
2024-08-07T03:26:22.589Z /go/pkg/mod/github.com/---OurCode---/client.go:53 +0x18
2024-08-07T03:26:22.589Z crypto/tls.(*Conn).Close(0xc00027d180?)
2024-08-07T03:26:22.589Z /usr/local/go/src/crypto/tls/conn.go:1428 +0xc8
2024-08-07T03:26:22.589Z net/http.(*persistConn).closeLocked(0xc000672240, {0x1d177e0, 0x2708ba0})
2024-08-07T03:26:22.589Z /usr/local/go/src/net/http/transport.go:2746 +0x13a
2024-08-07T03:26:22.589Z net/http.(*persistConn).close(0xc00027d180?, {0x1d177e0?, 0x2708ba0?})
2024-08-07T03:26:22.589Z /usr/local/go/src/net/http/transport.go:2731 +0xa8
2024-08-07T03:26:22.589Z net/http.(*persistConn).closeConnIfStillIdle(0xc000672240)
2024-08-07T03:26:22.589Z /usr/local/go/src/net/http/transport.go:2017 +0xbe
2024-08-07T03:26:22.589Z created by time.goFunc
2024-08-07T03:26:22.589Z /usr/local/go/src/time/sleep.go:176 +0x2d
2024-08-07T03:26:22.589Z !!! -(end trace)-------------- !!!
```
but it doesn't seem to correspond to an error in my client code.
TYVM for taking the time to repro. I'll try to deep dive this week or early next.
These traces certainly appear to demonstrate that it's net/http that's ultimately closing the conns out from under us. Whether we indirectly caused said close remains to be seen.
@lucix-aws Hi, I have a theory, but I don't know how to prove it:
I think the real issue is hidden by ignoring the error from closing the request body here: https://github.com/aws/smithy-go/blob/bed421c3d7515856512061d107cf2ae3254cf3b8/transport/http/client.go#L116
I also saw a similar issue in aws-sdk-go v1, but the error there was "connection reset by peer"; maybe they are related to this one? https://github.com/search?q=repo%3Aaws%2Faws-sdk-go+connection+reset+by+peer&type=issues
Copying this comment I left on the original issue:
You can make the underlying error net.ErrClosed retryable in your client with the following:
```go
// Requires "errors", "net", "github.com/aws/aws-sdk-go-v2/aws", and the
// service client package (s3 here) to be imported.
type retryErrClosed struct {
	aws.Retryer
}

func (r *retryErrClosed) IsErrorRetryable(err error) bool {
	return errors.Is(err, net.ErrClosed) || r.Retryer.IsErrorRetryable(err)
}

svc := s3.NewFromConfig(cfg, func(o *s3.Options) {
	o.Retryer = &retryErrClosed{o.Retryer}
})
```
This should be safe to do for get-type operations such as dynamodb.Query.
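A minimal sketch of the same idea applied to the DynamoDB client (assuming the `retryErrClosed` type above and an already-loaded `aws.Config` named `cfg`):

```go
// Sketch: install the retryErrClosed wrapper on the DynamoDB client.
db := dynamodb.NewFromConfig(cfg, func(o *dynamodb.Options) {
	o.Retryer = &retryErrClosed{o.Retryer}
})
```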
There's been a lot of back-and-forth about the premature request body close we're doing here. Some commenters claim that removing it in a fork fixes the issue for them; at least one reported that removing it only helped temporarily, with the error reappearing once throughput increased.
Unless I've missed something... this is a total red herring. We need to take a closer look at what this code is actually doing in practice. Let's look at dynamodb.Query, an oft-mentioned operation around this issue, as an example.
Some background: dynamodb.Query is a request-response style operation - we serialize the input structure to the wire and deserialize the HTTP response into the output structure. All of the request/response body Close()ing and such are opaque to the caller. This is different from a streaming-blob style operation like S3 GetObject - where you as the caller literally pass in a stream that we use as the http request body directly.
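To illustrate the distinction (a hedged sketch, assuming `db` and `s3c` are already-constructed DynamoDB and S3 clients; this is caller-facing usage, not SDK internals):

```go
// Request-response style: the SDK serializes the input and fully deserializes
// the response. The caller never sees or closes any body stream.
out, err := db.Query(ctx, &dynamodb.QueryInput{TableName: aws.String("example-table")})
_ = out

// Streaming-blob style: the caller gets the raw response stream and owns it.
obj, err := s3c.GetObject(ctx, &s3.GetObjectInput{
	Bucket: aws.String("example-bucket"),
	Key:    aws.String("example-key"),
})
if err == nil {
	defer obj.Body.Close() // the caller manages this stream directly
	data, _ := io.ReadAll(obj.Body)
	_ = data
}
```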
So when you call Query:
- we turn `QueryInput` into a JSON payload to be sent over the wire
- in the `OperationSerializer` middleware, we pass that JSON payload to `smithyhttp.Request.SetStream()` (code-generated source)
- in the `ComputeContentLength` middleware, this check happens, which causes the `http.Request.ContentLength` field to be set, since the length of the underlying request stream is known here (it's just a `bytes.Reader` underneath)
- we call `smithyhttp.Request.Build()`, which wraps the plain `io.Reader` stream previously set by `SetStream` in an `io.NopCloser` (which does nothing and never returns an error) and an additional self-implemented `safeReadCloser` that makes `Close()` calls idempotent (source)
- we `Do()` the request using the configured HTTP client
- we do the controversial request body close, ignoring the error
So plainly speaking -- the closing behavior that people claim to be problematic is literally a no-op. There is no error or side effect here on the request body close whatsoever, because for request-response style operations, the input is always a byte buffer wrapped in an io.NopCloser. There's no chance for us here, then, to be interfering with the HTTP client's connection lifecycle or anything like that. From a static analysis perspective, there is simply absolutely zero indication that this request body close routine is contributing to the appearance of this networking error.
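A tiny standalone sketch of why that close is a no-op for request-response operations (the real smithy-go wrapping adds a `safeReadCloser` on top, which only makes `Close` idempotent; it is omitted here):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	// The serialized Query input is just an in-memory byte buffer...
	payload := bytes.NewReader([]byte(`{"TableName":"example"}`))
	// ...wrapped in io.NopCloser, whose Close always returns nil.
	body := io.NopCloser(payload)

	// The "controversial" close: no error, no side effect, no network state touched.
	fmt.Println(body.Close()) // <nil>

	// The payload is still readable afterwards; nothing was torn down.
	b, _ := io.ReadAll(body)
	fmt.Println(string(b)) // {"TableName":"example"}
}
```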
@lucix-aws Thanks for the detailed explanation. Based on my recent testing, these "use of closed network connection" errors are thrown when querying larger items, consuming around 5-6 RCUs, under high throughput. I also tried without setting an HTTP timeout, which is effectively no timeout at all. 5-6 RCUs corresponds to roughly 20-24 KB of data, which is small enough not to batch, but under high load this may be causing the connection to be closed prematurely before the complete response is deserialized. I also added filters to trim the data, so on the wire the size probably shrinks to around 10-12 KB. But I didn't see this issue with the forked version where the body close is commented out.
> But I didn't see this issue with the forked version where the body close is commented out.
You're experiencing a placebo effect here. Removing that logic does nothing, because the logic itself does nothing as explained above.
In all likelihood this is simply a race in the underlying net.Conn read/write loop logic. I'm not an SME here but I've been doing a deep dive on this part of the stdlib and it seems like connection resets occur in one of two ways:
- a client calls `Read()` on the response body, which triggers a synchronous data pull on the connection. The peer sends an RST and we get an error like "connection reset". We handle this today in retries.
- the conn read loop asynchronously pulls an RST, which causes the connection to be closed - which is what we see in the stack traces above. In this scenario we get the "use of closed network connection" error. We do not handle this in retries.
The fix is most likely to just add the latter to our set of retryables by default, as my snippet above shows, since these errors are effectively the same thing. Even if they're not -- a retry should get past this anyway.
@lucix-aws Yes, a retry should solve this, but there's no proper way to categorize the error type. I had to check for certain keywords, and whether the message contains "use of closed network connection", to recognize whether it's retryable or not.
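(Note: keyword matching shouldn't be needed here. The SDK preserves the wrapped error chain, and "use of closed network connection" is `net.ErrClosed` underneath, so `errors.Is` can classify it, which is exactly what the `retryErrClosed` snippet above relies on. A sketch, assuming `err` came back from `Client.Query`:)

```go
// Sketch: classify the failure without string matching.
if errors.Is(err, net.ErrClosed) {
	// The *net.OpError ("read tcp ...: use of closed network connection")
	// is reachable through the SDK's wrapped error chain, so this matches.
	// Safe to retry an idempotent operation like Query here.
}
```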
This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.