Potential HTTP client bug which bubbles up as a terminated connection
Version & Environment
Redpanda version (use rpk version): dev as of 25f8cccfc66c57f2a3a232941dc3d7995ccc7af2
Description
The connection to AWS S3 is repeatedly killed when a certain sequence of requests is sent over it. If retries are enabled and a new connection is created, the previously failing request succeeds. I would have chalked this up to an AWS S3 oddity, but it happens consistently for the same sequence of requests.
During the scrubbing of a partition, we issue DeleteObjects requests to delete the segments within a manifest, followed by a DeleteObject request to remove the manifest itself. In the tests, the DeleteObject request always receives an empty chunk, which is interpreted as an EOF and bubbled up as a retryable error.
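For illustration, here is a minimal sketch of the offending request sequence, written against a plain synchronous Boost.Beast client rather than Redpanda's Seastar-based http client; the endpoint, bucket, key and request body are placeholders, and TLS plus SigV4 signing are omitted:

```cpp
// Hypothetical reproduction sketch: POST /?delete (DeleteObjects) followed by
// DELETE of the manifest on the same reused connection. Host, key and body
// are placeholders; real traffic goes to S3 over TLS with signed headers.
#include <boost/asio/ip/tcp.hpp>
#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>
#include <iostream>

namespace beast = boost::beast;
namespace http  = beast::http;
namespace net   = boost::asio;
using tcp       = net::ip::tcp;

int main() {
    net::io_context ioc;
    tcp::resolver resolver{ioc};
    beast::tcp_stream stream{ioc};
    stream.connect(resolver.resolve("s3.example.test", "80"));

    // 1. DeleteObjects: POST /?delete with an XML body listing the segments.
    http::request<http::string_body> del_objects{http::verb::post, "/?delete", 11};
    del_objects.set(http::field::host, "panda-bucket.s3.example.test");
    del_objects.body() = "<Delete>...</Delete>";
    del_objects.prepare_payload();
    http::write(stream, del_objects);

    beast::flat_buffer buffer;
    http::response<http::string_body> res1;
    http::read(stream, buffer, res1); // chunked response, parses fine

    // 2. DeleteObject: DELETE of the manifest with Content-Length: 0,
    //    reusing the same connection. In the failing runs the next read
    //    yields a zero-length chunk and the parser reports
    //    "unexpected eof in body".
    http::request<http::empty_body> del_manifest{
        http::verb::delete_, "/00000000/meta/kafka/topic/0_34/manifest.bin", 11};
    del_manifest.set(http::field::host, "panda-bucket.s3.example.test");
    del_manifest.set(http::field::content_length, "0");
    http::write(stream, del_manifest);

    http::response<http::string_body> res2;
    beast::error_code ec;
    http::read(stream, buffer, res2, ec);
    std::cout << "second read: " << ec.message() << "\n";
}
```

The interesting step is the second request: a zero-length-body DELETE issued on a connection that has just finished parsing a chunked response, which is exactly the point where the trace below shows the failure.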
# DeleteObjects request - proceeds fine
TRACE 2023-06-26 16:26:02,437 [shard 2] http - [/?delete] - client.cc:90 - client.make_request POST /?delete HTTP/1.1
Host: panda-bucket-e40ab074-143d-11ee-92f6-c94316310214.s3.us-west-2.amazonaws.com
Content-MD5: KlQfPLIMMh5kqjccPfNHrg==
Content-Length: 3111
x-amz-date: 20230626T162602Z
x-amz-content-sha256: [secret]
Authorization: [secret]
DEBUG 2023-06-26 16:26:02,437 [shard 2] http - [/?delete] - client.cc:102 - reusing connection, age 0
TRACE 2023-06-26 16:26:02,437 [shard 2] http - /?delete - client.cc:442 - request_stream.send_some 3111
# Chunked response to the plural delete received
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /?delete - client.cc:297 - chunk received, chunk length 278
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /?delete - client.cc:297 - chunk received, chunk length 122
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /?delete - client.cc:367 - not all consumed, noctets 120, input size 2, output size 116, ec system:0
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /?delete - client.cc:297 - chunk received, chunk length 5
DEBUG 2023-06-26 16:26:02,966 [shard 2] http - iobuf_body.cc:82 - reader - finish called
DEBUG 2023-06-26 16:26:02,966 [shard 2] client_pool - client_pool.cc:318 - releasing a client, pool size: 19, capacity: 20
TRACE 2023-06-26 16:26:02,966 [shard 2] s3 - s3_client.cc:824 - send https request:
DELETE /00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin HTTP/1.1
User-Agent: redpanda.vectorized.io
Host: panda-bucket-e40ab074-143d-11ee-92f6-c94316310214.s3.us-west-2.amazonaws.com
Content-Length: 0
x-amz-date: 20230626T162602Z
x-amz-content-sha256: [secret]
Authorization: [secret]
# DeleteObject request, reusing the connection - fails
DEBUG 2023-06-26 16:26:02,966 [shard 2] http - [/00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin] - client.cc:102 - reusing connection, age 0
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin - client.cc:442 - request_stream.send_some 0
# EOF is read off the stream
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin - client.cc:297 - chunk received, chunk length 0
DEBUG 2023-06-26 16:26:02,966 [shard 2] http - /00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin - client.cc:264 - 'beast.http:26' error triggered
WARN 2023-06-26 16:26:02,966 [shard 2] http - /00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin - client.cc:377 - receive error boost::system::system_error (unexpected eof in body [beast.http:26])
INFO 2023-06-26 16:26:02,966 [shard 2] s3 - util.cc:83 - Server disconnected: 'unexpected eof in body [beast.http:26]', retrying HTTP request
This behaviour could be due either to a bug in the client or to S3 legitimately terminating the connection. I find the latter unlikely, as this always happens for the same sequence of requests (the scrubber attempted the same requests multiple times and the failure was always the same).
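For completeness, a hedged sketch of the retry-on-a-new-connection behaviour described above (this is not the actual code behind util.cc; the helper name and parameters are illustrative only): when a read fails with an EOF-style error, the request is replayed on a freshly connected stream, which matches the observation that the retried request succeeds.

```cpp
// Illustrative only: retry a request on a fresh connection when the previous
// attempt ends with an EOF-style receive error, mirroring the behaviour in
// the log ("Server disconnected: ... retrying HTTP request").
#include <boost/asio/ip/tcp.hpp>
#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>

namespace beast = boost::beast;
namespace http  = beast::http;
namespace net   = boost::asio;
using tcp       = net::ip::tcp;

template <class Body>
http::response<http::string_body>
request_with_retry(net::io_context& ioc,
                   tcp::resolver::results_type const& addrs,
                   http::request<Body> req,
                   int max_retries = 3) {
    for (int attempt = 0;; ++attempt) {
        // Each attempt uses a newly connected stream; in the real client the
        // first attempt reuses a pooled connection instead.
        beast::tcp_stream stream{ioc};
        stream.connect(addrs);
        http::write(stream, req);

        beast::flat_buffer buffer;
        http::response<http::string_body> res;
        beast::error_code ec;
        http::read(stream, buffer, res, ec);
        if (!ec) {
            return res;
        }
        // Treat "unexpected eof in body" / end_of_stream as retryable,
        // as the client does in the trace above.
        if (attempt >= max_retries) {
            throw beast::system_error{ec};
        }
    }
}
```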
This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.
This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.
We still need to chase this. cc: @dotnwat