redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

Potential HTTP client bug which bubbles up as a terminated connection

Open VladLazar opened this issue 2 years ago • 2 comments

Version & Environment

Redpanda version: (use rpk version): dev as of 25f8cccfc66c57f2a3a232941dc3d7995ccc7af2

Description

The connection to aws s3 is repeatedly being killed when a certain sequence of requests are sent via it. If retries are enabled and a new connection is created the previously failing request succeeds. I would have chalked this up to an aws s3 oddity, but it happens repeatedly for the same sequence of requests.

During the scrubbing of a partition, we issue DeleteObjects requests, to delete the segments within a manifest, followed up by a DeleteObject request to remove the manifest itself. In the tests, the DeleteObject request would always receive an empty chunk, which gets interpreted as an EOF and bubbled up as a retryable error.

# DeleteObjects request - proceeds fine
TRACE 2023-06-26 16:26:02,437 [shard 2] http - [/?delete] - client.cc:90 - client.make_request POST /?delete HTTP/1.1
Host: panda-bucket-e40ab074-143d-11ee-92f6-c94316310214.s3.us-west-2.amazonaws.com
Content-MD5: KlQfPLIMMh5kqjccPfNHrg==
Content-Length: 3111
x-amz-date: 20230626T162602Z
x-amz-content-sha256: [secret]
Authorization: [secret]

DEBUG 2023-06-26 16:26:02,437 [shard 2] http - [/?delete] - client.cc:102 - reusing connection, age 0
TRACE 2023-06-26 16:26:02,437 [shard 2] http - /?delete - client.cc:442 - request_stream.send_some 3111
# Chunked reponse for plural delete received
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /?delete - client.cc:297 - chunk received, chunk length 278
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /?delete - client.cc:297 - chunk received, chunk length 122
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /?delete - client.cc:367 - not all consumed, noctets 120, input size 2, output size 116, ec system:0
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /?delete - client.cc:297 - chunk received, chunk length 5
DEBUG 2023-06-26 16:26:02,966 [shard 2] http - iobuf_body.cc:82 - reader - finish called
DEBUG 2023-06-26 16:26:02,966 [shard 2] client_pool - client_pool.cc:318 - releasing a client, pool size: 19, capacity: 20

TRACE 2023-06-26 16:26:02,966 [shard 2] s3 - s3_client.cc:824 - send https request:
DELETE /00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin HTTP/1.1
User-Agent: redpanda.vectorized.io
Host: panda-bucket-e40ab074-143d-11ee-92f6-c94316310214.s3.us-west-2.amazonaws.com
Content-Length: 0
x-amz-date: 20230626T162602Z
x-amz-content-sha256: [secret]
Authorization: [secret]

# DeleteObject request, reusing the connection - fails
DEBUG 2023-06-26 16:26:02,966 [shard 2] http - [/00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin] - client.cc:102 - reusing connection, age 0
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin - client.cc:442 - request_stream.send_some 0
# EOF is read of the stream
TRACE 2023-06-26 16:26:02,966 [shard 2] http - /00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin - client.cc:297 - chunk received, chunk length 0
DEBUG 2023-06-26 16:26:02,966 [shard 2] http - /00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin - client.cc:264 - 'beast.http:26' error triggered
WARN  2023-06-26 16:26:02,966 [shard 2] http - /00000000/meta/kafka/topic-xlqamfleia/0_34/manifest.bin - client.cc:377 - receive error boost::system::system_error (unexpected eof in body [beast.http:26])
INFO  2023-06-26 16:26:02,966 [shard 2] s3 - util.cc:83 - Server disconnected: 'unexpected eof in body [beast.http:26]', retrying HTTP request

This behaviour can be either due to a bug in the client or S3 legitimately terminating the connection. I find the later unlikely as this always happens for the same sequence of requests (the scrubber attempted the same requests multiple times and the failure is always the same).

VladLazar avatar Jun 27 '23 10:06 VladLazar

Full log for broker performing the purge attached:

redpanda.log

VladLazar avatar Jun 27 '23 10:06 VladLazar

This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

github-actions[bot] avatar Feb 16 '24 06:02 github-actions[bot]

This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.

github-actions[bot] avatar Mar 04 '24 06:03 github-actions[bot]

We need to still chase this. cc: @dotnwat

piyushredpanda avatar Mar 04 '24 06:03 piyushredpanda