KAFKA-19893: Reduce tiered storage redundancy with delayed upload (Topic-level feature) (KIP-1241)

Open jiafu1115 opened this issue 2 months ago • 9 comments

JIRA:19893 KIP:1241

Currently, Kafka uploads all non-active local log segments to remote storage even when they are still within the local retention period, so the same data is stored redundantly in both tiers. This wastes storage capacity (and cost) without providing any immediate benefit, since reads during the local retention window are served from local data first.

However, some users and topics rely on remote storage for real-time analytics and need the latest data to be available there as soon as possible. (Strictly speaking, remote storage is only as up to date as possible; it can never include the very latest data, because the active segment has not been uploaded yet.) Therefore, this optimization is offered as an optional topic-level configuration rather than as the default behavior.

Here are some additional thoughts/considerations.

  1. Local files won’t be deleted until they’ve been uploaded to remote storage, so this change is safe: you don’t need to worry about files being cleaned up before they are uploaded.
  2. Considering the latency of remote storage, the local retention period won’t be set too short. For example, in our production environment, we keep 1 day of local data alongside 3-7 days in remote storage, so there’s still 1 day of redundancy.
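The two rules above (upload is delayed, and local deletion still waits for the upload) can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: the `Segment` record, the `uploadDelay` parameter, and both method names are hypothetical and do not correspond to Kafka classes or configs.

```java
import java.time.Duration;

public class DelayedUploadSketch {

    /** Hypothetical view of a local log segment; not a Kafka class. */
    public record Segment(boolean active, boolean uploaded, Duration age) {}

    /**
     * A closed, not-yet-uploaded segment becomes eligible for upload once its
     * age reaches the (hypothetical) upload delay. A delay of zero models the
     * current behavior: upload as soon as the segment is rolled.
     */
    public static boolean eligibleForUpload(Segment s, Duration uploadDelay) {
        return !s.active() && !s.uploaded() && s.age().compareTo(uploadDelay) >= 0;
    }

    /**
     * Local deletion requires BOTH an expired local retention period AND a
     * completed upload, so delaying the upload can never drop un-uploaded data.
     */
    public static boolean eligibleForLocalDeletion(Segment s, Duration localRetention) {
        return !s.active() && s.uploaded() && s.age().compareTo(localRetention) >= 0;
    }

    public static void main(String[] args) {
        Duration localRetention = Duration.ofDays(1);
        Duration uploadDelay = Duration.ofHours(20); // assumed value, for illustration only

        Segment young = new Segment(false, false, Duration.ofHours(2));
        Segment expiredNotUploaded = new Segment(false, false, Duration.ofHours(30));

        System.out.println("young eligible for upload: "
                + eligibleForUpload(young, uploadDelay));
        System.out.println("expired but not uploaded, deletable locally: "
                + eligibleForLocalDeletion(expiredNotUploaded, localRetention));
    }
}
```

In this sketch, a segment that has passed local retention but has not been uploaded stays on local disk until the upload succeeds.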

Example for the goal: image

jiafu1115 avatar Nov 18 '25 11:11 jiafu1115

Attached test result: [Precondition] Create one topic with remote storage enabled in a Kafka cluster (3 brokers + 3 controllers)

local storage time: 20 minutes
remote storage time: 40 minutes
partitions: 3
segment.bytes: 10M
image
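For reference, the precondition above could be expressed with standard Kafka topic configs roughly as below. This is a sketch under assumptions: "local storage time" is mapped to `local.retention.ms`, "remote storage time" is assumed to mean the total `retention.ms`, and "10M" is assumed to mean 10 MiB. The `TestTopicConfig` class is hypothetical and only builds the key/value map; it does not talk to a cluster.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

public class TestTopicConfig {

    /** Builds the topic-level config map assumed for the test topic. */
    public static Map<String, String> build() {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("remote.storage.enable", "true");
        // local storage time: 20 minutes
        config.put("local.retention.ms", String.valueOf(Duration.ofMinutes(20).toMillis()));
        // remote storage time: 40 minutes (assumed to be the total retention)
        config.put("retention.ms", String.valueOf(Duration.ofMinutes(40).toMillis()));
        // segment.bytes: 10M (assumed to be 10 MiB)
        config.put("segment.bytes", String.valueOf(10 * 1024 * 1024));
        return config;
    }

    public static void main(String[] args) {
        // Print in key=value form, as it could be passed to topic creation tooling.
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```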

[Steps]

  1. Deploy this code patch to one broker only and restart that broker
  2. Keep sending a lot of messages to the topic
  3. Check the disk usage of both local and remote storage at two points in time: within the first 20 minutes and after 1 hour.

[Result]

Within the first 20 minutes:

  1. Only 2 partitions upload local segments to remote storage.

After 1 hour:

  1. The remote storage size for one partition (on the broker with the code change) is much smaller than the other two. image
  2. The sizes of the local disks are similar. image

jiafu1115 avatar Nov 18 '25 11:11 jiafu1115

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar Nov 26 '25 03:11 github-actions[bot]

Hi, @kamalcph Sorry to bother you. I know you’ve been deeply involved in the remote storage area, and I was wondering if you might be interested — when you have some free time — in taking a look at this cost-saving topic and providing some guidance. Thank you very much!

jiafu1115 avatar Nov 28 '25 02:11 jiafu1115

image image

cc @kamalcph: I am posting the images here because the community mailing list doesn't allow image attachments. We can continue the discussion of the KIP content over email. Thanks

jiafu1115 avatar Dec 02 '25 10:12 jiafu1115

@jiafu1115

Segments that have already been uploaded are eligible for deletion from the broker. So, when remote storage is down, those segments can still be deleted according to the local retention settings, and new segments can occupy that space. This gives the admin more time to act when remote storage is down for an extended period.

kamalcph avatar Dec 02 '25 11:12 kamalcph

@kamalcph I think I understand what you mean now. I’ve updated the picture above. Could you help double-check whether we’ve reached the same understanding? The drawback of this KIP is that, during a long remote storage outage, local disk usage grows sooner, so the admin may need one extra disk expansion. The maximum extra usage equals the redundant part we are saving. After the outage is resolved, usage returns to where it started. Right?
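A rough way to check this reasoning is the simplified model below. It is only a sketch under stated assumptions, not something taken from the KIP: ingest rate is constant, local usage is measured in hours of data, every closed segment is already uploaded when the outage starts in the immediate-upload case, and none is uploaded in the delayed-upload case. The `OutageDiskModel` class and its method names are hypothetical.

```java
public class OutageDiskModel {

    /**
     * Immediate upload: already-uploaded segments keep expiring during the
     * outage, so usage stays at the local retention until the outage lasts
     * longer than that, then grows with the outage.
     */
    public static long immediateUploadHours(long localRetentionHours, long outageHours) {
        return Math.max(localRetentionHours, outageHours);
    }

    /**
     * Delayed upload: nothing held locally has been uploaded yet, so nothing
     * can be deleted and usage grows from the first hour of the outage.
     */
    public static long delayedUploadHours(long localRetentionHours, long outageHours) {
        return localRetentionHours + outageHours;
    }

    /** Extra local usage caused by delaying the upload; bounded by the local retention. */
    public static long extraHours(long localRetentionHours, long outageHours) {
        return delayedUploadHours(localRetentionHours, outageHours)
                - immediateUploadHours(localRetentionHours, outageHours);
    }

    public static void main(String[] args) {
        long localRetentionHours = 24; // 1 day of local data, as in the production example
        for (long outageHours : new long[] {6, 24, 48}) {
            System.out.println("outage=" + outageHours + "h extra="
                    + extraHours(localRetentionHours, outageHours) + "h of data");
        }
    }
}
```

Under these assumptions the extra usage is `min(localRetention, outage)`, so it never exceeds the local retention window, which is the redundant part this change saves.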

jiafu1115 avatar Dec 02 '25 12:12 jiafu1115
