
Adjacent segment merging for segments in S3

Open · jcsp opened this issue on Sep 16, 2022 · 0 comments

This is similar to what we currently do for local disk segments in compacted topics, where it is called "adjacent segment compaction". Here I call it "adjacent segment merging" to distinguish the process of concatenating two segments from true compaction, where duplicate keys are dropped.

Problem

The shadow indexing manifest contains a list of all segments, so smaller segments mean more entries and a larger manifest. If the manifest grows too large (beyond some as-yet-unmeasured limit), we will hit memory issues and I/O + CPU efficiency issues when writing it.

That manifest issue would push us toward larger segments (e.g. our current 1GB default segment size). However, larger segments are also problematic:

  • We can't generate large segments reliably, because segments roll whenever leadership transfers or a node restarts.
  • We cannot accumulate large segments if constrained-RPO replication is required, because each time we upload fresh data for a partition to S3 it is written as a new segment; these segments may be tiny if the write rate is low.

Solution

To let the shadow indexing code reliably control the segment size in S3 over long periods, it needs to be able to rewrite history by concatenating smaller segments into larger ones. A relatively simple algorithm suffices: apply a minimum and a maximum segment size (e.g. 128MB and 256MB, or some other configurable limits), download two or more adjacent segments, concatenate them, upload the joined segment, then update the manifest to point to the new segment and mark the old segments for cleanup. A sketch of that loop follows.
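As a rough illustration of the shape of that algorithm, here is a minimal C++ sketch. The types (`segment_meta`, `manifest`) and the I/O helpers (`download`, `upload`, `mark_for_cleanup`) are hypothetical placeholders, not Redpanda's actual cloud_storage API; the point is only to show how the min/max size limits drive the selection of adjacent segments and the ordering of the merge steps.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration only -- these names and types do not match
// Redpanda's real cloud_storage code.

struct segment_meta {
    std::string name;            // object key in S3
    uint64_t size_bytes;         // size of the uploaded segment
    int64_t base_offset;
    int64_t committed_offset;
};

struct manifest {
    std::vector<segment_meta> segments; // ordered by base_offset
};

// Configurable limits, e.g. 128MB / 256MB as suggested above.
constexpr uint64_t min_merged_size = 128ull * 1024 * 1024;
constexpr uint64_t max_merged_size = 256ull * 1024 * 1024;

// Find a run of adjacent segments starting at `begin` whose combined size
// reaches min_merged_size without exceeding max_merged_size. Returns the
// one-past-the-end index of the run, or `begin` if no viable run exists.
size_t find_merge_run(const manifest& m, size_t begin) {
    uint64_t total = 0;
    size_t end = begin;
    while (end < m.segments.size()
           && total + m.segments[end].size_bytes <= max_merged_size) {
        total += m.segments[end].size_bytes;
        ++end;
    }
    // Only merge if at least two segments combine into a "large" segment.
    return (end - begin >= 2 && total >= min_merged_size) ? end : begin;
}

// Placeholder I/O helpers: a real implementation would issue remote
// reads and writes against the bucket instead of these stubs.
std::string download(const segment_meta&) { return {}; }
void upload(const std::string& /*name*/, const std::string& /*body*/) {}
void mark_for_cleanup(const segment_meta&) {}

// One pass over the manifest: concatenate each viable run of small
// adjacent segments into one larger segment and rewrite the manifest
// entry, marking the replaced segments for cleanup afterwards.
void merge_adjacent_segments(manifest& m) {
    manifest out;
    size_t i = 0;
    while (i < m.segments.size()) {
        size_t run_end = find_merge_run(m, i);
        if (run_end == i) { // nothing to merge here, keep the entry as-is
            out.segments.push_back(m.segments[i]);
            ++i;
            continue;
        }
        // Download and concatenate the run into a single object.
        std::string joined;
        for (size_t j = i; j < run_end; ++j) {
            joined += download(m.segments[j]);
        }
        segment_meta merged{
            m.segments[i].name + "-merged",
            static_cast<uint64_t>(joined.size()),
            m.segments[i].base_offset,
            m.segments[run_end - 1].committed_offset,
        };
        upload(merged.name, joined);
        out.segments.push_back(merged);
        // The new manifest points at the merged segment; the originals are
        // only deleted after the updated manifest is safely uploaded.
        for (size_t j = i; j < run_end; ++j) {
            mark_for_cleanup(m.segments[j]);
        }
        i = run_end;
    }
    m = std::move(out);
}
```

In a real implementation the merge would also have to coordinate with ongoing uploads for the partition, but the ordering shown above (upload the merged segment, rewrite the manifest, then clean up the old objects) is what keeps the process safe to interrupt and retry.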

Impact

This will reduce operational risk for long-running clusters with tiered storage enabled and long retention periods.
