storage: improve handling of out-of-order timestamps within a log segment

Open · jcsp opened this issue 2 years ago · 2 comments

(Follow up to https://github.com/redpanda-data/redpanda/pull/4010)

The order of messages in a segment does not have to be the same as the order of their timestamps:

  • We often issue a configuration batch that has a higher timestamp than some Kafka batches that were waiting for the segment to be ready, as in #3924
  • Even if all our self-imposed timestamps were smooth, users can still specify CreateTime for their log policy, and thereby send in totally arbitrary timestamps.

In index_state, we store relative timestamps, computed as the difference from the segment's base timestamp (the timestamp of the first batch that was appended). This caused the #3924 problem: that difference could be negative, but it was being stored in an unsigned uint32_t. We fixed that in #4010 by clamping negative values to zero, which makes the index potentially wrong when timestamps are out of order: many or even all of the batches in the segment may show up in the index with relative time = 0 from the segment's base_timestamp.

More generally, there is no guarantee at all that the incoming data is even within 2^32 milliseconds of the base timestamp. Consider the case where MirrorMaker 2 is copying across timestamps from two months ago: the base timestamp will be current wall-clock time, taken from a Raft configuration batch, while all of the real data timestamps will be more than 2^32 milliseconds away from it.

Also, on log truncate we currently take the most recent timestamp from the truncated index as the max timestamp, which is wrong: we should be searching through the remaining batches to find the new max timestamp, as they can be in any order. This is what Apache Kafka does during a truncate.
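The truncation point above can be sketched as follows. This is an illustrative Python sketch, not Redpanda code; the function name and the `(offset, timestamp)` tuple model are hypothetical:

```python
# Sketch: after truncating a log, the new max timestamp must be found by
# scanning all remaining batches, because timestamps can appear in any
# order. Taking the timestamp of the last surviving batch is not enough.

def max_timestamp_after_truncate(batches, truncate_offset):
    """batches: list of (base_offset, max_timestamp_ms) tuples in offset order.
    Keep only batches below the truncation point, then take the max over
    their timestamps -- not just the last one."""
    remaining = [ts for off, ts in batches if off < truncate_offset]
    return max(remaining) if remaining else None
```

With batches timestamped `[100, 500, 200, 300]` and a truncation at offset 3, the last surviving batch has timestamp 200, but the correct max is 500, which only a full scan finds.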

So, what should we change? Not exactly sure, but:

  • The index_state use of uint32_t relative encoding needs to change to account for out of order timestamps. Maybe offset the relative values by 2^31 so that we can tolerate timestamps that far in either direction from the base timestamp?
  • The mixing of raft_configuration batch timestamps and Kafka batch timestamps probably needs to stop: these can be arbitrarily far apart. Do we ever care about non-data batch timestamps in the index? Maybe we should only index on data batches, so that if the user is doing CreateTime timestamps then they will be okay as long as the users data is all timestamped within a reasonable range of each other.
  • We need to handle time deltas that are >2^32 (or >2^31, with the above change) in a consistent way. Maybe clamp them and document it: if users claim that messages within the same segment are really that far apart, we will just overrule them and/or generate a crappy low-resolution index. However, that expectation is bogus if we hold segments open for arbitrarily long, as we currently do: if someone leaves the same redpanda process running for months with a 1GB segment size and writes only occasional batches, it is entirely reasonable for the batches' timestamps to be months apart. So maybe we need to implement time-based log rolling, to guarantee that timestamps within a segment are within a defined range of one another.

jcsp avatar Mar 17 '22 21:03 jcsp

@jcsp i think something like a linked-list of uint32_t

[base [list]], [base2 [list2]], [base3 [list3]]

the benefit of uint32 was fast lookups, but totally fine to come up w/ something else. that was a poor man's compression algo that has found the limits of its usage. good find.
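The chunked-base idea above might look something like this sketch (all names are hypothetical; real index entries would also carry offsets and file positions):

```python
# Sketch of the "linked list of uint32" idea: each chunk carries its own
# base timestamp, and entries within a chunk are uint32-sized deltas from
# that base. When a new timestamp doesn't fit the current chunk's range,
# a fresh chunk with a new base is started.

U32_MAX = 2**32 - 1

class ChunkedTimeIndex:
    def __init__(self):
        self.chunks = []  # list of (base_ms, [delta, ...])

    def append(self, timestamp_ms):
        if self.chunks:
            base, deltas = self.chunks[-1]
            delta = timestamp_ms - base
            if 0 <= delta <= U32_MAX:
                deltas.append(delta)
                return
        # out of range (or first entry): start a new chunk based here
        self.chunks.append((timestamp_ms, [0]))

    def timestamps(self):
        return [base + d for base, deltas in self.chunks for d in deltas]
```

An out-of-order or far-away timestamp simply opens a new chunk, so nothing is clamped; the cost is that lookups must first locate the right chunk.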

emaxerrno avatar Sep 30 '22 04:09 emaxerrno

Here's what's getting fixed in 22.3 (and backported to 22.2):

  • a real test for time queries.
  • segment indices will ignore non-data batches when indexing, unless it is the first batch in the segment. If a subsequent data batch appears, its timestamps will override those of the previous non-data batch: this way a batch with a mixture of user and internal timestamps will have an index that reflects the user timestamps.
  • our timequery path for local storage will be fixed to find the batch containing the timestamp, rather than the subsequent batch
  • our timequery path for local storage will be fixed to seek to the correct record within a batch, rather than just returning the start offset of the batch
  • making time queries work for cloud storage topics (currently a time query on these will only start searching from the start of the local log)
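The two timequery fixes in the list above amount to the behaviour sketched below: return the first record offset whose timestamp satisfies the query, which may sit *inside* a batch rather than at a batch boundary. This is an illustrative model, not the real lookup path; records are modelled as `(offset, timestamp)` pairs:

```python
# Sketch of the corrected timequery semantics: scan records in offset
# order and return the first offset whose timestamp is >= the queried
# timestamp. Because timestamps may be out of order, a later record can
# never "shadow" an earlier qualifying one.

def timequery(records, needle_ms):
    for offset, ts in records:           # records in offset order
        if ts >= needle_ms:
            return offset
    return None                          # needle is past the end of the log
```

Note that with out-of-order timestamps (e.g. `[100, 150, 90, 200]`), a query for 120 lands on offset 1, even though offset 2 has a smaller timestamp.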

There is still work to do here (not exhaustive list):

  • Making the timequery lookup ignore segments that do not contain any user data (these segments will have timestamps that aren't meaningful when the user is doing CreateTime, and in any case cannot contain any batches of interest to the Kafka client)
  • Fixing log truncation to update the max timestamp in a segment correctly, rather than just setting it to the timestamp of the last batch in the truncated segment.
  • Add tests for the above cases: e.g. in TimeQueryTest, when using CreateTime timestamps far from the wall clock, do some empty log rolls to generate segments with config batches only, and ensure that these batches do not trip up a time query targeting a user-supplied timestamp that should land after the config batches (even though the config batches will have timestamps far ahead of the query point).

jcsp avatar Sep 30 '22 13:09 jcsp

Here's what was done for v23.1 (#8509):

  • The relative time index is used whenever possible (i.e. when the batches in the segment have monotonic timestamps).
  • Support out-of-order, user-provided timestamps by offsetting the relative time index by 2^31. This gives a representable range of roughly -596h...+596h around the base timestamp. We also set the default maximum time a segment can stay open before rolling to 2 weeks, which ensures that timestamps are always in range when users don't provide their own; user-provided timestamps outside the range are clamped into it.
  • Improve the heuristic for finding the maximum timestamp on truncation. Previously we simply took the last entry in the index, which could be off by 64KB. Now we take the timestamp from the batch preceding the truncation point.

Generally, these changes make the semantics for timequery stronger: "a timequery for timestamp 'needle' will return the first offset in the log where 'timestamp >= needle'".

What's left here is to fix the mixing of timestamps coming from data and configuration batches. If a segment has configuration batches, the corresponding segment index will contain a walltime timestamp; if the data batches have user-provided timestamps, then it's no bueno. We should stash a flag somewhere that indicates whether the segment has any data batches, and use that to skip such segments for "free" (in lock_manager::range_lock(const timequery_config& cfg)).
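The suggested flag could work roughly as sketched here. This is a hypothetical illustration in Python, not the C++ lock_manager code; the `has_data_batches` field is the proposed flag:

```python
# Sketch of the "skip segments with no data batches" idea: a per-segment
# flag lets the timequery range selection drop segments whose index
# timestamps come only from internal (configuration) batches, so a
# CreateTime timequery never consults their meaningless walltime stamps.

def segments_for_timequery(segments):
    """segments: list of dicts with a 'has_data_batches' bool.
    Only segments containing user data can hold a batch of interest
    to a Kafka client, so only those are worth searching."""
    return [s for s in segments if s["has_data_batches"]]
```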

VladLazar avatar Feb 06 '23 16:02 VladLazar