Fix timequery returning wrong offset after trim-prefix which could lead to stuck consumers

Open nvartolomei opened this issue 10 months ago • 7 comments

The fix works only if compression is not in use. We need follow-up work that decompresses the batches to find the exact offset to return, or (!) we could prevent a trim offset from falling inside a batch in that case.

Backports Required

  • [ ] none - not a bug fix
  • [ ] none - this is a backport
  • [ ] none - issue does not exist in previous branches
  • [ ] none - papercut/not impactful enough to backport
  • [x] v24.1.x
  • [x] v23.3.x
  • [ ] v23.2.x

Release Notes

Bug Fixes

  • Fix a scenario where list_offset with a timestamp could return an offset lower than the partition start offset after a trim-prefix command. This could leave consumers stuck with an out-of-range-offset exception if they began consuming from an offset below the one used in the trim-prefix command.
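
To illustrate the scenario in the release note, here is a minimal sketch of a timequery that clamps its result to the partition start offset. It is not the Redpanda implementation: the `Batch` type, the `timequery` function, and the batch-level timestamp check are assumptions made for illustration only. It works at batch granularity, which is also why an exact answer for a compressed batch would require decompressing it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Batch:
    """Hypothetical stand-in for a record batch (not a Redpanda type)."""
    base_offset: int    # offset of the first record in the batch
    last_offset: int    # offset of the last record in the batch
    max_timestamp: int  # largest record timestamp in the batch


def timequery(batches: List[Batch], start_offset: int, ts: int) -> Optional[int]:
    """Return an offset >= start_offset for the first batch that may hold ts.

    start_offset is the partition start after a trim-prefix and may fall
    in the middle of a batch.
    """
    for batch in batches:
        if batch.last_offset < start_offset:
            # The whole batch lies below the trim point; skip it.
            continue
        if batch.max_timestamp >= ts:
            # Clamp: never report an offset below the partition start,
            # even when the matching batch straddles the trim point.
            return max(batch.base_offset, start_offset)
    return None  # no batch has a timestamp at or after ts
```

With batches covering offsets 0-9 and 10-19 and a trim-prefix at offset 5, an unclamped lookup for an early timestamp would report offset 0, which a consumer cannot fetch. The clamped version reports 5.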

nvartolomei avatar Apr 26 '24 18:04 nvartolomei

/dt

nvartolomei avatar Apr 26 '24 18:04 nvartolomei

new failures in https://buildkite.com/redpanda/redpanda/builds/48359#018f1bfc-2f63-4411-a634-cf0b04ce2121:

"rptest.tests.timequery_test.TimeQueryTest.test_timequery_with_trim_prefix.cloud_storage=True.spillover=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/48359#018f1bfc-2f6e-44ea-b952-03424f452652:

"rptest.tests.timequery_test.TimeQueryTest.test_timequery_with_trim_prefix.cloud_storage=True.spillover=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/48359#018f1bfc-2f6a-4bb0-a615-ef80b9171807:

"rptest.tests.timequery_test.TimeQueryTest.test_timequery_with_trim_prefix.cloud_storage=False.spillover=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/48359#018f1c03-5241-4440-8611-7211f0dc7557:

"rptest.tests.timequery_test.TimeQueryTest.test_timequery_with_trim_prefix.cloud_storage=True.spillover=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/48359#018f1c03-523f-49ff-9ce6-c3d7891e976a:

"rptest.tests.timequery_test.TimeQueryTest.test_timequery_with_trim_prefix.cloud_storage=True.spillover=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/48359#018f1c03-523c-4deb-b28c-f6da1f88bbb3:

"rptest.tests.timequery_test.TimeQueryTest.test_timequery_with_trim_prefix.cloud_storage=False.spillover=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/48489#018f2fdb-6ae5-408d-a787-ec3ba9f51914:

"rptest.tests.timequery_test.TimeQueryTest.test_timequery_with_trim_prefix.cloud_storage=True.spillover=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/48489#018f2fdb-6aed-42ff-95b3-6f79da4b9bc5:

"rptest.tests.cluster_config_test.ClusterConfigAliasTest.test_aliasing_with_upgrade.wipe_cache=False.prop_set=PropertyAliasData.primary_name=.cloud_storage_graceful_transfer_timeout_ms.aliased_name=.cloud_storage_graceful_transfer_timeout.redpanda_version=.23.2.test_values=.1234.1235.1236.expect_restart=False"

vbotbuildovich avatar Apr 26 '24 20:04 vbotbuildovich

  • Rebased on dev to resolve conflicts after a PR introduced by Willem to fix an unrelated timequery bug.
  • Addressed reviewer's comments.
  • Updated offset_range to bounded_offset_range to avoid misuse. It is less useful and more useful both at the same time!

Let's see what CI says.

nvartolomei avatar Apr 30 '24 15:04 nvartolomei

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/48489#018f2fe3-c4cc-4176-acb7-f136b4e36f1f

vbotbuildovich avatar Apr 30 '24 18:04 vbotbuildovich

Force pushed:

  • improve commit message as requested by @andrwng
  • add a fix for an edge case where cloud storage shouldn't be read at all https://github.com/redpanda-data/redpanda/pull/18112/commits/f906e2480a47194242286a606b46389479f896fa
  • commented out the trim-prefix tests with tiered storage, as they run into an (existing) edge case which will be addressed in another PR

nvartolomei avatar Apr 30 '24 19:04 nvartolomei

Force push:

  • Fix off-by-one error in reader max offset
  • Rename bounded_offset_range to bounded_offset_interval and redesign the API to make it easier to use correctly/harder to misuse
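
As a rough illustration of what "harder to misuse" can mean for an interval type, the sketch below validates its bounds at construction and treats both ends as inclusive, which removes one common source of off-by-one errors. It is a hypothetical Python analogue, not the C++ API from this PR; the class name, field names, and methods are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundedOffsetInterval:
    """Hypothetical closed interval [min_offset, max_offset] of offsets."""
    min_offset: int
    max_offset: int

    def __post_init__(self) -> None:
        # Reject invalid intervals up front so callers never observe one.
        if self.min_offset < 0 or self.max_offset < self.min_offset:
            raise ValueError(
                f"invalid interval: [{self.min_offset}, {self.max_offset}]"
            )

    def contains(self, offset: int) -> bool:
        # Both bounds are inclusive.
        return self.min_offset <= offset <= self.max_offset

    def overlaps(self, other: "BoundedOffsetInterval") -> bool:
        # Two closed intervals overlap when each starts before the other ends.
        return (
            self.min_offset <= other.max_offset
            and other.min_offset <= self.max_offset
        )
```

Because the instance is frozen and checked once, every consumer of the interval can rely on `min_offset <= max_offset` without re-validating it.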

nvartolomei avatar Apr 30 '24 20:04 nvartolomei

The last two force-pushes fix some typos in the text.

nvartolomei avatar Apr 30 '24 20:04 nvartolomei

Merging this to unblock https://github.com/redpanda-data/redpanda/pull/18097. Will address comments as follow ups.

nvartolomei avatar May 07 '24 11:05 nvartolomei

/backport v24.1.x

vbotbuildovich avatar May 07 '24 11:05 vbotbuildovich

/backport v23.3.x

vbotbuildovich avatar May 07 '24 11:05 vbotbuildovich

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-18112-v23.3.x-984 remotes/upstream/v23.3.x
git cherry-pick -x 99d2bec5f7ee765cb3de88b446278262f6dae84f d97d61fb8a9b06fbde9dcf7ff03799bf200b561b 4f87afa392201c493f24f21af3e1cd7f0727649f f13bfa6c490490487d9a926c9a5d4e441adc3ca6 76a1ea2452b09a5730f2574646fc06ab2b8b8e32 f9ed5cabe479b355d370bec6bc9b693ad2928f3c a40999d2a09e0586c3fa81521c4d5fb5d0abc9dc 8f2de964c0e915f4f10ae8eb74400e6288c5680f

Workflow run logs.

vbotbuildovich avatar May 07 '24 11:05 vbotbuildovich