risingwave icon indicating copy to clipboard operation
risingwave copied to clipboard

`test_backfill_tombstone` failed in `main-cron` backfill tests: `key UserKey { 1004, TableKey { 0000008558294d8b000000 } } epoch EpochWithGap(6017238017507328) >= prev epoch EpochWithGap(6017238017507328)`

Open kwannoel opened this issue 1 year ago • 1 comments

buildkite: https://buildkite.com/risingwavelabs/main-cron/builds/1899#018deb4c-29e0-4ad8-a9a9-bc7c5ea713a4

Compute node logs: Screenshot 2024-02-28 at 1 33 23 PM

You can run the tests locally, see ci/scripts/run-backfill-tests.sh.

The test which fails is:

e2e, test_backfill_tombstone

You can comment out the other tests at the bottom of the file while debugging.

kwannoel avatar Feb 28 '24 05:02 kwannoel

2024-02-27T16:22:45.072413567Z  WARN risingwave_stream::executor::dml: txn_id=4294967296 Too many chunks for atomicity. Sent them to the downstream anyway.
thread 'rw-compaction' panicked at /risingwave/src/storage/hummock_sdk/src/key.rs:1047:21:
key UserKey { 1004, TableKey { 0000008558294d8b000000 } } epoch EpochWithGap(6017238017507328) >= prev epoch EpochWithGap(6017238017507328)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/core/src/panicking.rs:72:14
   2: observe_multi_version<bytes::bytes::Bytes, bytes::bytes::Bytes, core::iter::adapters::map::Map<core::slice::iter::Iter<(risingwave_hummock_sdk::EpochWithGap, risingwave_storage::hummock::value::HummockValue<bytes::bytes::Bytes>)>, risingwave_storage::hummock::compactor::shared_buffer_compact::merge_imms_in_memory::{async_fn#0}::{closure_env#0}>>
   3: {async_fn#0}
             at ./src/storage/src/hummock/compactor/shared_buffer_compact.rs:335:38
   4: poll<risingwave_storage::hummock::compactor::shared_buffer_compact::merge_imms_in_memory::{async_fn_env#0}>

kwannoel avatar Feb 28 '24 05:02 kwannoel

Is there any recent change to the test itself? I saw that the test can still pass yesterday. I checked the diff commits and didn't see any suspicious changes.

The assertion should never be hit regardless.

hzxa21 avatar Feb 28 '24 08:02 hzxa21

Is there any recent change to the test itself? I saw that the test can still pass yesterday. I checked the diff commits and didn't see any suspicious changes.

The assertion should never be hit regardless.

Last change to test seems to be 28 days ago. I think it may not happen every run perhaps. It depends on compaction which may not happen deterministically.

kwannoel avatar Feb 28 '24 08:02 kwannoel

https://buildkite.com/risingwavelabs/main-cron/builds/1899#018deb4c-297e-49f5-ae69-027fae04caf6

This test also triggers the same assertion.

hzxa21 avatar Feb 28 '24 10:02 hzxa21

Root cause found:

  • In CI, we build risingwave_compaction_test and risingwavew_cmd_all together. https://github.com/risingwavelabs/risingwave/blob/82dba308595870876eb9fab6465d9782f78ba255/ci/scripts/build.sh#L48-L56
  • enable_test_epoch feature is enabled for risingwave_compaction_test in dependencies so it will be also included in risingwave_cmd_all due to the feature unification mechanism in rust. https://github.com/risingwavelabs/risingwave/blob/82dba308595870876eb9fab6465d9782f78ba255/src/tests/compaction_test/Cargo.toml#L27
  • When enable_test_epoch feature is enabled, the spill_offset (the lower 16bit) will be ignored when constructing a epoch. When spill happens in the test (re-enabled by default a few days ago via #15240), there can be user key with the same epoch, triggering the assertion. https://github.com/risingwavelabs/risingwave/blob/82dba308595870876eb9fab6465d9782f78ba255/src/storage/hummock_sdk/src/lib.rs#L306

The reason why we cannot repro it locally is because build.sh won't be run in our local repro and only risingwave_cmd_all will be built. In this case, enable_test_epoch won't be enabled. I verified locally that enable_test_epoch is indeed enabled when running the exact same command in build.sh.

The usage of enable_test_epoch feature is hacky and prone to issues. @wcy-fdu has a PR to remove it. We will prioritize merging it.

hzxa21 avatar Feb 29 '24 06:02 hzxa21