risingwave
risingwave copied to clipboard
`test_backfill_tombstone` failed in `main-cron` backfill tests: `key UserKey { 1004, TableKey { 0000008558294d8b000000 } } epoch EpochWithGap(6017238017507328) >= prev epoch EpochWithGap(6017238017507328)`
buildkite: https://buildkite.com/risingwavelabs/main-cron/builds/1899#018deb4c-29e0-4ad8-a9a9-bc7c5ea713a4
Compute node logs:
You can run the tests locally, see ci/scripts/run-backfill-tests.sh
.
The test which fails is:
e2e, test_backfill_tombstone
You can comment out the other tests at the bottom of the file while debugging.
2024-02-27T16:22:45.072413567Z WARN risingwave_stream::executor::dml: txn_id=4294967296 Too many chunks for atomicity. Sent them to the downstream anyway.
thread 'rw-compaction' panicked at /risingwave/src/storage/hummock_sdk/src/key.rs:1047:21:
key UserKey { 1004, TableKey { 0000008558294d8b000000 } } epoch EpochWithGap(6017238017507328) >= prev epoch EpochWithGap(6017238017507328)
stack backtrace:
0: rust_begin_unwind
at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:645:5
1: core::panicking::panic_fmt
at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/core/src/panicking.rs:72:14
2: observe_multi_version<bytes::bytes::Bytes, bytes::bytes::Bytes, core::iter::adapters::map::Map<core::slice::iter::Iter<(risingwave_hummock_sdk::EpochWithGap, risingwave_storage::hummock::value::HummockValue<bytes::bytes::Bytes>)>, risingwave_storage::hummock::compactor::shared_buffer_compact::merge_imms_in_memory::{async_fn#0}::{closure_env#0}>>
3: {async_fn#0}
at ./src/storage/src/hummock/compactor/shared_buffer_compact.rs:335:38
4: poll<risingwave_storage::hummock::compactor::shared_buffer_compact::merge_imms_in_memory::{async_fn_env#0}>
Is there any recent change to the test itself? I saw that the test can still pass yesterday. I checked the diff commits and didn't see any suspicious changes.
The assertion should never be hit regardless.
Is there any recent change to the test itself? I saw that the test can still pass yesterday. I checked the diff commits and didn't see any suspicious changes.
The assertion should never be hit regardless.
Last change to test seems to be 28 days ago. I think it may not happen every run perhaps. It depends on compaction which may not happen deterministically.
https://buildkite.com/risingwavelabs/main-cron/builds/1899#018deb4c-297e-49f5-ae69-027fae04caf6
This test also triggers the same assertion.
Root cause found:
- In CI, we build
risingwave_compaction_test
andrisingwavew_cmd_all
together. https://github.com/risingwavelabs/risingwave/blob/82dba308595870876eb9fab6465d9782f78ba255/ci/scripts/build.sh#L48-L56 -
enable_test_epoch
feature is enabled forrisingwave_compaction_test
in dependencies so it will be also included inrisingwave_cmd_all
due to the feature unification mechanism in rust. https://github.com/risingwavelabs/risingwave/blob/82dba308595870876eb9fab6465d9782f78ba255/src/tests/compaction_test/Cargo.toml#L27 - When
enable_test_epoch
feature is enabled, thespill_offset
(the lower 16bit) will be ignored when constructing a epoch. When spill happens in the test (re-enabled by default a few days ago via #15240), there can be user key with the same epoch, triggering the assertion. https://github.com/risingwavelabs/risingwave/blob/82dba308595870876eb9fab6465d9782f78ba255/src/storage/hummock_sdk/src/lib.rs#L306
The reason why we cannot repro it locally is because build.sh
won't be run in our local repro and only risingwave_cmd_all
will be built. In this case, enable_test_epoch
won't be enabled. I verified locally that enable_test_epoch
is indeed enabled when running the exact same command in build.sh
.
The usage of enable_test_epoch
feature is hacky and prone to issues. @wcy-fdu has a PR to remove it. We will prioritize merging it.