risingwave
risingwave copied to clipboard
bug: compactor block builder panics on assertion failed
https://github.com/singularity-data/risingwave/blob/001f5ac22ca46aae2e2df3dc265e4d0fc6c12d81/src/storage/src/hummock/sstable/block.rs#L227-L247
thread 'tokio-runtime-worker' panicked at 'assertion failed: `(left == right)`
left: `Equal`,
right: `Less`', src/storage/src/hummock/sstable/block.rs:243:13
stack backtrace:
0: rust_begin_unwind
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/panicking.rs:142:14
2: core::panicking::assert_failed_inner
3: core::panicking::assert_failed
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/panicking.rs:181:5
4: risingwave_storage::hummock::sstable::block::BlockBuilder::add
at ./src/storage/src/hummock/sstable/block.rs:243:13
5: risingwave_storage::hummock::sstable::builder::SstableBuilder::add
at ./src/storage/src/hummock/sstable/builder.rs:130:9
6: risingwave_storage::hummock::sstable::multi_builder::CapacitySplitTableBuilder<F>::add_full_key::{{closure}}
at ./src/storage/src/hummock/sstable/multi_builder.rs:121:9
7: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
8: risingwave_storage::hummock::compactor::Compactor::compact_and_build_sst::{{closure}}
at ./src/storage/src/hummock/compactor.rs:1021:92
9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
10: risingwave_storage::hummock::compactor::Compactor::compact_key_range_impl::{{closure}}
at ./src/storage/src/hummock/compactor.rs:716:10
11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
12: risingwave_storage::hummock::compactor::Compactor::compact_key_range_with_filter::{{closure}}
at ./src/storage/src/hummock/compactor.rs:782:74
13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
14: risingwave_storage::hummock::compactor::Compactor::compact::{{closure}}::{{closure}}
at ./src/storage/src/hummock/compactor.rs:556:84
15: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
16: risingwave_storage::hummock::compaction_executor::CompactionExecutor::send_request::{{closure}}
at ./src/storage/src/hummock/compaction_executor.rs:76:27
17: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
18: <core::pin::Pin<P> as core::future::future::Future>::poll
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/future.rs:124:9
19: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/task/task_local.rs:345:35
20: tokio::task::task_local::LocalKey<T>::scope_inner
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/task/task_local.rs:231:19
21: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/task/task_local.rs:342:13
22: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.35/src/instrument.rs:272:9
23: tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/core.rs:165:17
24: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/loom/std/unsafe_cell.rs:14:9
25: tokio::runtime::task::core::CoreStage<T>::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/core.rs:155:13
26: tokio::runtime::task::harness::poll_future::{{closure}}
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:480:19
27: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/panic/unwind_safe.rs:271:9
28: std::panicking::try::do_call
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panicking.rs:492:40
29: ___rust_try
30: std::panicking::try
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panicking.rs:456:19
31: std::panic::catch_unwind
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panic.rs:137:14
32: tokio::runtime::task::harness::poll_future
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:468:18
33: tokio::runtime::task::harness::Harness<T,S>::poll_inner
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:104:27
34: tokio::runtime::task::harness::Harness<T,S>::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:57:15
35: tokio::runtime::task::raw::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/raw.rs:144:5
36: tokio::runtime::task::raw::RawTask::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/raw.rs:84:18
37: tokio::runtime::task::LocalNotified<S>::run
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/mod.rs:381:9
38: tokio::runtime::thread_pool::worker::Context::run_task::{{closure}}
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:435:13
39: tokio::coop::with_budget::{{closure}}
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/coop.rs:102:9
40: std::thread::local::LocalKey<T>::try_with
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/thread/local.rs:445:16
41: std::thread::local::LocalKey<T>::with
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/thread/local.rs:421:9
42: tokio::coop::with_budget
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/coop.rs:95:5
43: tokio::coop::budget
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/coop.rs:72:5
44: tokio::runtime::thread_pool::worker::Context::run_task
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:434:9
45: tokio::runtime::thread_pool::worker::Context::run
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:401:24
46: tokio::runtime::thread_pool::worker::run::{{closure}}
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:386:17
47: tokio::macros::scoped_tls::ScopedKey<T>::set
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/macros/scoped_tls.rs:61:9
48: tokio::runtime::thread_pool::worker::run
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:383:5
49: tokio::runtime::thread_pool::worker::Launch::launch::{{closure}}
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:362:45
50: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/blocking/task.rs:42:21
51: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.35/src/instrument.rs:272:9
52: tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/core.rs:165:17
53: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/loom/std/unsafe_cell.rs:14:9
54: tokio::runtime::task::core::CoreStage<T>::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/core.rs:155:13
55: tokio::runtime::task::harness::poll_future::{{closure}}
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:480:19
56: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/panic/unwind_safe.rs:271:9
57: std::panicking::try::do_call
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panicking.rs:492:40
58: ___rust_try
59: std::panicking::try
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panicking.rs:456:19
60: std::panic::catch_unwind
at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panic.rs:137:14
61: tokio::runtime::task::harness::poll_future
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:468:18
62: tokio::runtime::task::harness::Harness<T,S>::poll_inner
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:104:27
63: tokio::runtime::task::harness::Harness<T,S>::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:57:15
64: tokio::runtime::task::raw::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/raw.rs:144:5
65: tokio::runtime::task::raw::RawTask::poll
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/raw.rs:84:18
66: tokio::runtime::task::UnownedTask<S>::run
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/mod.rs:418:9
67: tokio::runtime::blocking::pool::Task::run
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/blocking/pool.rs:91:9
68: tokio::runtime::blocking::pool::Inner::run
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/blocking/pool.rs:325:17
69: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/blocking/pool.rs:300:13
When did this happen? nexmark q7?
When did this happen? nexmark q7?
Yep. However, it's really occasional. 😢
It seems that some key with the same epoch are written to builder by twice?
I am trying to reproduce this bug. Which commit id did you use? Did this happen on compactor or CN?
I am trying to reproduce this bug. Which commit id did you use? Did this happen on compactor or CN?
613ed1a7a6faeefb8c6167b92989cb7bf46faa69, 20*3 parallelisms, on the compactor. (The same configuration for reproducing "stuck again")
Hi @hzxa21 @BugenZhao , any updates?
I cannot reproduce the bug. No updates for now. @zwang28 and I will continue the investigation.
Besides this bug, should we add kind of assertion that ensures the assumption "CNs never write same key with each other in one epoch", so that when similar issues happen, we can narrow down the scope of investigation.
Besides this bug, should we add kind of assertion that ensures the assumption "CNs never write same key with each other in one epoch", so that when similar issues happen, we can narrow down the scope of investigation.
With such sanity check , duplicate key originated from different CNs is spotted in source executor.
compactor-0 | 2022-08-22T06:23:18.554422Z ERROR risingwave_storage::hummock::validator: Duplicate key [74, 0, 0, 3, eb, 32, 34, 2d, 30, ff, f5, c6, 82, 89, de, ff, ff] in SST 62 from worker 3 and SST 42 from worker 1
The hex '32, 34, 2d, 30' in duplicate key is '24-0', that is https://github.com/singularity-data/risingwave/blob/17edbcfe73d7f40cc74713811ec7827c1c40b776/src/connector/src/source/nexmark/split.rs#L31
That's cool! cc @xxchan this might be related to the TPC-H Q18 issue. PTAL if you're interested.
Seems to be originated from nexmark connector. No problem.
It would be better if we could print out the keys for assertion failed. 😢
So there is something wrong with the state of Source? cc. @shanicky @tabVersion
I think I identified the problem. I add some log here https://github.com/singularity-data/risingwave/blob/8abbb65c4414ac08bef57cae325d93885291568c/src/meta/src/stream/source_manager.rs#L561-L573
and I got the following log
{52: ['24-18', '24-7'], 56: ['24-2', '24-5'], 36: ['24-13', '24-8'], 49: ['24-13', '24-3'], 27: ['24-0', '24-16'], 51: ['24-14', '24-23'], 28: ['24-19', '24-17'], 58: ['24-16', '24-21'], 53: ['24-15', '24-1'], 25: ['24-9', '24-14'], 60: ['24-19', '24-12'], 35: ['24-5', '24-11'], 33: ['24-1', '24-7'], 59: ['24-11', '24-8'], 50: ['24-0', '24-6'], 54: ['24-20', '24-17'], 26: ['24-23', '24-15'], 29: ['24-20', '24-2'], 31: ['24-21', '24-12'], 30: ['24-18', '24-3'], 55: ['24-22', '24-4'], 57: ['24-9', '24-10'], 32: ['24-10', '24-6'], 34: ['24-22', '24-4']}
==== build reverse index ====
{'24-18': [52, 30], '24-7': [52, 33], '24-2': [56, 29], '24-5': [56, 35], '24-13': [36, 49], '24-8': [36, 59], '24-3': [49, 30], '24-0': [27, 50], '24-16': [27, 58], '24-14': [51, 25], '24-23': [51, 26], '24-19': [28, 60], '24-17': [28, 54], '24-21': [58, 31], '24-15': [53, 26], '24-1': [53, 33], '24-9': [25, 57], '24-12': [60, 31], '24-11': [35, 59], '24-6': [50, 32], '24-20': [54, 29], '24-22': [55, 34], '24-4': [55, 34], '24-10': [57, 32]}
we can see each SplitId is assigned to two actors, which may on different CNs.
The root cause for the bug is that SplitId is not unique in global
btw: after the crash, CN will use up all CPU resources, it is expected?
related #4817
Besides this bug, should we add kind of assertion that ensures the assumption "CNs never write same key with each other in one epoch", so that when similar issues happen, we can narrow down the scope of investigation.
+1. We should assert it somewhere explicitly instead of panicking on usage.
Verified this issue has been fixed by https://github.com/singularity-data/risingwave/pull/4897. Thanks @tabVersion