risingwave icon indicating copy to clipboard operation
risingwave copied to clipboard

bug: compactor block builder panics on assertion failed

Open BugenZhao opened this issue 3 years ago • 5 comments

https://github.com/singularity-data/risingwave/blob/001f5ac22ca46aae2e2df3dc265e4d0fc6c12d81/src/storage/src/hummock/sstable/block.rs#L227-L247

thread 'tokio-runtime-worker' panicked at 'assertion failed: `(left == right)`
  left: `Equal`,
 right: `Less`', src/storage/src/hummock/sstable/block.rs:243:13
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/panicking.rs:181:5
   4: risingwave_storage::hummock::sstable::block::BlockBuilder::add
             at ./src/storage/src/hummock/sstable/block.rs:243:13
   5: risingwave_storage::hummock::sstable::builder::SstableBuilder::add
             at ./src/storage/src/hummock/sstable/builder.rs:130:9
   6: risingwave_storage::hummock::sstable::multi_builder::CapacitySplitTableBuilder<F>::add_full_key::{{closure}}
             at ./src/storage/src/hummock/sstable/multi_builder.rs:121:9
   7: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
   8: risingwave_storage::hummock::compactor::Compactor::compact_and_build_sst::{{closure}}
             at ./src/storage/src/hummock/compactor.rs:1021:92
   9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
  10: risingwave_storage::hummock::compactor::Compactor::compact_key_range_impl::{{closure}}
             at ./src/storage/src/hummock/compactor.rs:716:10
  11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
  12: risingwave_storage::hummock::compactor::Compactor::compact_key_range_with_filter::{{closure}}
             at ./src/storage/src/hummock/compactor.rs:782:74
  13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
  14: risingwave_storage::hummock::compactor::Compactor::compact::{{closure}}::{{closure}}
             at ./src/storage/src/hummock/compactor.rs:556:84
  15: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
  16: risingwave_storage::hummock::compaction_executor::CompactionExecutor::send_request::{{closure}}
             at ./src/storage/src/hummock/compaction_executor.rs:76:27
  17: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
  18: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/future.rs:124:9
  19: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/task/task_local.rs:345:35
  20: tokio::task::task_local::LocalKey<T>::scope_inner
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/task/task_local.rs:231:19
  21: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/task/task_local.rs:342:13
  22: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.35/src/instrument.rs:272:9
  23: tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/core.rs:165:17
  24: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/loom/std/unsafe_cell.rs:14:9
  25: tokio::runtime::task::core::CoreStage<T>::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/core.rs:155:13
  26: tokio::runtime::task::harness::poll_future::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:480:19
  27: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/panic/unwind_safe.rs:271:9
  28: std::panicking::try::do_call
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panicking.rs:492:40
  29: ___rust_try
  30: std::panicking::try
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panicking.rs:456:19
  31: std::panic::catch_unwind
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panic.rs:137:14
  32: tokio::runtime::task::harness::poll_future
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:468:18
  33: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:104:27
  34: tokio::runtime::task::harness::Harness<T,S>::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:57:15
  35: tokio::runtime::task::raw::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/raw.rs:144:5
  36: tokio::runtime::task::raw::RawTask::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/raw.rs:84:18
  37: tokio::runtime::task::LocalNotified<S>::run
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/mod.rs:381:9
  38: tokio::runtime::thread_pool::worker::Context::run_task::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:435:13
  39: tokio::coop::with_budget::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/coop.rs:102:9
  40: std::thread::local::LocalKey<T>::try_with
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/thread/local.rs:445:16
  41: std::thread::local::LocalKey<T>::with
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/thread/local.rs:421:9
  42: tokio::coop::with_budget
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/coop.rs:95:5
  43: tokio::coop::budget
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/coop.rs:72:5
  44: tokio::runtime::thread_pool::worker::Context::run_task
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:434:9
  45: tokio::runtime::thread_pool::worker::Context::run
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:401:24
  46: tokio::runtime::thread_pool::worker::run::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:386:17
  47: tokio::macros::scoped_tls::ScopedKey<T>::set
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/macros/scoped_tls.rs:61:9
  48: tokio::runtime::thread_pool::worker::run
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:383:5
  49: tokio::runtime::thread_pool::worker::Launch::launch::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/thread_pool/worker.rs:362:45
  50: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/blocking/task.rs:42:21
  51: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.35/src/instrument.rs:272:9
  52: tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/core.rs:165:17
  53: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/loom/std/unsafe_cell.rs:14:9
  54: tokio::runtime::task::core::CoreStage<T>::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/core.rs:155:13
  55: tokio::runtime::task::harness::poll_future::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:480:19
  56: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/panic/unwind_safe.rs:271:9
  57: std::panicking::try::do_call
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panicking.rs:492:40
  58: ___rust_try
  59: std::panicking::try
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panicking.rs:456:19
  60: std::panic::catch_unwind
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panic.rs:137:14
  61: tokio::runtime::task::harness::poll_future
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:468:18
  62: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:104:27
  63: tokio::runtime::task::harness::Harness<T,S>::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/harness.rs:57:15
  64: tokio::runtime::task::raw::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/raw.rs:144:5
  65: tokio::runtime::task::raw::RawTask::poll
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/raw.rs:84:18
  66: tokio::runtime::task::UnownedTask<S>::run
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/task/mod.rs:418:9
  67: tokio::runtime::blocking::pool::Task::run
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/blocking/pool.rs:91:9
  68: tokio::runtime::blocking::pool::Inner::run
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/blocking/pool.rs:325:17
  69: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/runtime/blocking/pool.rs:300:13

BugenZhao avatar Aug 04 '22 06:08 BugenZhao

When did this happen? nexmark q7?

hzxa21 avatar Aug 04 '22 07:08 hzxa21

When did this happen? nexmark q7?

Yep. However, it's really occasional. 😢

BugenZhao avatar Aug 04 '22 09:08 BugenZhao

It seems that some key with the same epoch are written to builder by twice?

Little-Wallace avatar Aug 05 '22 03:08 Little-Wallace

I am trying to reproduce this bug. Which commit id did you use? Did this happen on compactor or CN?

hzxa21 avatar Aug 09 '22 14:08 hzxa21

I am trying to reproduce this bug. Which commit id did you use? Did this happen on compactor or CN?

613ed1a7a6faeefb8c6167b92989cb7bf46faa69, 20*3 parallelisms, on the compactor. (The same configuration for reproducing "stuck again")

BugenZhao avatar Aug 09 '22 14:08 BugenZhao

Hi @hzxa21 @BugenZhao , any updates?

fuyufjh avatar Aug 17 '22 06:08 fuyufjh

I cannot reproduce the bug. No updates for now. @zwang28 and I will continue the investigation.

hzxa21 avatar Aug 17 '22 07:08 hzxa21

Besides this bug, should we add kind of assertion that ensures the assumption "CNs never write same key with each other in one epoch", so that when similar issues happen, we can narrow down the scope of investigation.

zwang28 avatar Aug 18 '22 03:08 zwang28

Besides this bug, should we add kind of assertion that ensures the assumption "CNs never write same key with each other in one epoch", so that when similar issues happen, we can narrow down the scope of investigation.

With such sanity check , duplicate key originated from different CNs is spotted in source executor.

compactor-0 | 2022-08-22T06:23:18.554422Z ERROR risingwave_storage::hummock::validator: Duplicate key [74, 0, 0, 3, eb, 32, 34, 2d, 30, ff, f5, c6, 82, 89, de, ff, ff] in SST 62 from worker 3 and SST 42 from worker 1

The hex '32, 34, 2d, 30' in duplicate key is '24-0', that is https://github.com/singularity-data/risingwave/blob/17edbcfe73d7f40cc74713811ec7827c1c40b776/src/connector/src/source/nexmark/split.rs#L31

zwang28 avatar Aug 22 '22 06:08 zwang28

That's cool! cc @xxchan this might be related to the TPC-H Q18 issue. PTAL if you're interested.

Seems to be originated from nexmark connector. No problem.

skyzh avatar Aug 22 '22 06:08 skyzh

It would be better if we could print out the keys for assertion failed. 😢

BugenZhao avatar Aug 22 '22 06:08 BugenZhao

So there is something wrong with the state of Source? cc. @shanicky @tabVersion

fuyufjh avatar Aug 22 '22 06:08 fuyufjh

I think I identified the problem. I add some log here https://github.com/singularity-data/risingwave/blob/8abbb65c4414ac08bef57cae325d93885291568c/src/meta/src/stream/source_manager.rs#L561-L573

and I got the following log

{52: ['24-18', '24-7'], 56: ['24-2', '24-5'], 36: ['24-13', '24-8'], 49: ['24-13', '24-3'], 27: ['24-0', '24-16'], 51: ['24-14', '24-23'], 28: ['24-19', '24-17'], 58: ['24-16', '24-21'], 53: ['24-15', '24-1'], 25: ['24-9', '24-14'], 60: ['24-19', '24-12'], 35: ['24-5', '24-11'], 33: ['24-1', '24-7'], 59: ['24-11', '24-8'], 50: ['24-0', '24-6'], 54: ['24-20', '24-17'], 26: ['24-23', '24-15'], 29: ['24-20', '24-2'], 31: ['24-21', '24-12'], 30: ['24-18', '24-3'], 55: ['24-22', '24-4'], 57: ['24-9', '24-10'], 32: ['24-10', '24-6'], 34: ['24-22', '24-4']}
==== build reverse index ====
{'24-18': [52, 30], '24-7': [52, 33], '24-2': [56, 29], '24-5': [56, 35], '24-13': [36, 49], '24-8': [36, 59], '24-3': [49, 30], '24-0': [27, 50], '24-16': [27, 58], '24-14': [51, 25], '24-23': [51, 26], '24-19': [28, 60], '24-17': [28, 54], '24-21': [58, 31], '24-15': [53, 26], '24-1': [53, 33], '24-9': [25, 57], '24-12': [60, 31], '24-11': [35, 59], '24-6': [50, 32], '24-20': [54, 29], '24-22': [55, 34], '24-4': [55, 34], '24-10': [57, 32]}

we can see each SplitId is assigned to two actors, which may on different CNs.

The root cause for the bug is that SplitId is not unique in global

btw: after the crash, CN will use up all CPU resources, it is expected?

tabVersion avatar Aug 23 '22 03:08 tabVersion

related #4817

tabVersion avatar Aug 23 '22 07:08 tabVersion

Besides this bug, should we add kind of assertion that ensures the assumption "CNs never write same key with each other in one epoch", so that when similar issues happen, we can narrow down the scope of investigation.

+1. We should assert it somewhere explicitly instead of panicking on usage.

BugenZhao avatar Aug 23 '22 10:08 BugenZhao

Verified this issue has been fixed by https://github.com/singularity-data/risingwave/pull/4897. Thanks @tabVersion

zwang28 avatar Aug 26 '22 05:08 zwang28