Simulator CPU utilization at 100% but nothing happens
I am seeing the simulator hit 100% CPU utilization but with nothing happening at all:
$ cargo run
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.37s
Running `/home/penberg/src/penberg/limbo/target/debug/limbo_sim`
^C
I don't see this behavior on macOS.
The perf top shows we're apparently in step() function:
16.77% limbo_sim [.] limbo_core::vdbe::Program::step
6.84% [kernel] [k] io_idle
3.35% [kernel] [k] __lock_acquire
2.91% limbo_sim [.] limbo_core::storage::pager::Pager::cacheflush
2.45% limbo_sim [.] <limbo_core::storage::wal::WalFile as limbo_core::storage::wal::Wal>::sync
2.43% [kernel] [k] lock_is_held_type
2.37% [kernel] [k] lock_release
2.30% [kernel] [k] unwind_next_frame
2.28% limbo_sim [.] limbo_core::Statement::step
1.68% [kernel] [k] kasan_check_range
1.57% [kernel] [k] rcu_is_watching
RUST_LOG=trace claims we're busy looping in Halt:
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] 15 Halt 0 0 0 0
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] Halt auto_commit true
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] 15 Halt 0 0 0 0
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] Halt auto_commit true
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] 15 Halt 0 0 0 0
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] Halt auto_commit true
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] 15 Halt 0 0 0 0
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] Halt auto_commit true
Something funny going on:
This passes:
$ RUST_LOG=trace cargo run -- --seed=13207426936588942312 --maximum-size=5
This panics:
$ RUST_LOG=trace cargo run -- --seed=13207426936588942312 --maximum-size=6
[2024-12-25T10:04:36Z ERROR limbo_sim] panic occurred ?????????????d
[2024-12-25T10:04:36Z ERROR limbo_sim] cannot sample empty range ????????????????
[2024-12-25T10:04:36Z ERROR limbo_sim] captured backtrace: ???????????????
0: limbo_sim::main::{{closure}} ??????????
at ./main.rs:58:18 ?????????
1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call ??????????????????l
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/boxed.rs:2468:9 ?????????????????????????9
2: std::panicking::rust_panic_with_hook ????????????
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:809:13 ??????????????????????????
3: std::panicking::begin_panic_handler::{{closure}} ???????????????
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:667:13 ??????????????????????????
4: std::sys::backtrace::__rust_end_short_backtrace ??????????????e
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/sys/backtrace.rs:170:18 ???????????????????????????
5: rust_begin_unwind ???????
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:665:5 ??????????????????????????
6: core::panicking::panic_fmt ?????????
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/panicking.rs:74:14 ??????????????????????????
7: core::panicking::panic
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/panicking.rs:148:5
8: rand::rng::Rng::gen_range
at /home/penberg/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rand-0.8.5/src/rng.rs:134:9
9: <limbo_sim::generation::plan::InteractionPlan as limbo_sim::generation::ArbitraryFrom<limbo_sim::runner::env::SimulatorEnv>>::arbitrary_from
at ./generation/plan.rs:173:32
10: limbo_sim::run_simulation::{{closure}}
at ./main.rs:183:18
11: core::iter::adapters::map::map_fold::{{closure}}
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/adapters/map.rs:88:28
12: core::ops::try_trait::NeverShortCircuit<T>::wrap_mut_2::{{closure}}
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/ops/try_trait.rs:392:39
13: <core::ops::range::RangeInclusive<T> as core::iter::range::RangeInclusiveIteratorImpl>::spec_try_fold
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/range.rs:1189:21
14: core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::RangeInclusive<A>>::try_fold
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/range.rs:1308:9
15: core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::RangeInclusive<A>>::fold
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/mod.rs:378:18
16: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/adapters/map.rs:128:9
17: core::iter::traits::iterator::Iterator::for_each
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/traits/iterator.rs:813:9
18: alloc::vec::Vec<T,A>::extend_trusted
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/vec/mod.rs:3524:17
19: <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/vec/spec_extend.rs:26:9
20: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/vec/spec_from_iter_nested.rs:60:9
21: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/vec/spec_from_iter.rs:33:9
22: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/vec/mod.rs:3388:9
23: core::iter::traits::iterator::Iterator::collect
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/traits/iterator.rs:2001:9
24: limbo_sim::run_simulation
at ./main.rs:182:21
25: limbo_sim::main::{{closure}}
at ./main.rs:62:46
26: std::panicking::try::do_call
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:557:40
27: __rust_try
28: std::panicking::try
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:520:19
29: std::panic::catch_unwind
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panic.rs:358:14
30: limbo_sim::main
at ./main.rs:62:18
31: core::ops::function::FnOnce::call_once
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/ops/function.rs:250:5
32: std::sys::backtrace::__rust_begin_short_backtrace
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/sys/backtrace.rs:154:18
33: std::rt::lang_start::{{closure}}
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/rt.rs:195:18
34: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/ops/function.rs:284:13
35: std::panicking::try::do_call
at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:557:40
This hits the infinite loop:
$ RUST_LOG=trace cargo run -- --seed=13207426936588942312 --maximum-size=9
Do you think the behavior is a bug in the simulator, or if the simulator discovered a bug in Limbo?
@alpaylan I am not sure. @jussisaurio commented that https://github.com/tursodatabase/limbo/pull/533 fixes some issues that sound similar to what we're seeing here
As I describe in #533 inserting more values than a table has results in an infinite loop. Initially I didn't check why, but now I did -- see addr 3:
limbo> create table b(b integer primary key);
limbo> explain insert into b values(1,2);
addr opcode p1 p2 p3 p4 p5 comment
---- ----------------- ---- ---- ---- ------------- -- -------
0 Init 0 21 0 0 Start at 21
1 InitCoroutine 3 6 2 0
2 Integer 1 2 0 0 r[2]=1
3 Integer 2 3 0 0 r[3]=2 <-- register allocated for the extra column overwrites the initcoroutine register, messing up the jump target pc
4 Yield 3 0 0 0
5 EndCoroutine 3 0 0 0
6 OpenWriteAsync 0 2 0 0
7 OpenWriteAwait 0 0 0 0
8 Yield 3 20 0 0
9 Copy 2 1 0 0 r[1]=r[2]
10 SoftNull 2 0 0 0
11 NotNull 1 13 0 0 r[1]!=NULL -> goto 13
12 NewRowId 0 1 0 0
13 MustBeInt 1 0 0 0
14 NotExists 0 16 1 0
15 Halt 1555 0 0 0
16 MakeRecord 2 1 4 0 r[4]=mkrec(r[2..2])
17 InsertAsync 0 4 1 0
18 InsertAwait 0 0 0 0
19 Goto 0 8 0 0
20 Halt 0 0 0 0
21 Transaction 0 1 0 0
22 Goto 0 1 0 0
this might still be an entirely separate issue (this one oscillates between two instructions forever)
With 548f66e1cde48b009f79e70de07dce3a2d0e4872 and cf1a3fb3e1b7ef03bfa24f6035fdc6feaa901237, the infinite loop now always reproduces on Linux.
Once I'm done with https://github.com/tursodatabase/limbo/pull/554, I can implement;
- maximum time per interaction
- maximum time for the whole simulation
so we can timeout at certain points, debug the output and understand the reason for the infinite loop bugs. The next big thing would be adding minimizing counterexamples, which requires a bit of thinking and architectural changes, but shouldn't be hard to add. That would give us the minimal sequence of interactions that cause a given bug.
since this is a linux specific issue, it must be something related to flushing to disk on linux, right?
pub fn end_tx(&self) -> Result<CheckpointStatus> {
match self.cacheflush()? {
CheckpointStatus::Done => {}
CheckpointStatus::IO => return Ok(CheckpointStatus::IO),
};
self.wal.borrow().end_read_tx()?;
Ok(CheckpointStatus::Done)
}
this essentially fsyncs the wal, then checkpoints and fsyncs the db file. probably adding some logs in relevant places and running the simulation on linux would reveal what stage of self.cacheflush() it's actually stuck at
Can we make something that will produce log differentials for such cases?
Basically something that runs the code in both environments, collects logs, points out where and how they differ, so that we pinpoint the root causes.
@penberg I added the timeout for the whole simulator by aggregating time between interactions, but it would still fall into an infinite loop if a specific query runs forever. I'm not sure if it's possible to timeout sync code, any ideas?
This no longer reproduces so closing.