turso icon indicating copy to clipboard operation
turso copied to clipboard

Simulator CPU utilization at 100% but nothing happens

Open penberg opened this issue 1 year ago • 11 comments

I am seeing the simulator hit 100% CPU utilization but with nothing happening at all:

$ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.37s
     Running `/home/penberg/src/penberg/limbo/target/debug/limbo_sim`
^C

I don't see this behavior on macOS.

penberg avatar Dec 25 '24 09:12 penberg

The perf top shows we're apparently in step() function:

  16.77%  limbo_sim                                       [.] limbo_core::vdbe::Program::step
   6.84%  [kernel]                                        [k] io_idle
   3.35%  [kernel]                                        [k] __lock_acquire
   2.91%  limbo_sim                                       [.] limbo_core::storage::pager::Pager::cacheflush
   2.45%  limbo_sim                                       [.] <limbo_core::storage::wal::WalFile as limbo_core::storage::wal::Wal>::sync
   2.43%  [kernel]                                        [k] lock_is_held_type
   2.37%  [kernel]                                        [k] lock_release
   2.30%  [kernel]                                        [k] unwind_next_frame
   2.28%  limbo_sim                                       [.] limbo_core::Statement::step
   1.68%  [kernel]                                        [k] kasan_check_range
   1.57%  [kernel]                                        [k] rcu_is_watching

penberg avatar Dec 25 '24 09:12 penberg

RUST_LOG=trace claims we're busy looping in Halt:

[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] 15    Halt               0     0     0                    0   
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] Halt auto_commit true
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] 15    Halt               0     0     0                    0   
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] Halt auto_commit true
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] 15    Halt               0     0     0                    0   
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] Halt auto_commit true
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] 15    Halt               0     0     0                    0   
[2024-12-25T10:00:19Z TRACE limbo_core::vdbe] Halt auto_commit true

penberg avatar Dec 25 '24 10:12 penberg

Something funny going on:

This passes:

$ RUST_LOG=trace cargo run -- --seed=13207426936588942312 --maximum-size=5

This panics:

$ RUST_LOG=trace cargo run -- --seed=13207426936588942312 --maximum-size=6
[2024-12-25T10:04:36Z ERROR limbo_sim] panic occurred                                                                                                       ?????????????d                                                                                                                                                  
[2024-12-25T10:04:36Z ERROR limbo_sim] cannot sample empty range                                                                                            ????????????????                                                                                                                                                
[2024-12-25T10:04:36Z ERROR limbo_sim] captured backtrace:                                                                                                  ???????????????                                                                                                                                                 
       0: limbo_sim::main::{{closure}}                                                                                                                      ??????????                                                                                                                                              
                 at ./main.rs:58:18                                                                                                                         ?????????                                                                                                                                   
       1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call                                                                                   ??????????????????l                                                                                                                                             
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/boxed.rs:2468:9                                                       ?????????????????????????9                                                                                                                                      
       2: std::panicking::rust_panic_with_hook                                                                                                              ????????????                                                                                                                                                    
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:809:13                                                     ??????????????????????????                                                                                                                                      
       3: std::panicking::begin_panic_handler::{{closure}}                                                                                                  ???????????????                                                                                                                                                 
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:667:13                                                     ??????????????????????????                                                                                                                                      
       4: std::sys::backtrace::__rust_end_short_backtrace                                                                                                   ??????????????e                                                                                                                                                 
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/sys/backtrace.rs:170:18                                                 ???????????????????????????                                                                                                                                     
       5: rust_begin_unwind                                                                                                                                 ???????                                                                                                     
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:665:5                                                      ??????????????????????????                                                                                                                                      
       6: core::panicking::panic_fmt                                                                                                                        ?????????                                                                                                                                       
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/panicking.rs:74:14                                                     ??????????????????????????                                                                                                                                      
       7: core::panicking::panic
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/panicking.rs:148:5
       8: rand::rng::Rng::gen_range
                 at /home/penberg/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rand-0.8.5/src/rng.rs:134:9
       9: <limbo_sim::generation::plan::InteractionPlan as limbo_sim::generation::ArbitraryFrom<limbo_sim::runner::env::SimulatorEnv>>::arbitrary_from
                 at ./generation/plan.rs:173:32
      10: limbo_sim::run_simulation::{{closure}}
                 at ./main.rs:183:18
      11: core::iter::adapters::map::map_fold::{{closure}}
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/adapters/map.rs:88:28
      12: core::ops::try_trait::NeverShortCircuit<T>::wrap_mut_2::{{closure}}
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/ops/try_trait.rs:392:39
      13: <core::ops::range::RangeInclusive<T> as core::iter::range::RangeInclusiveIteratorImpl>::spec_try_fold
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/range.rs:1189:21
      14: core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::RangeInclusive<A>>::try_fold
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/range.rs:1308:9
      15: core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::RangeInclusive<A>>::fold
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/mod.rs:378:18
      16: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/adapters/map.rs:128:9
      17: core::iter::traits::iterator::Iterator::for_each
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/traits/iterator.rs:813:9
      18: alloc::vec::Vec<T,A>::extend_trusted
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/vec/mod.rs:3524:17
      19: <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/vec/spec_extend.rs:26:9
      20: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/vec/spec_from_iter_nested.rs:60:9
      21: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/vec/spec_from_iter.rs:33:9
      22: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/alloc/src/vec/mod.rs:3388:9
      23: core::iter::traits::iterator::Iterator::collect
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/iter/traits/iterator.rs:2001:9
      24: limbo_sim::run_simulation
                 at ./main.rs:182:21
      25: limbo_sim::main::{{closure}}
                 at ./main.rs:62:46
      26: std::panicking::try::do_call
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:557:40
      27: __rust_try
      28: std::panicking::try
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:520:19
      29: std::panic::catch_unwind
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panic.rs:358:14
      30: limbo_sim::main
                 at ./main.rs:62:18
      31: core::ops::function::FnOnce::call_once
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/ops/function.rs:250:5
      32: std::sys::backtrace::__rust_begin_short_backtrace
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/sys/backtrace.rs:154:18
      33: std::rt::lang_start::{{closure}}
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/rt.rs:195:18
      34: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/core/src/ops/function.rs:284:13
      35: std::panicking::try::do_call
                 at /rustc/90b35a6239c3d8bdabc530a6a0816f7ff89a0aaf/library/std/src/panicking.rs:557:40

This hits the infinite loop:

$ RUST_LOG=trace cargo run -- --seed=13207426936588942312 --maximum-size=9

penberg avatar Dec 25 '24 10:12 penberg

Do you think the behavior is a bug in the simulator, or if the simulator discovered a bug in Limbo?

alpaylan avatar Dec 25 '24 15:12 alpaylan

@alpaylan I am not sure. @jussisaurio commented that https://github.com/tursodatabase/limbo/pull/533 fixes some issues that sound similar to what we're seeing here

penberg avatar Dec 25 '24 17:12 penberg

As I describe in #533 inserting more values than a table has results in an infinite loop. Initially I didn't check why, but now I did -- see addr 3:

limbo> create table b(b integer primary key);
limbo> explain insert into b values(1,2);
addr  opcode             p1    p2    p3    p4             p5  comment
----  -----------------  ----  ----  ----  -------------  --  -------
0     Init               0     21    0                    0   Start at 21
1     InitCoroutine      3     6     2                    0   
2     Integer            1     2     0                    0   r[2]=1
3     Integer            2     3     0                    0   r[3]=2 <-- register allocated for the extra column overwrites the initcoroutine register, messing up the jump target pc
4     Yield              3     0     0                    0   
5     EndCoroutine       3     0     0                    0   
6     OpenWriteAsync     0     2     0                    0   
7     OpenWriteAwait     0     0     0                    0   
8     Yield              3     20    0                    0   
9     Copy               2     1     0                    0   r[1]=r[2]
10    SoftNull           2     0     0                    0   
11    NotNull            1     13    0                    0   r[1]!=NULL -> goto 13
12    NewRowId           0     1     0                    0   
13    MustBeInt          1     0     0                    0   
14    NotExists          0     16    1                    0   
15    Halt               1555  0     0                    0   
16    MakeRecord         2     1     4                    0   r[4]=mkrec(r[2..2])
17    InsertAsync        0     4     1                    0   
18    InsertAwait        0     0     0                    0   
19    Goto               0     8     0                    0   
20    Halt               0     0     0                    0   
21    Transaction        0     1     0                    0   
22    Goto               0     1     0                    0 

this might still be an entirely separate issue (this one oscillates between two instructions forever)

jussisaurio avatar Dec 25 '24 21:12 jussisaurio

With 548f66e1cde48b009f79e70de07dce3a2d0e4872 and cf1a3fb3e1b7ef03bfa24f6035fdc6feaa901237, the infinite loop now always reproduces on Linux.

penberg avatar Dec 27 '24 08:12 penberg

Once I'm done with https://github.com/tursodatabase/limbo/pull/554, I can implement;

  • maximum time per interaction
  • maximum time for the whole simulation

so we can timeout at certain points, debug the output and understand the reason for the infinite loop bugs. The next big thing would be adding minimizing counterexamples, which requires a bit of thinking and architectural changes, but shouldn't be hard to add. That would give us the minimal sequence of interactions that cause a given bug.

alpaylan avatar Dec 27 '24 17:12 alpaylan

since this is a linux specific issue, it must be something related to flushing to disk on linux, right?

    pub fn end_tx(&self) -> Result<CheckpointStatus> {
        match self.cacheflush()? {
            CheckpointStatus::Done => {}
            CheckpointStatus::IO => return Ok(CheckpointStatus::IO),
        };
        self.wal.borrow().end_read_tx()?;
        Ok(CheckpointStatus::Done)
    }

this essentially fsyncs the wal, then checkpoints and fsyncs the db file. probably adding some logs in relevant places and running the simulation on linux would reveal what stage of self.cacheflush() it's actually stuck at

jussisaurio avatar Jan 01 '25 06:01 jussisaurio

Can we make something that will produce log differentials for such cases?

Basically something that runs the code in both environments, collects logs, points out where and how they differ, so that we pinpoint the root causes.

alpaylan avatar Jan 01 '25 07:01 alpaylan

@penberg I added the timeout for the whole simulator by aggregating time between interactions, but it would still fall into an infinite loop if a specific query runs forever. I'm not sure if it's possible to timeout sync code, any ideas?

alpaylan avatar Jan 06 '25 12:01 alpaylan

This no longer reproduces so closing.

penberg avatar Jan 14 '25 15:01 penberg