dscheck icon indicating copy to clipboard operation
dscheck copied to clipboard

Backtracking optimizations

Open art-w opened this issue 3 years ago • 5 comments

(Follow-up on https://github.com/sadiqj/dscheck/pull/2 )

Sorry for the big dirty PR! I was inspired to optimize the backtracking and it now terminates in a few seconds on the examples that I care about:

  • I started with some simple stuff, like not re-running the execution from scratch unless we really have to (= we are backtracking)
  • The backtracking steps happen when two "threads" access the same atomic: I thought it could be more precise, as we don't need to test the alternative ordering if both were doing an Atomic.get (we only want to backtrack when the atomic usages could yield a different outcome, so a "read" with a "write" etc.)
  • The backtracking was scheduling the alternate thread for only one operation: In general, this operation is not the one we wanted to permute (just a prerequisite) and so it had to do it multiple times before the real goal was achieved. In the example below, if we wanted to swap (A3) with (B3) then it would bubble the B operations one at a time:
AAA(A3)AAB1B2(B3)...
   \B1(A3)AAB2(B3)...
     \B2(A3)AA(B3)...
       \B3(A3)AA...

It now does a "big" backstep by scheduling all the operations to permute in the backtracking:

AAA(A3)AAB1B2(B3)...
   \B1B2(B3)(A3)AA...

This wasn't so simple though, as now the atomics and spawns can be created in a different order, changing their globally unique int identifiers. The fix to give them stable ids was to track the creations locally in each thread rather than globally (so rather than "I'm the Xth atomic ever created" we do "I'm the Xth atomic created by thread Yth... which was itself created by thread Zth from the root" with X,Y,Z local counters to each parent threads).

Furthermore, we sometimes want to backstep a thread operation to a point where the thread didn't exist yet... and so we need to also backstep its parent operations that led to our thread creation.

Anyway, I tested it intensively and hopefully I didn't mess up too badly but this was quite involved so there might be some bugs lurking... Let me know if I can do anything to ease the review :)

art-w avatar Oct 03 '22 10:10 art-w

Thanks for the PR, very cool stuff!

I'm still reviewing it - feel free handle the comments as they flow in or all at once later on. Fwiw, I've run a lot of test traces already and it's holding up perfectly well.

bartoszmodelski avatar Oct 24 '22 19:10 bartoszmodelski

Rebased and fixed some issues, thanks again for the review! :)

However, I then stumbled upon a new test where dscheck should report an issue but doesn't (I'm not sure if the bug was introduced in this PR, as the test doesn't terminate in previous versions). It's related to backtracking when updating an Atomic created by another domain... but I haven't fully debugged it yet.

art-w avatar Nov 21 '22 12:11 art-w

I then stumbled upon a new test where dscheck should report an issue but doesn't (I'm not sure if the bug was introduced in this PR, as the test doesn't terminate in previous versions)

I have a test that (correctly) fails with main but passes with this PR.

git clone https://github.com/talex5/eio.git --branch dscheck-fast-bug
make dscheck

With main (c74d8a6):

Fatal error: exception File "lib_eio/core/test_cells/cells.ml", line 104, characters 46-52: Assertion failed

With this PR:

Finished after 2428 runs.

However, I'm probably just confused about how this is supposed to be used. I opened #13 with some questions about that.

talex5 avatar Dec 28 '22 11:12 talex5

OK, here's a simpler test-case that detects the bug with main but not with this PR.

module Atomic = Dscheck.TracedAtomic
                  
let test () =
  let cancelled = Atomic.make false in
  let max_requests = Atomic.make 0 in
  Atomic.spawn (fun () ->
      Atomic.set cancelled true;
      Atomic.decr max_requests;
    );
  Atomic.spawn (fun () ->
      ignore (Atomic.get max_requests);
      assert (Atomic.get cancelled)     (* This bug should be detected *)
    )

let () = Atomic.trace test

With main:

Fatal error: exception File "bin/main.ml", line 12, characters 6-12: Assertion failed

With this PR:

Finished after 2 runs.

talex5 avatar Dec 28 '22 19:12 talex5

Thanks for the two counter examples! I ended up rewriting the backstepping algorithm to be a lot more straightforward (... it's still too tricky for my liking, but the structure makes it harder for it to go wrong by missing a branch)

It's of course slower than before because it doesn't wrongly skip as many traces, but it can still complete the eio testsuite if you're patient enough ^^'

art-w avatar Feb 23 '23 01:02 art-w