dscheck
dscheck copied to clipboard
Backtracking optimizations
(Follow-up on https://github.com/sadiqj/dscheck/pull/2 )
Sorry for the big dirty PR! I was inspired to optimize the backtracking and it now terminates in a few seconds on the examples that I care about:
- I started with some simple stuff, like not re-running the execution from scratch unless we really have to (= we are backtracking)
- The backtracking steps happen when two "threads" access the same atomic: I thought it could be more precise, as we don't need to test the alternative ordering if both were doing an
Atomic.get(we only want to backtrack when the atomic usages could yield a different outcome, so a "read" with a "write" etc.) - The backtracking was scheduling the alternate thread for only one operation: In general, this operation is not the one we wanted to permute (just a prerequisite) and so it had to do it multiple times before the real goal was achieved. In the example below, if we wanted to swap
(A3)with(B3)then it would bubble theBoperations one at a time:
AAA(A3)AAB1B2(B3)...
\B1(A3)AAB2(B3)...
\B2(A3)AA(B3)...
\B3(A3)AA...
It now does a "big" backstep by scheduling all the operations to permute in the backtracking:
AAA(A3)AAB1B2(B3)...
\B1B2(B3)(A3)AA...
This wasn't so simple though, as now the atomics and spawns can be created in a different order, changing their globally unique int identifiers. The fix to give them stable ids was to track the creations locally in each thread rather than globally (so rather than "I'm the Xth atomic ever created" we do "I'm the Xth atomic created by thread Yth... which was itself created by thread Zth from the root" with X,Y,Z local counters to each parent threads).
Furthermore, we sometimes want to backstep a thread operation to a point where the thread didn't exist yet... and so we need to also backstep its parent operations that led to our thread creation.
Anyway, I tested it intensively and hopefully I didn't mess up too badly but this was quite involved so there might be some bugs lurking... Let me know if I can do anything to ease the review :)
Thanks for the PR, very cool stuff!
I'm still reviewing it - feel free handle the comments as they flow in or all at once later on. Fwiw, I've run a lot of test traces already and it's holding up perfectly well.
Rebased and fixed some issues, thanks again for the review! :)
However, I then stumbled upon a new test where dscheck should report an issue but doesn't (I'm not sure if the bug was introduced in this PR, as the test doesn't terminate in previous versions). It's related to backtracking when updating an Atomic created by another domain... but I haven't fully debugged it yet.
I then stumbled upon a new test where dscheck should report an issue but doesn't (I'm not sure if the bug was introduced in this PR, as the test doesn't terminate in previous versions)
I have a test that (correctly) fails with main but passes with this PR.
git clone https://github.com/talex5/eio.git --branch dscheck-fast-bug
make dscheck
With main (c74d8a6):
Fatal error: exception File "lib_eio/core/test_cells/cells.ml", line 104, characters 46-52: Assertion failed
With this PR:
Finished after 2428 runs.
However, I'm probably just confused about how this is supposed to be used. I opened #13 with some questions about that.
OK, here's a simpler test-case that detects the bug with main but not with this PR.
module Atomic = Dscheck.TracedAtomic
let test () =
let cancelled = Atomic.make false in
let max_requests = Atomic.make 0 in
Atomic.spawn (fun () ->
Atomic.set cancelled true;
Atomic.decr max_requests;
);
Atomic.spawn (fun () ->
ignore (Atomic.get max_requests);
assert (Atomic.get cancelled) (* This bug should be detected *)
)
let () = Atomic.trace test
With main:
Fatal error: exception File "bin/main.ml", line 12, characters 6-12: Assertion failed
With this PR:
Finished after 2 runs.
Thanks for the two counter examples! I ended up rewriting the backstepping algorithm to be a lot more straightforward (... it's still too tricky for my liking, but the structure makes it harder for it to go wrong by missing a branch)
It's of course slower than before because it doesn't wrongly skip as many traces, but it can still complete the eio testsuite if you're patient enough ^^'