quickcheck-dynamic
First draft of parallel state machines
Implements `runParallelActions`, which will spawn multiple threads to test parallel execution in a style similar to `eqc_statem`.
Generation of parallel actions
The way I'm generating parallel actions is somewhat naive. I generate a list of normal Actions and then I split the list into a sequential prefix and some (currently 2) parallel-safe suffixes.
I check that every possible interleaving of actions in those parallel suffixes preserves every precondition.
However, `Actions_` has two things I don't see how to fit here:
- The rejected actions: what should I put there in the parallel suffixes?
- The number for the `Smart` constructor: is 0 okay?
Also if generation fails, I just try again.
I also currently don't have a shrinker :(
Combination of environments
When I'm generating all possible interleavings of executions, I can for example face this parallel sequence with the noted results after execution:
```
          /- [D,E,F] --- returns ([vA,vB,vC,vD,vE,vF], [rD,rE,rF])
[A,B,C] -|
          \- [G,H,I] --- returns ([vA,vB,vC,vG,vH,vI], [rG,rH,rI])
```
For now I'm concat'ing all the environments because I assume the LookUp can deal with duplicates? They will still be uniquely identified.
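To make the assumption concrete, here is a minimal sketch of why concatenating environments with duplicates is harmless when variables are uniquely numbered. The types and names (`lookUpIn`, `combinedEnv`) are illustrative, not the library's actual representation:

```haskell
-- Illustrative types only: variables are uniquely numbered, so lookup is
-- keyed on the variable, and duplicated entries coming from the shared
-- sequential prefix simply resolve to the same value.
type Var = Int

lookUpIn :: [(Var, v)] -> Var -> Maybe v
lookUpIn env v = lookup v env

-- Environments returned by the two branches share the prefix bindings.
combinedEnv :: [(Var, String)]
combinedEnv =
  [(1, "vA"), (2, "vB"), (3, "vC"), (4, "vD")] -- from the [D,E,F] branch
    ++ [(1, "vA"), (2, "vB"), (3, "vC"), (7, "vG")] -- from the [G,H,I] branch
```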
perform can't have access to the state
In parallel execution, `perform` will not be able to access any state because we don't advance the state at the same time we run the actions, so I had to add `perform'` which doesn't have that first parameter.
Checking all interleavings kills the machine
If the suffixes are not small, there are simply too many possible interleavings; the number of interleavings grows very fast.
```
ghci> map (\x -> let a = [1 :: Int ..x] in length $ interleaves [a, a]) [1..10]
[2,6,20,70,252,924,3432,12870,48620,184756]
```
Perhaps we should limit the number of parallel actions to a reasonable small number?
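For reference, the counts above are the central binomial coefficients. The actual definition of `interleaves` is not shown in this thread, but a sketch of such a function makes the explosion easy to see:

```haskell
-- A possible implementation of an `interleaves` function (a sketch, assumed
-- rather than taken from the PR): all ways to merge the given sequences while
-- preserving the relative order within each sequence.
interleaves :: [[a]] -> [[a]]
interleaves xss
  | all null xss = [[]]
  | otherwise =
      concat
        [ map (x :) (interleaves (before ++ xs : after))
        | (before, (x : xs) : after) <- splits xss -- skips exhausted sequences
        ]
  where
    -- every way to focus on one of the sequences
    splits ys = [(take i ys, drop i ys) | i <- [0 .. length ys - 1]]
```

For two sequences of length n this produces C(2n, n) interleavings, matching the ghci output above.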
2 or more threads?
eqc_statem generates a list of lists of actions, where each list of actions will be in a separate process. Should we support this too or is 2 threads enough? It will also worsen the problem above.
```
ghci> map (\x -> let a = [1 :: Int ..x] in length $ interleaves [a, a, a]) [1..10]
[6,90,1680,34650,756756,17153136,killed
```
@jasagredo I have started having a look and playing with your PR, which led me to investigate how this feature was implemented in q-s-m.
- As written, the parallel test does not terminate because of the combinatorial explosion of interleavings. Empirically I have found the limit to be around 12-13, and managed to make the test pass by using a `sized` generator and reducing the length of the generated actions sequence to the square root of the size parameter
- In q-s-m, this interleaving computation is limited to sequences of at most 5 actions, but applied on successive segments of the initial sequence, which makes it possible to test longer sequences
Perhaps a good first approach would be to try to port the q-s-m code?
A comment on the number of parallel threads: some of the most interesting bugs I've found have required three threads--see https://dl.acm.org/doi/abs/10.1145/2034654.2034667 for example. So I think it would be a real shame to restrict parallel tests to two threads. I can imagine adding a Fork action (with a list of actions as a parameter) to express an arbitrary number of threads. If Fork returns a thread ID, one could even imagine Join ThreadId actions that wait for a forked thread to terminate. The nice thing about adding Fork and Join actions is that they could be used in DL tests too: imagine a DL test of the form anyActions; Fork anyActions; <some DL sequence>, for example, that would test that the DL sequence worked in any reachable state, with anything going on in parallel. There is a potential problem with a combinatorial explosion of interleavings, but that can be addressed during generation just by not generating test cases with too many interleavings--the number of interleavings is easily (and cheaply) predictable from a test case, so one can just stop generating before the number gets too big (e.g. 100 or 1000). e.g. Fork [a,b]; Fork [c,d]; e; f has (6 choose 2) * (4 choose 2) interleavings, i.e. 15*6 or 90. I don't think Fork poses any real problems here, although I haven't figured out how to handle Join in detail.
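The count in the example above can indeed be computed cheaply before running anything: the number of interleavings of threads of given lengths is a multinomial coefficient. A sketch (`interleavingCount` is an illustrative helper, not an existing API):

```haskell
-- Number of interleavings of threads with the given lengths: the multinomial
-- coefficient (sum ls)! / product (map (!) ls). Cheap to compute, so a
-- generator could reject candidate test cases whose count exceeds the chosen
-- bound (e.g. 100 or 1000) instead of enumerating interleavings.
interleavingCount :: [Integer] -> Integer
interleavingCount lens =
  factorial (sum lens) `div` product (map factorial lens)
  where
    factorial n = product [1 .. n]
```

For `Fork [a,b]; Fork [c,d]; e; f` the thread lengths are 2, 2 and 2, giving 6!/(2!·2!·2!) = 90, matching the figure above.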
Shrinking is really important, of course, and there are useful shrinking steps for parallel tests (in addition to shrinking or just dropping actions) that reduce parallelism, making failed tests easier to debug. One: move the first action out of a Fork. Two: move a Fork later (past the next non-Fork action, to ensure we reduce the set of possible interleavings, which is necessary to avoid shrinking loops).
There's a problem as is with the type of postconditions (`PostconditionM`). In this version, postconditions can interact with the system under test, and indeed, this is used in several places in IOG's code to read from the system state in the postcondition. But in parallel testing, postconditions cannot be evaluated while the test is run; instead one must explore many possible interleavings after the test has finished, which means each postcondition may run many times, and in the final state of the test, not in the state after the action it applies to. If model authors expect the postcondition to read from the state just after that action, then this will lead tests to fail--in ways which developers find very difficult to understand. Parallel testing really needs pure boolean postconditions--any interaction with the system under test is likely to be broken, and even monitoring information will differ for each interleaving, making counterExample in particular almost useless (which failed interleaving should you be given counterexample information about, when they all failed?).
To make this work I think one needs a separate RunModel' class whose postconditions have a simpler type, which could be used for both sequential and parallel testing, while the existing RunModel class can be used only for sequential testing. (If it's used for parallel testing, then a trap is laid for developers who talk to the SUT in postconditions, and it would be nice to exclude that possibility using types). Working with such a class does require modelling more of the SUT... in order to write the postconditions, one needs to model the information in the SUT that was previously read from it, so that it is available in the postcondition when it is needed. So there's a risk users might continue to use the current RunModel class, because it simplifies modelling, but that would make parallel testing impossible. It's a bit Catch 22--what makes parallel testing so nice in e.g. eqc_statem is that it is so easy to turn it on, using the same model used for sequential testing; if one has to enrich one's model and rewrite one's postconditions, then people are much less likely to use it.
@rjmh Do I understand correctly that in Erlang one defines postconditions in the "model" (StateModel in our case) and therefore it's pretty straightforward to run in both sequential and parallel settings?
Interestingly, in the work done by @MaximilianAlgehed and @UlfNorell on Peras, they introduced a `step : State -> Action -> Maybe (State, Result)` function on the Agda side that's mapped to the `nextState`, `precondition`, and `postcondition` functions very straightforwardly, assuming the postcondition to be simply equality between expected and actual results.
It seems to me it would not be too hard to define a `PostCondition a` language, much like the typical `assertEquals`, `assertContains` from example-based testing, that would make it possible to move the postcondition back to the `StateModel`.
The issue with pure postconditions is that you either end up:
- Making the result type of actions way too complex,
- Introducing ancillary actions with complex preconditions that force a bunch of annoying book-keeping in the model state, or
- Giving up on checking things that you should have checked.
I don't remember the specific case that forced us to make postconditions monadic, but it was the least bad solution to avoiding the issues above at the time.
In the Java world I used to work in, there was a cottage industry of "assertion libraries" (e.g. hamcrest) which, when they came out, improved the legibility of our unit tests. Some Haskell test frameworks (e.g. hspec) have specific functions for testing expectations.
I may be wrong but it seems something like the following would work:
```haskell
class StateModel m where
  data Action m a = ...
  postcondition :: m -> Action m a -> PostCondition a
```
where (roughly)
```haskell
data Postcondition a =
  Eq a => Equal a | Contains [a] | Ord a => Compare a | ...
```
would define a pure language of assertions.
Then the framework would be responsible for checking the postconditions from actual results.
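As a sketch of how that checking could look, using GADT syntax so that the per-constructor constraints are well-formed (constructor names and their semantics here are assumptions for illustration, not an existing API):

```haskell
{-# LANGUAGE GADTs #-}

-- A pure assertion language the framework can evaluate against actual
-- results, without touching the system under test.
data Postcondition a where
  Equal     :: Eq a => a -> Postcondition a    -- actual equals expected
  Contains  :: Eq a => [a] -> Postcondition a  -- actual is one of these
  Satisfies :: (a -> Bool) -> Postcondition a  -- escape hatch for arbitrary checks

-- The test driver checks a postcondition against the actual result,
-- and can also render a counterexample from the constructor itself.
check :: Postcondition a -> a -> Bool
check (Equal expected)   actual = actual == expected
check (Contains allowed) actual = actual `elem` allowed
check (Satisfies p)      actual = p actual
```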
@abailly-iohk It seems you are almost talking about https://hackage.haskell.org/package/quickcheck-state-machine-0.9.0/docs/Test-StateMachine-Logic.html, right?
I wasn't aware of this module, seems like there's more good things to port from q-s-m than just the parallel testing stuff :)
Max wrote:
> The issue with pure postconditions is that you either end up:
> - Making the result type of actions way too complex,
> - Introducing ancillary actions with complex preconditions that force a bunch of annoying book-keeping in the model state, or
> - Giving up on checking things that you should have checked.
I think the issue is that when you want to write postconditions that relate the result of an action to the system state, you can either make your model detailed enough to predict that state (that's Max's annoying book-keeping), or read it directly from the system state--which you can either do in a monadic postcondition, or via a special action (Max's ancillary actions), or via reading them in the same call of perform, which means returning extra information in the action result type. And the general problem with reading from the system state, however you do it, is that those reads are hard to do atomically with the action they apply to. In parallel testing, the postconditions are checked after the entire test runs, so they simply read the wrong system state, while if the reads are done in a custom action or in perform, they will not be done atomically with the underlying action, and thus risk reading the wrong system state. One might perhaps wrap a lock around the underlying action and the associated reads inside perform to make everything atomic, but that defeats the purpose of parallel testing, which is to test that the action behaves atomically without additional synchronization.
So basically, parallel testing is incompatible with reading extra information from the system state to write postconditions, however that is done. The only way to make parallel testing work is to enrich the model to predict (enough of) that information. Sometimes this will mean complicating the model considerably. And for sequential testing, it's not necessary.
There is a lot to be said for making the shift from sequential to parallel testing easy, though--partly, it makes it much more likely that people will do it, than if they have to rewrite all their postconditions with a different type. Secondly, it allows the model to be debugged using sequential tests, which is far simpler and quicker than debugging using parallel tests.
Here's an idea: one might allow monadic postconditions, but interpret any postcondition that actually invokes an underlying monadic operation as 'True', in a parallel test. That is, one would automatically remove those checks during parallel testing. One would want a compositional way of doing this, so that a postcondition might mix purely-functional checks (which can still fail) with monadic checks (which are just assumed to pass). It would be desirable to generate some kind of warning when this happens (e.g. X% of postconditions could not be checked). But it would allow a sequential model to be used immediately for parallel testing, which might already reveal many bugs even with the weaker postconditions, while supporting a gradual enrichment of the model (to reduce the "postcondition could not be checked" percentage).
What doesn't work well is to allow monadic postconditions and just check them at the wrong time... that leads to very hard to debug false positives, and there's no better way to put people off parallel testing.
@rjmh Thanks for the illuminating comment. Reading this
> one might allow monadic postconditions, but interpret any postcondition that actually invokes an underlying monadic operation as 'True', in a parallel test.
seems to me advocating for a dedicated language to express conditions, where one of the constructors would allow monadic actions to occur, e.g. much like the `Rose` type in QC which can hold pure values `a` or some monadic way of returning a `Rose a`? From my previous example:
```haskell
data Postcondition m a = AssertEq a | ... | PostMonadic (m (Postcondition m a))
```
> seems to me advocating for a dedicated language to express conditions, where one of the constructors would allow monadic actions to occur
>
> `data Postcondition m a = AssertEq a | ... | PostMonadic (m (Postcondition m a))`
Not exactly. I think there's a lot of value in allowing arbitrary boolean expressions in postconditions... otherwise there will be a constant pressure to make the postcondition language richer. But I can imagine
```haskell
data Postcondition m a
  = PostBool Bool
  | PostMonitor (Property -> Property) (Postcondition m a)
  | PostMonadic (m (Postcondition m a))
```
where the Applicative instance would delay `PostMonadic` until after all the boolean checks and monitors. One could invoke the `PostMonadic` constructor via some off-putting name, such as 'inSequentialTests', thus making it more obvious to the developer that there is a cost in using such things.
It's worth thinking about how to do the monitoring too. In failing parallel tests, after shrinking, the parallel parts are usually quite small and so there may be rather few interleavings. It may also be that the interleavings that fail last are the only interesting ones--ones that fail earlier can be considered as failing "because the interleaver guessed wrong". So potentially one might try and report each maximal failing interleaving separately, with counterexample information for each one.
Collecting statistics is a bit more difficult, since there may be many successful interleavings and we have no reason to prefer statistics from one over those from another. It feels wrong to collect statistics from all successful interleavings, because that would give some tests exponentially more weight than others in the overall statistics. Also it would be expensive... when many interleavings succeed, then there may be a big saving in only checking the first one.
I do agree there's value in allowing arbitrary boolean expressions in postconditions, but I also think there's value in providing some kind of DSL for the common cases, as this alleviates the need for the user to write both the assertion/predicate and the counterexample explaining the predicate for the usual cases of (in)equality testing of comparable values: both can be handled directly by the test driver.
Right, like (===) in the Property type. But that doesn't require a special constructor in the Postcondition type, just a library of functions for expressing checks-with-error-messages. For example, there is no special representation for (===) properties, but you get the behaviour you want anyway.
I agree this could be done as functions. With a dedicated language, you get some more possibilities like more compact representations for combinations. That's perhaps overkill 🤷
To summarise the above, there seem to be a couple of issues with the current definition of `RunModel`:

- The type of `perform` is:

  ```haskell
  perform :: Typeable a => state -> Action state a -> LookUp -> m (PerformResult (Error state) a)
  ```

  This prevents using this function in parallel testing, as the state is not evolved during the execution. This function should not have access to the state:

  ```haskell
  perform :: Typeable a => Action state a -> LookUp -> m (PerformResult (Error state) a)
  ```

- Furthermore, I'm dubious about the `LookUp` argument. It is unclear how results from previous commands would be of importance in the currently performed action and not already be tracked by the SUT. In `eqc_statem`, `COMMAND_args` generates symbolic arguments during test generation, but those symbolic arguments are converted to dynamic ones on test execution, therefore calling `COMMAND/N` with real arguments. I would actually argue that the type of `perform` should be:

  ```haskell
  perform :: Typeable a => Action state a -> m (PerformResult (Error state) a)
  ```

  But that would require replacing any `Var a` in `Action state a` with the actual value. Perhaps `Action state a` should get another type parameter along the lines of:

  ```haskell
  data Nature = Symbolic | Dynamic -- Nature is probably not the right term

  data ActionVar n a where
    SymbolicVar :: Var a -> ActionVar Symbolic a
    DynamicVar :: a -> ActionVar Dynamic a
  ```

  Should `perform` then be:

  ```haskell
  perform :: Typeable a => Action state Dynamic a -> m (PerformResult (Error state) a)
  ```

- The type of `postcondition` is:

  ```haskell
  postcondition :: (state, state) -> Action state a -> LookUp -> a -> PostconditionM m Bool
  ```

  which is a monad transformer on top of `m`, therefore allowing the postcondition to run monadic actions. This postcondition should be pure, as in parallel testing postconditions are not checked during execution but rather at the end of it:

  ```haskell
  postcondition :: (state, state) -> Action state a -> LookUp -> a -> Bool
  ```

  which would be in line with `COMMAND_post(S::dynamic_state(), Args::[term()], R::term()) :: bool()`. Or, as suggested, it could carry two parts, a monadic and a pure one. Sequential testing would run both, parallel testing would run only the pure one:

  ```haskell
  data Postcondition m
    = PostBool Bool
    | PostMonitor (Property -> Property) (Postcondition m)
    | PostMonadic (m (Postcondition m))

  postcondition :: (state, state) -> Action state a -> LookUp -> a -> Postcondition m

  purePostcondition :: Postcondition m -> PropertyM m' Bool
  purePostcondition (PostMonadic _) = pure True
  purePostcondition (PostBool a) = pure a
  purePostcondition (PostMonitor m p) = monitor m >> purePostcondition p

  fullPostcondition :: Postcondition m -> PropertyM m Bool
  fullPostcondition (PostMonadic m) = run m >>= fullPostcondition
  fullPostcondition (PostBool a) = pure a
  fullPostcondition (PostMonitor m p) = monitor m >> fullPostcondition p
  ```

- Again, `postcondition` only deals with dynamic states, thus the utility of `LookUp` also looks dubious to me. I would argue that the type should be:

  ```haskell
  postcondition :: state -> Action state Dynamic a -> a -> Postcondition m

  purePostcondition :: Postcondition m -> PropertyM m' Bool
  fullPostcondition :: Postcondition m -> PropertyM m Bool
  ```

  where one can use `nextState` to compute the now missing second component of the first argument.

Having the types above would make it much easier to define the infrastructure for parallel testing, plus unify the interfaces.
Did I get the resulting implications of the discussion above right?
I would also be interested in seeing cases where:
- The state is essential in `perform`
- `LookUp` is essential in `perform`
- Monadic operations are essential for `postcondition`
- `LookUp` is essential in `postcondition`
In the Peras codebase we wrote some postcondition code which fulfills at least the last ~~2~~ 1 item of your list: We ~~need a monadic action to retrieve some underlying state from the SUT, and we~~ need `LookUp` to relate the result of the monadic action to a list of results previously returned which need to be resolved.
This could possibly be expressed differently, for example by adding some "observation" action, perhaps.
- https://github.com/input-output-hk/peras-design/blob/main/peras-quickcheck/src/Peras/NetworkModel.hs#L155

  ```haskell
  postcondition (_, Network{slot}) (ChainsHaveCommonPrefix chainVars) env () = do
    let chains = fmap env chainVars
        prefix = commonPrefix chains
        chainLength = length prefix
        chainDensity :: Integer = if slot == 0 then 0 else floor @Double $ fromIntegral chainLength * 1000 / fromIntegral slot
    counterexamplePost $ "Chains: " <> show chains
    counterexamplePost $ "Common prefix: " <> show prefix
    monitorPost $ tabulate "Prefix length" ["<= " <> show ((chainLength `div` 100 + 1) * 100)]
    monitorPost $ tabulate "Prefix density" ["<= " <> show (chainDensity `div` 10 + 1) <> "%"]
    -- FIXME: this 50 is arbitrary, should be related to some network parameter
    pure $ slot < 50 || not (null prefix)
  postcondition _ _ _ _ = pure True
  ```

- https://github.com/input-output-hk/peras-design/blob/main/peras-quickcheck/src/Peras/NodeModel.hs#L136

  ```haskell
  postcondition (_before, NodeModel{slot}) (ForgedBlocksRespectSchedule blockVars) env stakeRatio
    | slot > 0 = do
        let blocks = foldMap env blockVars
            numberOfBlocks = length blocks
        counterexamplePost $ "Chain: " <> LT.unpack (decodeUtf8 (A.encode blocks))
        produceExpectedNumberOfBlocks stakeRatio numberOfBlocks slot
  postcondition _ _ _ _ = pure True

  produceExpectedNumberOfBlocks :: Monad m => Rational -> Int -> Slot -> PostconditionM m Bool
  produceExpectedNumberOfBlocks stakeRatio blocks slot = do
    let expectedBP :: Double = fromRational $ stakeRatio * toRational (fromIntegral slot * defaultActiveSlotCoefficient)
        actualBP = fromIntegral blocks
    pure $
      equalsBinomialWithinTails
        (fromIntegral slot) -- The sample size.
        (1 - (1 - defaultActiveSlotCoefficient) ** fromRational stakeRatio) -- Praos probability.
        3
        actualBP
  ```
In Hydra we don't use `LookUp` in the postconditions (see https://github.com/input-output-hk/hydra/blob/master/hydra-node/test/Hydra/Model.hs#L490) but we actually check postconditions in `perform`, with access to the monadic state of the SUT (https://github.com/input-output-hk/hydra/blob/master/hydra-node/test/Hydra/Chain/Direct/TxTraceSpec.hs#L202), which morally is a post-condition and therefore is equivalent to the 3rd requirement in your list.
There's been a discussion about this particular check and how it could have been written using a postcondition function, and this was problematic because the returned value from the action is not what's needed in the postcondition and it was considered cumbersome to change it.
I think for the first one, the "previously seen chains" should be part of the model. Morally I just think that the LookUp is just a fancier way of having a model that records every previous response. Precisely as eqc_statem says:
> A correctly written COMMAND_next function does not inspect symbolic inputs--it just includes them as symbolic parts of the result.
And we are doing this all the time just by having an Env and a LookUp.
@jasagredo exactly how do you propose maintaining the phase separation between the generation time and the runtime if you can't have symbolic variables that are resolved at runtime?
If you don't propose getting rid of the phase distinction then there is fundamentally no difference between having perform be monadic and forcing the poor user to maintain the lookup function themselves, and simply providing the lookup function for them.
If you do propose getting rid of the phase distinction that's a non-starter.
The only difference between eqc and qcd is that in erlang you can easily generically traverse the actions and translate the variables under the hood. In Haskell implementing that is almost impossible to do nicely so you have to make the user do it themselves - hence the lookup.
I am just making explicit that having a LookUp and an Env is probably fine. I'm saying that "morally" that could be part of the model, in the sense that having an Env that automatically records responses and changing model to be (Model, Env) and make that recording explicit are both equivalent approaches.
I've always liked that quickcheck-dynamic gives you a LookUp, whereas with quickcheck-state-machine you would be writing the same boilerplate code handling symbolic/concrete references every time you set up a new model-based test-suite. Having only symbolic variables also makes the framework easier to understand IMO: variables are always symbolic, but the user can resolve variables to real results if they have a LookUp in scope.