snabbkaffe icon indicating copy to clipboard operation
snabbkaffe copied to clipboard

Collection of utilities for trace-based testing

#+TITLE: Snabbkaffe

[[https://travis-ci.org/klarna/snabbkaffe.svg?branch=master]]

  • Introduction

This library provides functions for simple trace-based testing.

It works like this:

  1. Programmer manually instruments the code with trace points
  2. Testcases are split in two parts:
    • Run stage where the program runs and emits event trace
    • Check stage where trace is collected and validated against the spec(s)
  3. Trace points become ordinary log messages in the release build

This approach can be used in a component test involving an ensemble of interacting processes. It has a few nice properties:

  • Checks can be separated from the program execution
  • Checks are independent from each other and fully composable
  • Trace contains complete history of the process execution, thus making certain types of concurrency bugs, like livelocks, easier to detect
  • Usage

** Instrumenting the code

Code instrumentation is done manually by inserting =tp= macros at the points of interest:

#+BEGIN_SRC erlang ?tp(my_server_state_change, #{old_state => foo, next_state => bar}) #+END_SRC

The above line of code, when compiled in test mode, emits an event of /kind/ =my_server_state_change=, with some additional data specified in the second argument. Any event has a /kind/, which is an atom identifying type of the event. The second argument must be a map, and it can contain any data.

In the release build this macro will become a [[https://github.com/tolbrino/hut/][hut]] log message with =debug= level. One can as well tune trace point log level:

#+BEGIN_SRC erlang ?tp(notice, my_server_state_change, #{old_state => foo, next_state => bar}) #+END_SRC

** Collecting trace

Trace collector process must be started before running the test. Full workflow looks like this:

#+BEGIN_SRC erlang ok = snabbkaffe:start_trace(), Return = RunYourCode(), Trace = snabbkaffe:collect_trace(Timeout), snabbkaffe:stop(), RunCheck1(Return, Trace), RunCheck2(Return, Trace), ... RunCheckN(Return, Trace). #+END_SRC

Note that =collect_trace= function is destructive: it cleans event buffer of the collector process. Its argument =Timeout= specifies how long the call will wait after the last event is received. Setting this parameter to a non-zero value is useful when /run stage/ starts some asynchronous tasks. By default =Timeout= is 0.

** Checking traces

Trace is just a list of maps. Any standard function can work with it. Nonetheless, /Snabbkaffe/ comes with a few useful macros for trace analysis.

Select events of certain kind:

#+BEGIN_SRC erlang ?of_kind(foo, Trace) #+END_SRC

Extract values of a certain event field(s) to a list:

#+BEGIN_SRC erlang [1, 2] = ?projection(foo, [#{foo => 1, quux => 1}, #{foo => 2}]),

[{1, 1}, {2, 2}] = ?projection([foo, bar], [#{foo => 1, bar => 1}, #{foo => 2, bar => 2}]) #+END_SRC

Check that events occur in certain order, and throw an exception otherwise:

#+BEGIN_SRC erlang ?causality( #{?snk_kind := foo, foo := _A} , #{?snk_kind := bar, foo := _A} , Trace ) #+END_SRC

First argument of =?causality= macro matches an event that is called the /cause/, and the second one matches so called /effect/. The above example checks events of kind =bar= occur only after corresponding events of kind =foo=.

This macro can be extended with a guard as well. Here guard checks that two events actually make up a pair:

#+BEGIN_SRC erlang ?causality( #{?snk_kind := foo, foo := _A} , #{?snk_kind := bar, foo := _B} , _A + 1 =:= _B , Trace ) #+END_SRC

There is a version of the above macro that checks that all cause have an effect:

#+BEGIN_SRC erlang ?strict_causality( #{?snk_kind := foo, foo := _A} , #{?snk_kind := bar, foo := _A} , Trace ) #+END_SRC

Otherwise it works just like =?causality=.

Both =?causality= and =?strict_causality= are actually based on a more powerful =?find_pairs= macro that is invoked like this:

#+BEGIN_SRC erlang ?find_pairs( Strict , MatchCause , MatchEffect [, Guard] , Trace ) #+END_SRC

where =Strict= is a boolean that determines whether events that matched as =Effect= may precede their cause.

It returns a list of tuples of type =snabbkaffe:maybe_pair/0= that is defined like that:

#+BEGIN_SRC erlang -type maybe_pair() :: {pair, event(), event()} | {singleton, event()}. #+END_SRC

** Gathering it all together

=?check_trace= is a convenience wrapper that starts the trace collector process, executes /run stage/, collects traces and then executes /check stage/:

#+BEGIN_SRC erlang ?check_trace(begin RunStage end, fun(ReturnValue, Trace) -> CheckStage end). #+END_SRC

There is an extended version of this macro that takes additional configuration as the first argument:

#+BEGIN_SRC erlang ?check_trace(#{timeout => Timeout, bucket => Bucket}, begin RunStage end, fun(ReturnValue, Trace) -> CheckStage end). #+END_SRC

or:

#+BEGIN_SRC erlang ?check_trace(Bucket, begin RunStage end, fun(ReturnValue, Trace) -> CheckStage end). #+END_SRC

=Bucket= is a parameter used for benchmarking, more on that later.

** Blocking execution of testcase until certain event is emitted

Even though philosophy of this library lies in separation of run and verify stages, sometimes the former needs to be somewhat aware of the events. For example, the testcase may need to wait for asynchronous initialization of some resource.

In this case =?block_until= macro should be used. It allows the testcase to peek into the trace. Example usage:

#+BEGIN_SRC erlang ?block_until(#{?snk_kind := Kind}, Timeout, BackInTime) #+END_SRC

Note: it's tempting to use this macro to check the result of some asynchronous action, like this:

#+BEGIN_SRC erlang {ok, Pid} = foo:async_init(), {ok, Event} = ?block_until(#{?snk_kind := foo_init, pid := Pid}), do_stuff(Pid) #+END_SRC

However it's not a good idea, because the event can be emitted before =?block_until= has a chance to run. Use the following macro to avoid this race condition:

#+BEGIN_SRC {{ok, Pid}, {ok, Event}} = ?wait_async_action( foo:async_init() , #{?snk_kind := foo_init, pid := Pid} ), do_stuff(Pid) #+END_SRC

** Declarative fault injection

Any trace point can also be used to inject crashes into the system. This is extremely useful for testing fault-tolerance properties of the system and tuning the supervision trees. This is done using =?inject_crash= macro, like in the below example:

#+BEGIN_SRC erlang FaultId = ?inject_crash( #{?snk_kind := some_kind, value := 42} % Pattern for matching trace points , snabbkaffe_nemesis:always_crash() % Fault scenario , notmyday % Error reason ) #+END_SRC

Running this command in the run stage of the testcase will ensure that every time the system tries to emit a trace event matching the pattern, the system will crash with a reason =notmyday=, and emit a trace event of kind =snabbkaffe_crash=.

First argument of the macro is a pattern that is used for matching trace events. Second argument is a "fault scenario", that determines how often the system should fail. The following scenarios are implemented:

  • =snabbkaffe_nemesis:always_crash()= -- always crash, emulates unrecoverable errors
  • =snabbkaffe_nemesis:recover_after(N)= -- crash =N= times, and then proceed normally, emulates recoverable errors
  • =snabbkaffe_nemesis:random_crash(P)= -- crash in a pseudo-random pattern with probability =P=, emulates an unreliable resource
  • =snabbkaffe_nemesis:periodic_crash(Period, DutyCycle, Phase)= -- crash periodically, like this: =[ok, ok, ok, crash, crash, ok, ok, ok, crash, crash|...]=
    • =Period= is an integer that specifies period of the crash-recover cycle
    • =DutyCycle= is a float in =[0..1]= range, that specifies relative amount of time when the trace point is /not/ crushing. (For example, 1 means the system doesn't crash, and 0 means it always crashes)
    • =Phase= is a float in =[0..2*math:pi()]= range that allows to shift the phase of the periodic scenario

Finally, the third argument is a crash reason. It is optional, and defaults to the atom =notmyday=.

Please note that fault scenarios work independently for each /trace point/. E.g. if there are two trace point that both match the same fault injection pattern with =recover_after= scenario, they will recover at different times.

Later =snabbkaffe_nemesis:fix_crash(FaultId)= call can be used to delete the injected crash.

** PropER integration

There are two useful macros for running /snabbkaffe/ together with [[https://proper-testing.github.io/][propER]]:

#+BEGIN_SRC erlang Config = [{proper, #{ numtests => 100 , timeout => 5000 , max_size => 100 }}, ...], ?run_prop(Config, PROP) #+END_SRC

=Config= parameter should be a proplist or a map, that (optionally) contains =proper= key. It can be used to pass different parameters to proper. Snabbkaffe will fall back to the default values (shown above) when parameter is absent.

=PROP= is a proper spec that looks something like this:

#+BEGIN_SRC erlang ?FORALL({Ret, L}, {term(), list()}, ?check_trace( begin RunStage end, fun(Return, Trace) -> CheckStage end)) #+END_SRC

There is another macro for the most common type of proper checks where property is a simple =?FORALL= clause (like in the above example).

#+BEGIN_SRC erlang ?forall_trace({Ret, L}, {term(), list()}, begin RunStage end, fun(Return, Trace) -> CheckStage end) #+END_SRC

It combines =?FORALL= and =?run_prop=.

** Concuerror support

Snabbkaffe has (highly) experimental support for [[https://concuerror.com][Concuerror]]. It requires recompiling this library with special options, so creating a special build profile is recommended. This can be done by adding the following code to the =rebar.config=:

#+BEGIN_SRC erlang {profiles, [ {concuerror, [ {overrides, [{add, [{erl_opts, [ {d, 'CONCUERROR'} , {d, 'HUT_IOFORMAT'} ]}]}]} ]} ]}. #+END_SRC

Run concuerror with the following flags:

#+BEGIN_SRC bash $(CONCUERROR) --treat_as_normal shutdown --treat_as_normal normal
-x code -x code_server -x error_handler
--pa $(BUILD_DIR)/concuerror+test/lib/snabbkaffe/ebin #+END_SRC

P.S. Again, this feature is experimental, use at your own risk.

  • Benchmarking

/Snabbkaffe/ automatically adds timestamps to the events, which makes it a very unscientific benchmarking library.

There is a family of functions for reporting metric data.

Report a scalar metric called =my_metric1=:

#+BEGIN_SRC erlang snabbkaffe:push_stat(my_metric1, 42), snabbkaffe:push_stats(my_metric1, [42, 43, 42]), %% Or even: snabbkaffe:push_stats(my_metric1, [{pair, Event1, Event2}, {pair, Event3, Event4}, ...]), #+END_SRC

Sometimes it is entertaining to see how metric value depends on the size of the input data:

#+BEGIN_SRC erlang snabbkaffe:push_stat(my_metric1, SizeOfData, 42), snabbkaffe:push_stats(my_metric1, SizeOfData, [42, 43, 42]) #+END_SRC

Metrics can be reported by calling =snabbkaffe:analyze_statistics/0= function that prints statistics for each reported metric, like in the above example:

#+BEGIN_EXAMPLE

foo_bar statistics: [{min,9.999999999999999e-6}, {max,9.999999999999999e-6}, {arithmetic_mean,1.000000000000002e-5}, {geometric_mean,1.0000000000000123e-5}, {harmonic_mean,9.999999999999997e-6}, {median,9.999999999999999e-6}, {variance,4.174340734454146e-40}, {standard_deviation,2.0431203426264804e-20}, {skewness,-0.9850375627355535}, {kurtosis,-2.0199000000000003}, {percentile,[{50,9.999999999999999e-6}, {75,9.999999999999999e-6}, {90,9.999999999999999e-6}, {95,9.999999999999999e-6}, {99,9.999999999999999e-6}, {999,9.999999999999999e-6}]}, {histogram,[{9.999999999999999e-6,100}]}, {n,100}]

Statisitics of test 100.479087 ^ * | * | * | * | | * | * | | * | * | * | | * | * 0 +---------------------------------------------------------------------> 0 1100

     N    min         max        avg
   110 1.23984e+0 1.09774e+1 5.97581e+0
   209 1.10121e+1 2.08884e+1 1.60011e+1
   308 2.13004e+1 3.09071e+1 2.60224e+1
   407 3.10212e+1 4.09074e+1 3.59904e+1
   506 4.10095e+1 5.09904e+1 4.60456e+1
   605 5.11370e+1 6.08557e+1 5.60354e+1
   704 6.10493e+1 7.09071e+1 6.59642e+1
   803 7.11237e+1 8.07733e+1 7.59588e+1
   902 8.10944e+1 9.09766e+1 8.60179e+1
  1001 9.10459e+1 9.99404e+1 9.54548e+1
  1100 1.00004e+2 1.00939e+2 1.00479e+2

#+END_EXAMPLE

Note: =?run_prop= does this automatically.