ex_unit_clustered_case

Implement fault injection helpers

Open bitwalker opened this issue 7 years ago • 3 comments

It's often desirable to test that some behavior in the application stays consistent in the face of random faults, or that it matches the expected behavior. The nature of these faults is often application specific, though some faults can be offered out of the box (e.g. random partitions).

An API is needed which supports injecting faults in useful ways, and supports arbitrary extension (some constraints may be needed, but I'm not yet sure what those are).

  • [ ] First, need to figure out what types of faults apply to applications in general
  • [ ] Second, determine what types of faults may be specific to a few different applications, and then find the traits they have in common, so we can work towards an API which lets each application implement fault injection in the way it needs

bitwalker, Jul 22 '18 21:07

A couple of suggestions as conversation starters:

  • latency
  • timeouts
  • unreachable nodes
  • crashed nodes (different behavior may be desirable in crashed vs. unreachable state)
  • flapping
  • unhandled messages
  • lib/app failure (supervised app failures)
  • crashed procs

beardedeagle, Jul 24 '18 02:07

Those are all great suggestions! Here are my thoughts:

  • Latency, timeouts, and dropped messages can all be handled by implementing an alternative distribution carrier, I think, and running all of the slave nodes with that carrier rather than the default. We'd need to communicate with the carrier in order to enable or disable faults of a given type for a specific node or pair of nodes. Unfortunately, we can't match on specific messages to fault on, since by the time data reaches the carrier it has already been encoded for transport by the runtime. Perhaps we can reverse the encoding, but I don't yet have an implementation which would let me experiment with that, and I suspect it's not as simple as :erlang.binary_to_term/1.
  • We can use the new partitioning API to make nodes unreachable
  • We can use the stop or kill functions added to Node along with the :heart option to allow crashing nodes, and by looping on them, flapping nodes. We can also use the partitioning API to support an alternative flapping behavior, where the network is flaky rather than some node.
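
The "looping" idea for flapping can be sketched independently of any particular fault. This is only a minimal sketch: `FlapSketch.flap/4` is a hypothetical helper, not part of this library, and the `down`/`up` callbacks stand in for whatever actually stops and restarts a node or partitions and heals the network. Here they just record events so the example runs on its own.

```elixir
defmodule FlapSketch do
  # Hypothetical helper (not part of ex_unit_clustered_case): run `down`,
  # wait, run `up`, wait, and repeat for the given number of cycles.
  def flap(_down, _up, 0, _interval_ms), do: :ok

  def flap(down, up, cycles, interval_ms)
      when is_function(down, 0) and is_function(up, 0) and cycles > 0 do
    down.()
    Process.sleep(interval_ms)
    up.()
    Process.sleep(interval_ms)
    flap(down, up, cycles - 1, interval_ms)
  end
end

# Stand-in callbacks: record the order of events instead of touching nodes.
{:ok, log} = Agent.start_link(fn -> [] end)
record = fn event -> Agent.update(log, &[event | &1]) end

:ok = FlapSketch.flap(fn -> record.(:down) end, fn -> record.(:up) end, 2, 10)
events = Agent.get(log, &Enum.reverse/1)
```

In a clustered test the two callbacks would presumably wrap the stop/kill functions or the partitioning API mentioned above; their exact names and arguments are deliberately not assumed here.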

In my opinion, application failure or crashed processes would fall under "user defined" faults, e.g. using Cluster.call(node, fn -> Process.exit(some_important_process, :kill) end), where some_important_process is the pid of the process to crash.
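
As a rough illustration of such a user-defined fault, the sketch below kills a supervised process and waits for its supervisor to bring up a replacement. It runs on the local node only; `FaultSketch` and `kill_and_await_restart/2` are hypothetical names, and the `:some_important_process` Agent is a stand-in for a real application process. In a clustered test the same function would be what gets passed to Cluster.call.

```elixir
defmodule FaultSketch do
  # Hypothetical helper: kill a locally registered process, then poll until
  # a different pid is registered under the same name.
  def kill_and_await_restart(name, timeout_ms \\ 1_000) do
    case Process.whereis(name) do
      nil ->
        {:error, :not_running}

      old_pid ->
        ref = Process.monitor(old_pid)
        Process.exit(old_pid, :kill)

        receive do
          {:DOWN, ^ref, :process, ^old_pid, _reason} ->
            await_new_pid(name, old_pid, timeout_ms)
        after
          timeout_ms -> {:error, :not_killed}
        end
    end
  end

  defp await_new_pid(name, old_pid, remaining_ms) when remaining_ms > 0 do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        {:ok, pid}

      _ ->
        Process.sleep(10)
        await_new_pid(name, old_pid, remaining_ms - 10)
    end
  end

  defp await_new_pid(_name, _old_pid, _remaining_ms), do: {:error, :not_restarted}
end

# Stand-in for an application process: a named Agent under a supervisor.
children = [
  %{id: :important, start: {Agent, :start_link, [fn -> 0 end, [name: :some_important_process]]}}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)
old_pid = Process.whereis(:some_important_process)
{:ok, new_pid} = FaultSketch.kill_and_await_restart(:some_important_process)
```

Waiting on the monitor's `:DOWN` message before polling avoids a race where the old pid is still registered when the check runs.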

One thing I'm trying to figure out is whether we need a specific API for "injecting" faults, or whether we just need to offer the pieces (e.g. partition, kill, call) and let users set up faults the way they need. The class of faults that makes me wonder about this is the one that would require the alternative distribution carrier, since those are all variants of the same thing (degraded network performance or outright failure), and I'm not yet sure what to call that API or how it would look.

bitwalker, Jul 24 '18 16:07

I would agree with the "user defined" faults bit. As for injecting faults, I think the pieces, with reasonable docs, would be good enough, to be honest.

beardedeagle, Jul 31 '18 01:07