observability-workshop
Identify the less observable fault
Identified the fault we want to inject
-
Requirements for the issue:
- Cannot be debugged from the UI
- Cannot be identified in metrics, logs, or traces
- Exposes a clear gap in observability
- Fairly obscure, since that is when observability is most needed (but common enough that it can't be dismissed as “works on my machine”)
-
Possible ideas:
- Can we create a service that frequently turns into a black hole? It would emit no diagnostics and fail in a frustrating way, for example by accepting incoming requests but never returning a response.
- An issue that is exacerbated on a single node/VM
- Use Docker chaos tooling to degrade traffic between a couple of nodes
-
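A minimal sketch of the black-hole idea, assuming a raw TCP service written in Python. The names `start_black_hole` and `probe` are hypothetical and not part of any existing workshop code. The server accepts connections but never reads, replies, closes, or logs, so from the client's side the request simply hangs:

```python
import socket
import threading


def start_black_hole(host: str = "127.0.0.1", port: int = 0) -> tuple[socket.socket, int]:
    """Start a hypothetical 'black hole' TCP service.

    It accepts connections but never reads the request, never writes a
    response, never closes the connection, and logs nothing.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))  # port 0 lets the OS pick a free port
    server.listen()
    held = []  # keep accepted sockets referenced so they are never closed

    def accept_loop() -> None:
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # the server socket was closed
            held.append(conn)  # accept, then go silent: no reply, no close, no log

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1]


def probe(port: int, timeout: float = 1.0) -> str:
    """Send a plain HTTP request and report what happened."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as conn:
        conn.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        try:
            data = conn.recv(1024)
            return "response" if data else "closed"
        except socket.timeout:
            return "timeout"


if __name__ == "__main__":
    server, port = start_black_hole()
    print(probe(port))  # the client gives up after the timeout instead of getting a reply
    server.close()
```

Because the TCP handshake still succeeds, a connection-level health check would report this service as up; only a client that enforces its own deadline notices the hang.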
Documented way to create it
-
Documented way to fix it
-
Example graphs/alerts to identify the issue
-
Example of how to increase observability to track the issue
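One possible approach, sketched under the assumption of the same raw TCP setup: a synthetic client-side probe with a hard deadline, which turns a silent hang into something countable. `ProbeStats` and `probe_once` are hypothetical names, and the in-memory counters stand in for whatever metrics backend the workshop ends up using. A rising timeout ratio is the kind of signal that could back a graph or alert:

```python
import socket
import time
from dataclasses import dataclass, field


@dataclass
class ProbeStats:
    """Hypothetical in-memory stand-in for real metrics (counters plus a latency list)."""
    attempts: int = 0
    timeouts: int = 0
    errors: int = 0
    latencies: list[float] = field(default_factory=list)

    def timeout_ratio(self) -> float:
        # Fraction of probes that hung until the deadline; a black hole drives this toward 1.0.
        return self.timeouts / self.attempts if self.attempts else 0.0


def probe_once(host: str, port: int, payload: bytes, timeout: float, stats: ProbeStats) -> None:
    """Send one request with a hard deadline and record the outcome in `stats`."""
    stats.attempts += 1
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(payload)
            conn.recv(1024)  # any bytes, or a clean close, counts as "answered"
        stats.latencies.append(time.monotonic() - start)
    except socket.timeout:
        # Emit the diagnostic the black-hole service never does.
        stats.timeouts += 1
        print(f"probe timeout host={host} port={port} deadline={timeout}s")
    except OSError as exc:
        # Refused or reset connections are a different failure mode; count them separately.
        stats.errors += 1
        print(f"probe error host={host} port={port} error={exc!r}")
```

Running `probe_once` on a schedule against each node and exporting `timeout_ratio()` per target would also help separate a single misbehaving node/VM from healthy ones, since the ratio would rise only for the affected target.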