atlasdb
Improve blacklisting methodology in face of bad cassandra node/performance
See internal issue PDS-111266. When a single Cassandra node degrades, atlasdb today does a poor job of determining which node is bad and blacklisting it from future query routing. In the specific cited example, ~33% of all timeouts were driven by Atlas querying the known bad node.
Current state of the world:
- Cassandra is very good at maintaining read latency (p99 specifically) in the face of single-node degradation
- Cassandra still times out more than desired on all nodes during single-node degradation; the bad node more so than the others, but the difference is not immediately clear from timeouts alone
Idea: Use timeouts as the indicator that "I must do something to improve query routing", but use the observed performance of queries against individual nodes as the basis for stack-ranking hosts and choosing which to exclude.
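
A rough sketch of what this could look like, to make the idea concrete. Everything here is hypothetical and not existing atlasdb code: the `HostRanker` class, its method names, the exponentially weighted moving average used as the per-host performance measure, and the `alpha` and `timeoutThreshold` parameters are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: timeouts trigger action, per-host latency picks the victim. */
final class HostRanker {
    // Smoothed latency per host (EWMA is an assumed choice of metric).
    private final Map<String, Double> avgLatencyMillis = new HashMap<>();
    private final double alpha;          // assumed smoothing factor in (0, 1]
    private final int timeoutThreshold;  // assumed number of timeouts before acting
    private int recentTimeouts = 0;

    HostRanker(double alpha, int timeoutThreshold) {
        this.alpha = alpha;
        this.timeoutThreshold = timeoutThreshold;
    }

    /** Record the latency of a completed query against a host. */
    void recordLatency(String host, double millis) {
        avgLatencyMillis.merge(host, millis,
                (old, cur) -> (1 - alpha) * old + alpha * cur);
    }

    /** Record a timeout from any host; this only signals that routing needs attention. */
    void recordTimeout() {
        recentTimeouts++;
    }

    /** Stack-rank hosts by smoothed latency, slowest first. */
    List<String> rankSlowestFirst() {
        List<String> hosts = new ArrayList<>(avgLatencyMillis.keySet());
        hosts.sort(Comparator
                .comparingDouble((String h) -> avgLatencyMillis.get(h))
                .reversed());
        return hosts;
    }

    /** Once enough timeouts have accumulated, nominate the slowest host for exclusion. */
    Optional<String> hostToExclude() {
        if (recentTimeouts < timeoutThreshold || avgLatencyMillis.size() < 2) {
            return Optional.empty();
        }
        recentTimeouts = 0; // reset so the next decision needs fresh evidence
        return Optional.of(rankSlowestFirst().get(0));
    }
}
```

The point of the split is that the timeout count never decides which host is excluded. It only gates when the ranking is consulted, since timeouts occur on all nodes during single-node degradation.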