atlasdb

Improve blacklisting methodology in face of bad cassandra node/performance

Open tpetracca opened this issue 5 years ago • 0 comments

See internal issue PDS-111266. When a single Cassandra node's performance degrades, AtlasDB today does a poor job of determining which node is bad and blacklisting it from future query routing. In the specific cited example, ~33% of all timeouts were driven by Atlas querying the known bad node.

Current state of the world:

  • Cassandra is very good at maintaining read latency (p99 specifically) in the face of single-node degradation
  • Cassandra still times out more than desired on all nodes during single-node degradation; the bad node more so than others, but the difference is less immediately clear

Idea: Use timeouts as an indicator for "I must do something to improve query routing", but actually use performance of queries from individual nodes as the methodology for stack ranking hosts and choosing who to exclude.
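
A minimal sketch of the idea, not AtlasDB code: timeouts only trigger a re-evaluation, while the choice of which host to exclude comes from per-host query latency. The class name `HostLatencyRanker`, the use of an exponentially weighted moving average, and the `alpha` and `outlierFactor` parameters are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration only; none of these names exist in AtlasDB. */
class HostLatencyRanker {
    private final double alpha;          // weight given to each new latency sample
    private final double outlierFactor;  // how many times slower than peers counts as "bad"
    private final Map<String, Double> ewmaMillis = new HashMap<>();

    HostLatencyRanker(double alpha, double outlierFactor) {
        this.alpha = alpha;
        this.outlierFactor = outlierFactor;
    }

    /** Record the observed latency of one completed query against a host. */
    synchronized void recordLatency(String host, double millis) {
        // The first sample seeds the average; later samples are blended in.
        ewmaMillis.merge(host, millis,
                (old, sample) -> (1 - alpha) * old + alpha * sample);
    }

    /** Stack rank hosts by smoothed latency, slowest first. */
    synchronized List<String> rankSlowestFirst() {
        List<String> hosts = new ArrayList<>(ewmaMillis.keySet());
        hosts.sort(Comparator
                .comparingDouble((String h) -> ewmaMillis.get(h))
                .reversed());
        return hosts;
    }

    /**
     * Called when any query times out. The timeout itself does not blame a
     * host; it only prompts a look at the latency ranking. Returns the
     * slowest host if it is clearly worse than its peers.
     */
    synchronized Optional<String> onTimeout() {
        List<String> ranked = rankSlowestFirst();
        if (ranked.size() < 3) {
            return Optional.empty(); // too few hosts to call one an outlier
        }
        String worst = ranked.get(0);
        List<String> peers = ranked.subList(1, ranked.size());
        // A middle element of the remaining hosts, used as an approximate median.
        double peerMedian = ewmaMillis.get(peers.get(peers.size() / 2));
        return ewmaMillis.get(worst) > outlierFactor * peerMedian
                ? Optional.of(worst)
                : Optional.empty();
    }
}
```

With three hosts whose smoothed latencies are 10 ms, 12 ms, and 80 ms and an `outlierFactor` of 2.0, a timeout on any host would nominate the 80 ms host for exclusion, regardless of which host the timed-out query was sent to.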

tpetracca, Feb 15 '20 19:02