CASSANDRA-14459: DynamicEndpointSnitch should never prefer latent replicas

Open jolynch opened this issue 7 years ago • 0 comments

This change incorporates the feedback from Ariel and Jason as part of https://issues.apache.org/jira/browse/CASSANDRA-14459.

The following is introduced:

- A fully pluggable DynamicEndpointSnitch (DES), so that we can continue experimenting with new implementations.
- Instead of resetting every 10 minutes, the DES sends active latency probes to replicas that it was asked to rank but has no recent data on. These probes are rate-limited by default to a single probe per second. While not perfect, they will correctly detect nodes that are latent due to network conditions, JVM instability (GC/safepoint pauses), and read thread pool exhaustion.
- A new opt-in implementation of the DES that uses an exponential moving average (EMA) instead of a histogram. Both statistical measures try to produce a noise-reduced sample, with different tradeoffs; the main points in favor of the EMA are that it reacts faster to extreme outliers (e.g. if a node is actively timing out and dropping messages) and generates about 100x less garbage than the histogram approach.
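
To illustrate the probing behaviour described above, here is a minimal sketch of the decision "should this replica be probed now?". It is not the code from the patch: the class name `ProbeLimiter`, its methods, and the staleness threshold are hypothetical, and time is passed in explicitly so the logic is easy to test. The sketch is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: probe replicas with no recent latency data, at a bounded rate. */
final class ProbeLimiter {
    private final long minIntervalNanos;  // minimum spacing between probes (1s gives one probe per second)
    private final long staleAfterNanos;   // assumed threshold: older data counts as "no recent data"
    private final Map<String, Long> lastSampleNanos = new HashMap<>();
    private long lastProbeNanos;
    private boolean probedBefore = false;

    ProbeLimiter(long minIntervalNanos, long staleAfterNanos) {
        this.minIntervalNanos = minIntervalNanos;
        this.staleAfterNanos = staleAfterNanos;
    }

    /** Note that a real latency measurement for the replica arrived at the given time. */
    void recordSample(String replica, long nowNanos) {
        lastSampleNanos.put(replica, nowNanos);
    }

    /** Returns true if the replica lacks recent data and the rate limit allows a probe now. */
    boolean shouldProbe(String replica, long nowNanos) {
        Long last = lastSampleNanos.get(replica);
        boolean stale = last == null || nowNanos - last > staleAfterNanos;
        if (!stale) {
            return false; // recent data exists, so no probe is needed
        }
        if (probedBefore && nowNanos - lastProbeNanos < minIntervalNanos) {
            return false; // rate limited: a probe was sent too recently
        }
        probedBefore = true;
        lastProbeNanos = nowNanos;
        return true;
    }
}
```

A caller would invoke `shouldProbe` for each replica it was asked to rank, passing something like `System.nanoTime()`, and send a lightweight request only when it returns true.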
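
As a rough illustration of why an exponential moving average reacts quickly to outliers, the sketch below keeps one running score per replica and folds each new latency sample into it. The class name `EmaLatencyTracker` and the smoothing factor `alpha` are assumptions made for this example; the actual implementation in the patch may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: per-replica latency score kept as an exponential moving average. */
final class EmaLatencyTracker {
    private final double alpha; // weight of the newest sample, in (0, 1]
    private final Map<String, Double> scores = new HashMap<>();

    EmaLatencyTracker(double alpha) {
        if (!(alpha > 0.0 && alpha <= 1.0)) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    /** Fold one latency sample (milliseconds) into the replica's running average. */
    void record(String replica, double latencyMillis) {
        // The first sample seeds the average; later samples move it by alpha * (sample - old).
        scores.merge(replica, latencyMillis, (old, sample) -> old + alpha * (sample - old));
    }

    /** Current score, or NaN if no sample has been recorded for the replica. */
    double score(String replica) {
        return scores.getOrDefault(replica, Double.NaN);
    }

    public static void main(String[] args) {
        EmaLatencyTracker tracker = new EmaLatencyTracker(0.5);
        tracker.record("replica-1", 10.0);   // healthy response
        tracker.record("replica-1", 2000.0); // a timeout-sized outlier
        System.out.println(tracker.score("replica-1"));
    }
}
```

With `alpha = 0.5`, a single 2000 ms sample moves a 10 ms average to 1005 ms, so one timeout is enough to push the replica down the ranking. Only the current score is kept per replica rather than a window of samples.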

Oct 14 '18 22:10 jolynch