atlasdb
Cassandra KVS: Blacklisting for Individual Node Slowness
Internal reference: PDS-92632
We need to be careful with this one. Quoting my comment on the ticket:
- The mechanics of blacklisting in AtlasDB are that requests which would have gone to a blacklisted node are routed to the other two nodes that hold the requested data. That would be useful in this specific case. Unfortunately, AtlasDB sometimes receives requests that are inherently resource-intensive or possibly conflicting (e.g. range scan the world, certain kinds of CAS operations), and it's very difficult to identify these a priori. Blacklisting on the basis of timeouts could therefore eventually result in unbalanced load and/or the entire cluster being blacklisted. In the latter case queries are sent to nodes completely at random, which means we lose the token ring-aware performance optimisations in Atlas. The same would hold in the presence of cluster-wide slowness, where we'd likely increase the overall load on Cassandra.
- A TimedOutException occurring on some node N doesn't necessarily indicate a problem with N: internally, N might need to contact another node N' that is the one actually being slow, e.g. if N' is the 'owner' of the token that defines the interval in which the data lives (see the read path section in Cassandra ArchitectureInternals).
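To make the first point concrete, here is a deliberately simplified model of blacklist-aware host selection. It is a sketch, not AtlasDB's actual code: `BlacklistAwareSelector`, `pickHost` and the two-tier fallback are hypothetical names and behaviour chosen only to illustrate how token awareness is lost once every replica is blacklisted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical illustration only; not an AtlasDB class. */
final class BlacklistAwareSelector {
    private BlacklistAwareSelector() {}

    /**
     * Picks a host for a request. Assumes allHosts is non-empty.
     *
     * @param replicasForToken hosts that own the data for the requested token
     * @param allHosts         every host in the cluster
     * @param blacklisted      hosts currently considered unhealthy
     */
    static String pickHost(List<String> replicasForToken,
                           List<String> allHosts,
                           Set<String> blacklisted,
                           Random random) {
        // Token-aware path: prefer a replica that is not blacklisted.
        List<String> healthyReplicas = new ArrayList<>();
        for (String host : replicasForToken) {
            if (!blacklisted.contains(host)) {
                healthyReplicas.add(host);
            }
        }
        if (!healthyReplicas.isEmpty()) {
            return healthyReplicas.get(random.nextInt(healthyReplicas.size()));
        }
        // Degenerate path (assumed behaviour): no healthy replica is left, so
        // the request goes to an arbitrary host, which may not own the data
        // and must then coordinate with the real owners.
        return allHosts.get(random.nextInt(allHosts.size()));
    }
}
```

With one replica blacklisted, the other two absorb its traffic. If expensive queries cause timeouts on all three, every request for that token range falls through to the random path.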
To address this, I think it may be reasonable to implement a background task that polls each node in the cluster with a query that is known to be fast and doesn't rely on communication with other nodes (maybe a nodetool operation). We can then detect outliers and blacklist only those, which would avoid both of the failure cases mentioned above.
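As a rough sketch of the outlier-detection step, assuming the background task has already collected one probe latency per node: the class `SlowNodeDetector`, the method `findOutliers` and the "multiple of the median" rule are all hypothetical choices for illustration, not an existing AtlasDB API or a settled design. How the latencies are measured is left out entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper; not part of AtlasDB. */
final class SlowNodeDetector {
    private SlowNodeDetector() {}

    /**
     * Returns the nodes whose probe latency exceeds factor times the cluster
     * median. Comparing against the median (rather than a fixed timeout)
     * means that uniform, cluster-wide slowness flags nobody.
     */
    static Set<String> findOutliers(Map<String, Long> probeLatencyMillis, double factor) {
        Set<String> outliers = new TreeSet<>();
        if (probeLatencyMillis.size() < 3) {
            return outliers; // too few nodes to call any of them an outlier
        }
        List<Long> sorted = new ArrayList<>(probeLatencyMillis.values());
        Collections.sort(sorted);
        int n = sorted.size();
        double median = (n % 2 == 1)
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
        double threshold = median * factor;
        for (Map.Entry<String, Long> entry : probeLatencyMillis.entrySet()) {
            if (entry.getValue() > threshold) {
                outliers.add(entry.getKey());
            }
        }
        return outliers;
    }
}
```

For example, with probe latencies of 10 ms, 12 ms and 400 ms and a factor of 3, only the third node is flagged. If all three nodes report around 400 ms, the median rises with them and the result is empty, so nothing gets blacklisted. Because the probe is node-local, a slow N' should not make N look slow.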