Major service degradation despite replication when a pod is terminated abruptly due to node-not-ready issues
Affected versions
- All users of the Kubernetes extension are probably affected by this
Description
- In the Druid Kubernetes extension, the data nodes have a lifecycle announcement stage that is executed before termination
- During a normal termination, this stage unannounces the pod while it is in the Terminating state (see the sketch after this list)
- As a result, the Druid master nodes and brokers become aware of the termination and stop assigning tasks/segments to, or routing queries to, that pod
- When a node-not-ready issue occurs, processing on the pod stops abruptly, so no unannouncement is made to the other nodes
- Master nodes and brokers/routers continue to see that node as available, which results in high query latency and a monotonically increasing loadQueue count, or, in the case of an indexer, a monotonically increasing ingestion lag
- This continues until the issue is fixed (for indexers) or the retention period has passed (for historicals)
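A minimal sketch of the mechanism, in plain Java rather than Druid's actual lifecycle classes (the `NodeAnnouncer` interface and the class names here are hypothetical): the unannouncement only runs from a shutdown hook, so anything that kills the process without an orderly shutdown (SIGKILL, or a node that simply stops running the pod) leaves the announcement behind.

```java
// Simplified illustration, not Druid's actual Lifecycle implementation.
public class AnnouncementLifecycle {

    /** Placeholder for whatever mechanism publishes "this node is available". */
    interface NodeAnnouncer {
        void announce(String nodeId);
        void unannounce(String nodeId);
    }

    private final NodeAnnouncer announcer;
    private final String nodeId;

    AnnouncementLifecycle(NodeAnnouncer announcer, String nodeId) {
        this.announcer = announcer;
        this.nodeId = nodeId;
    }

    void start() {
        // Announce on startup so masters/brokers start using this node.
        announcer.announce(nodeId);

        // Runs only on an orderly JVM shutdown (e.g. SIGTERM during normal pod
        // termination). A SIGKILL, or a node that abruptly stops running the
        // process, never executes this hook, so peers keep seeing the node.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            announcer.unannounce(nodeId);
            // ...then the remaining services are stopped...
        }));
    }
}
```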
Reproducing the issue
- Deployed Druid on a test cluster and deliberately disabled the unannouncement of the node at the end of termination
- Also added a sleep after stop() along with increasing the gracefulTerminationPeriod, which kept the pod in the Terminating state for a long time (see the sketch after this list)
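Roughly, the shutdown path was modified like this for the repro (a hypothetical sketch, not the actual patch; `stopServices` stands in for whatever stops the rest of the node): the unannounce call is skipped and a long sleep keeps the pod in the Terminating state.

```java
import java.time.Duration;

// Hypothetical repro tweak: skip unannouncement, then sleep so the pod sits in
// the Terminating state while peers still believe the node is announced.
public class ReproShutdownHook {
    public static void install(Runnable stopServices) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // announcer.unannounce(nodeId);  // deliberately disabled for the repro
            stopServices.run();
            try {
                // Stay alive well into the increased termination grace period.
                Thread.sleep(Duration.ofMinutes(10).toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
    }
}
```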
So there are two paths that get impacted:
- Query path: when processing on the pod stops abruptly, do the queries to this pod fail, or do they get stuck?
- Ingestion path: the lag increase in this case is probably a false alarm. The lag may increase on that pod, but the lag will be under control on the replica pod. My guess is that we end up reporting higher lag even if the replica is processing fast enough.
Right now, by default the broker distributes requests in the same proportion to the data nodes. There is a setting that lets you fan out requests such that the broker will pick the data node with the fewest in-flight requests. Though, probably the right way to deal with this is the broker detecting a high failure rate on a replica and blacklisting it for some time. That solves the problem for general failure scenarios. For this particular planned activity, when the node is marked NotReady, we could probably do something that triggers the graceful-termination code in the pod (one possible shape of this is sketched after this comment).
I am not sure how, though. It really depends on what control Kubernetes offers.
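One shape this could take, purely as a sketch: a small controller outside Druid that looks for nodes whose Ready condition is no longer True and deletes the Druid pods scheduled on them, so the replacements go through the normal announcement path instead of lingering as announced-but-dead replicas. The fabric8 Kubernetes client and the `app=druid` pod label below are assumptions, not something from this issue, and how much of the graceful-termination code actually runs still depends on why the node went NotReady.

```java
import io.fabric8.kubernetes.api.model.Node;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Sketch only: one pass of a controller that deletes Druid pods on NotReady nodes.
public class NotReadyNodeReaper {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            for (Node node : client.nodes().list().getItems()) {
                boolean ready = node.getStatus().getConditions().stream()
                        .anyMatch(c -> "Ready".equals(c.getType()) && "True".equals(c.getStatus()));
                if (ready) {
                    continue;
                }
                String nodeName = node.getMetadata().getName();
                // "app=druid" is an assumed label for the Druid data-node pods.
                for (Pod pod : client.pods().inAnyNamespace().withLabel("app", "druid").list().getItems()) {
                    if (nodeName.equals(pod.getSpec().getNodeName())) {
                        // Deleting the pod asks Kubernetes to run the normal
                        // termination sequence and reschedule a replacement.
                        client.pods()
                              .inNamespace(pod.getMetadata().getNamespace())
                              .withName(pod.getMetadata().getName())
                              .delete();
                    }
                }
            }
        }
    }
}
```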
> There is a setting that lets you fan out requests such that the broker will pick the data node with the fewest in-flight requests
I only see random and connection_count available as settings for query routing. Connection count wouldn't have helped us in the failure that we saw, as the number of connections was more or less the same on all the nodes.