
Major service degradation despite replication when a pod is terminated abruptly due to node-not-ready issues

Open rbankar7 opened this issue 6 months ago

Affected versions

  • All users of the Kubernetes extension are probably affected by this

Description

  • In the Druid Kubernetes extension, the data nodes have a lifecycle "announcement" stage whose stop handler is executed before termination (see the sketch after this list)
  • When a pod enters the Terminating state, this stage unannounces the pod
  • The effect is that the Druid master nodes and the brokers become aware of the termination and stop assigning tasks/segments to, or routing queries to, that pod
  • In the case of a node-not-ready issue, processing on the pod stops abruptly, so no unannouncement is made to the other nodes
  • The master nodes and brokers/routers continue to see that node, which results in high query latency and a monotonically increasing loadQueue count, or, in the case of an indexer, a monotonically increasing ingestion lag
  • This goes on until the issue is fixed (for the indexers) or the retention period has passed (for the historicals)
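
To make the mechanism concrete, here is a simplified, self-contained sketch (the class and method names are illustrative, not the extension's actual API): the unannouncement only happens if the lifecycle stop handler gets a chance to run, so a SIGKILL or an abrupt node failure leaves the announcement in place.

```java
// Illustrative stand-in for the announce/unannounce lifecycle described above;
// these names are not Druid's real classes.
public class AnnouncementLifecycle {

    /** Publishes the data node to service discovery so brokers and the coordinator route to it. */
    void announce(String nodeId) {
        System.out.println("announced " + nodeId);
    }

    /** Removes the node from discovery so no new queries, segments, or tasks are sent to it. */
    void unannounce(String nodeId) {
        System.out.println("unannounced " + nodeId);
    }

    public static void main(String[] args) {
        AnnouncementLifecycle lifecycle = new AnnouncementLifecycle();
        String nodeId = "historical-0";
        lifecycle.announce(nodeId);

        // A graceful shutdown (SIGTERM from the kubelet) runs this hook and unannounces the node.
        // An abrupt stop (SIGKILL, kubelet/node failure) never reaches it, so brokers and the
        // coordinator keep treating a dead pod as a live server.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> lifecycle.unannounce(nodeId)));
    }
}
```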

Reproducing the issue

  • On a test cluster we deployed Druid and deliberately disabled the unannouncing of the node at the end of the lifecycle (the gist of the change is sketched below)
  • We also added a sleep after stop() and increased the gracefulTerminationPeriod, which kept the pod in the Terminating state for a long time
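
The gist of that modification, as a hedged sketch (again not the actual extension code; the flag and sleep value here are made up for illustration):

```java
// Illustrative only: mimics the test modification described above, where the
// unannouncement is deliberately skipped and termination is artificially delayed.
public class BrokenShutdownRepro {
    static final boolean SKIP_UNANNOUNCE = true;    // simulate the abrupt-stop behaviour
    static final long EXTRA_SLEEP_MILLIS = 300_000; // keeps the pod Terminating; needs a
                                                    // matching terminationGracePeriodSeconds

    public static void main(String[] args) {
        System.out.println("announce historical-0");
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            if (!SKIP_UNANNOUNCE) {
                System.out.println("unannounce historical-0");
            }
            try {
                Thread.sleep(EXTRA_SLEEP_MILLIS);   // sleep after stop(), per the repro steps
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
    }
}
```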

rbankar7 avatar Jun 06 '25 19:06 rbankar7

So there are two paths that get impacted.

Query path - when the processing on the pod stops abruptly, do the queries for this pod fail or do they get stuck?

Ingestion path - the lag increase in this case is probably a false alarm. The lag may increase on that pod, but it will be under control on the replica pod. My guess is that we end up reporting a higher lag even if the replica is processing fast enough.
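
To illustrate that guess about lag reporting (a toy example only, not Druid's actual supervisor lag calculation): if the published lag aggregates the worst replica per partition, one stuck pod inflates the metric even though a healthy replica is keeping up.

```java
import java.util.List;
import java.util.Map;

public class LagAggregation {
    // partition -> lag reported by each replica task for that partition
    static long worstCaseLag(Map<Integer, List<Long>> replicaLag) {
        return replicaLag.values().stream()
                .mapToLong(lags -> lags.stream().mapToLong(Long::longValue).max().orElse(0L))
                .sum();
    }

    static long effectiveLag(Map<Integer, List<Long>> replicaLag) {
        return replicaLag.values().stream()
                .mapToLong(lags -> lags.stream().mapToLong(Long::longValue).min().orElse(0L))
                .sum();
    }

    public static void main(String[] args) {
        Map<Integer, List<Long>> lag = Map.of(
                0, List.of(5_000_000L, 1_200L),   // partition 0: stuck replica vs healthy replica
                1, List.of(900L, 1_100L));        // partition 1: both replicas healthy
        System.out.println("worst case (max over replicas): " + worstCaseLag(lag)); // 5_001_100
        System.out.println("effective  (min over replicas): " + effectiveLag(lag)); // 2_100
    }
}
```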

Right now, by default, the broker distributes requests in the same proportion to the data nodes. There is a setting that lets you fan out requests such that the broker picks the data node with the fewest in-flight requests. Though probably the right way to deal with this is the broker detecting a high failure rate on a replica and blacklisting it for some time. That solves the problem for general failure scenarios. For this particular planned activity, when the node is marked not ready, we could probably do something that triggers the graceful-termination code in the pod.

I am not sure how, though. It really depends on what control k8s offers.
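
One rough sketch of that direction, assuming the fabric8 Kubernetes Java client and a hypothetical `druid` namespace: a small external controller that looks for NotReady nodes and deletes the Druid pods scheduled on them, so that the normal graceful-termination path (and with it the unannouncement) is initiated via the API server. If the kubelet on that node is completely dead, the shutdown hook may still never run, so this is only a partial answer.

```java
import io.fabric8.kubernetes.api.model.Node;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class NotReadyPodEvictor {
    private static final String DRUID_NAMESPACE = "druid"; // hypothetical namespace

    public static void main(String[] args) throws InterruptedException {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            while (true) {
                for (Node node : client.nodes().list().getItems()) {
                    boolean notReady = node.getStatus().getConditions().stream()
                            .anyMatch(c -> "Ready".equals(c.getType()) && !"True".equals(c.getStatus()));
                    if (!notReady) {
                        continue;
                    }
                    String nodeName = node.getMetadata().getName();
                    // Delete the Druid pods scheduled on the unhealthy node so the controller
                    // reschedules them and the graceful-termination path runs where it still can.
                    for (Pod pod : client.pods().inNamespace(DRUID_NAMESPACE)
                            .withField("spec.nodeName", nodeName)
                            .list().getItems()) {
                        client.pods().inNamespace(DRUID_NAMESPACE)
                                .withName(pod.getMetadata().getName())
                                .delete();
                    }
                }
                Thread.sleep(30_000L); // poll interval; a watch on Node objects would also work
            }
        }
    }
}
```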

abhishekagarwal87 avatar Jun 16 '25 07:06 abhishekagarwal87

There is a setting that lets you fan out requests such that the broker picks the data node with the fewest in-flight requests

I only see random and connectionCount available as settings for the query routing. Connection count wouldn't have helped us in the failure that we saw, as the number of connections was more or less the same on all the nodes

rbankar7 avatar Jun 16 '25 09:06 rbankar7