Cluster Crashes When Distribution Key Is Too High
Describe the bug
The following change deployed successfully but crashed the entire Vespa cluster:
From:
<group distribution-key="1" name="group1">
<node distribution-key="124" hostalias="vespa10124"/>
</group>
To:
<group distribution-key="1" name="group1">
<node distribution-key="124001" hostalias="vespa10124"/>
<node distribution-key="124002" hostalias="vespa10124"/>
</group>
To Reproduce
Steps to reproduce the behavior: deploy nodes with high distribution keys such as 124001 and 124002.
Logs
[2024-04-18 17:13:42.140] INFO container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.MasterElectionHandler Cluster 'vespa1': 0 is new master candidate, but needs to wait before it can take over
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr java.lang.NullPointerException: Cannot invoke "com.yahoo.vespa.clustercontroller.core.NodeInfo.getRpcAddress()" because "node" is null
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.StateChangeHandler.handleNewRpcAddress(StateChangeHandler.java:222)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.handleNewRpcAddress(FleetController.java:337)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.rpc.SlobrokClient.updateCluster(SlobrokClient.java:144)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.lambda$resyncLocallyCachedState$15(FleetController.java:803)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.MetricUpdater.forWork(MetricUpdater.java:115)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.resyncLocallyCachedState(FleetController.java:803)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:521)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1031)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat java.base/java.lang.Thread.run(Thread.java:840)
[2024-04-18 17:13:42.172] ERROR container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.FleetController
Cluster 'vespa1': Fatal error killed fleet controller
exception=
java.lang.NullPointerException: Cannot invoke "com.yahoo.vespa.clustercontroller.core.NodeInfo.getRpcAddress()" because "node" is null
at com.yahoo.vespa.clustercontroller.core.StateChangeHandler.handleNewRpcAddress(StateChangeHandler.java:222)
at com.yahoo.vespa.clustercontroller.core.FleetController.handleNewRpcAddress(FleetController.java:337)
at com.yahoo.vespa.clustercontroller.core.rpc.SlobrokClient.updateCluster(SlobrokClient.java:144)
at com.yahoo.vespa.clustercontroller.core.FleetController.lambda$resyncLocallyCachedState$15(FleetController.java:803)
at com.yahoo.vespa.clustercontroller.core.MetricUpdater.forWork(MetricUpdater.java:115)
at com.yahoo.vespa.clustercontroller.core.FleetController.resyncLocallyCachedState(FleetController.java:803)
at com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:521)
at com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1031)
at java.base/java.lang.Thread.run(Thread.java:840)
Environment (please complete the following information):
- OS:
Linux version 4.18.0-372.9.1.el8.x86_64 ([email protected]) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Tue May 10 14:48:47 UTC 2022
- Infrastructure: self-hosted
- Vespa version: 8.320.68
@vekterli maybe we should consider better guidance on the consequences of high distribution keys (ref the Slack conversation). The distribution algorithm slows down a lot with these, so a configurable upper bound may be the better option, with a good error message at deploy time.
Yes, this particular use case (encoding host name patterns in the distribution keys) has been a recurring theme throughout the years. Doing so makes sense from an application modelling perspective, but makes the distribution algorithm give off blue smoke from burning CPU on generating pseudo-random numbers, and should therefore be discouraged.
As a start, we should certainly never allow deployments to pass validation when specifying distribution keys that exceed the internal type limits. Distribution keys are 16-bit integers internally, with UINT16_MAX treated as a special sentinel, so the valid distribution key range is [0, UINT16_MAX - 1].
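The range check described above can be sketched as a small deploy-time validation. This is a hypothetical illustration, not Vespa's actual config-model code; the class and method names are invented, but the range [0, 65534] follows directly from the 16-bit limit and the UINT16_MAX sentinel:

```java
import java.util.List;

// Hypothetical deploy-time validation sketch (not Vespa's actual code).
// Distribution keys must fit in an unsigned 16-bit integer, and
// UINT16_MAX (65535) is reserved as a sentinel, so valid keys are [0, 65534].
public class DistributionKeyValidator {
    static final int MAX_VALID_KEY = 0xFFFF - 1; // 65534

    static void validate(List<Integer> keys) {
        for (int key : keys) {
            if (key < 0 || key > MAX_VALID_KEY) {
                throw new IllegalArgumentException(
                        "Distribution key " + key
                        + " outside valid range [0, " + MAX_VALID_KEY + "]");
            }
        }
    }

    public static void main(String[] args) {
        validate(List.of(0, 124));           // the original config: passes
        try {
            validate(List.of(124001));       // the key from this issue: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing at deploy time with a message like this is what prevents the fleet controller from ever seeing a node index it cannot represent.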
It would be fairly trivial to create a new version of the distribution algorithm that is O(|configured nodes|) rather than O(highest configured distribution key), but doing so in a backwards compatible manner is Complicated™️ at the best of times, which is the reason why it hasn't been done yet...
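A simplified illustration of why the cost scales with the highest key rather than the node count: if scores are drawn from a single sequential PRNG, one draw per key index, then every index up to the highest configured key must be consumed to keep each node's score deterministic. This is a toy model, not Vespa's actual distribution algorithm:

```java
import java.util.Random;
import java.util.Set;

// Toy model (not Vespa's actual implementation) showing why a distribution
// scheme that draws one pseudo-random score per key index is
// O(highest configured distribution key), not O(number of configured nodes).
public class DistributionCost {
    static int idealNode(long bucketId, Set<Integer> configuredKeys, int highestKey) {
        Random rng = new Random(bucketId);
        int best = -1;
        double bestScore = -1;
        long draws = 0;
        // Every index must be drawn, even for keys that are not configured,
        // because each key's score depends on its position in the sequence.
        for (int key = 0; key <= highestKey; key++) {
            double score = rng.nextDouble();
            draws++;
            if (configuredKeys.contains(key) && score > bestScore) {
                bestScore = score;
                best = key;
            }
        }
        System.out.println(draws + " PRNG draws for " + configuredKeys.size() + " nodes");
        return best;
    }

    public static void main(String[] args) {
        // Two nodes, but keys 124001/124002 force 124003 draws per bucket.
        idealNode(42L, Set.of(124001, 124002), 124002);
    }
}
```

An O(|configured nodes|) version would instead seed a score per configured node independently, which is exactly the backwards-compatibility problem: changing how scores are derived reshuffles where existing buckets live.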
Two enhancements have been made to the application deployment logic to address this:
- Deployment will fail if any node has a distribution key outside the allowed range. This prevents the cluster from falling over due to broken invariants.
- For a cluster with valid distribution keys that are substantially higher than the number of configured nodes, a deployment performance warning is logged with a pointer to the distribution-key documentation.
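The second enhancement amounts to a heuristic comparing the highest key against the node count. The sketch below is hypothetical; the actual threshold and message wording in Vespa's config model may differ:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the deployment performance warning (the real
// threshold and wording in Vespa's config model may differ): warn when the
// highest distribution key greatly exceeds the number of configured nodes.
public class DistributionKeyWarning {
    static final int WARN_FACTOR = 100; // assumed threshold, for illustration

    static String check(List<Integer> keys) {
        int highest = Collections.max(keys);
        if (highest > (long) keys.size() * WARN_FACTOR) {
            return "Warning: highest distribution key " + highest
                    + " is much larger than the node count " + keys.size()
                    + "; see the distribution-key documentation";
        }
        return "OK";
    }

    public static void main(String[] args) {
        System.out.println(check(List.of(0, 1, 2)));        // dense keys: fine
        System.out.println(check(List.of(124001, 124002))); // sparse keys: warns
    }
}
```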
Consequently, I'm marking this issue as closed.