Cluster Crashes When Distribution Key Is Too High
Describe the bug
The following change deployed successfully but crashed the entire Vespa cluster:
From:
<group distribution-key="1" name="group1">
<node distribution-key="124" hostalias="vespa10124"/>
</group>
To:
<group distribution-key="1" name="group1">
<node distribution-key="124001" hostalias="vespa10124"/>
<node distribution-key="124002" hostalias="vespa10124"/>
</group>
To Reproduce
Steps to reproduce the behavior: deploy nodes with high distribution keys such as 124001 and 124002.
Logs
[2024-04-18 17:13:42.140] INFO container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.MasterElectionHandler Cluster 'vespa1': 0 is new master candidate, but needs to wait before it can take over
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr java.lang.NullPointerException: Cannot invoke "com.yahoo.vespa.clustercontroller.core.NodeInfo.getRpcAddress()" because "node" is null
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.StateChangeHandler.handleNewRpcAddress(StateChangeHandler.java:222)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.handleNewRpcAddress(FleetController.java:337)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.rpc.SlobrokClient.updateCluster(SlobrokClient.java:144)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.lambda$resyncLocallyCachedState$15(FleetController.java:803)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.MetricUpdater.forWork(MetricUpdater.java:115)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.resyncLocallyCachedState(FleetController.java:803)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:521)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1031)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat java.base/java.lang.Thread.run(Thread.java:840)
[2024-04-18 17:13:42.172] ERROR container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.FleetController
Cluster 'vespa1': Fatal error killed fleet controller
exception=
java.lang.NullPointerException: Cannot invoke "com.yahoo.vespa.clustercontroller.core.NodeInfo.getRpcAddress()" because "node" is null
at com.yahoo.vespa.clustercontroller.core.StateChangeHandler.handleNewRpcAddress(StateChangeHandler.java:222)
at com.yahoo.vespa.clustercontroller.core.FleetController.handleNewRpcAddress(FleetController.java:337)
at com.yahoo.vespa.clustercontroller.core.rpc.SlobrokClient.updateCluster(SlobrokClient.java:144)
at com.yahoo.vespa.clustercontroller.core.FleetController.lambda$resyncLocallyCachedState$15(FleetController.java:803)
at com.yahoo.vespa.clustercontroller.core.MetricUpdater.forWork(MetricUpdater.java:115)
at com.yahoo.vespa.clustercontroller.core.FleetController.resyncLocallyCachedState(FleetController.java:803)
at com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:521)
at com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1031)
at java.base/java.lang.Thread.run(Thread.java:840)
Environment (please complete the following information):
- OS:
Linux version 4.18.0-372.9.1.el8.x86_64 ([email protected]) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Tue May 10 14:48:47 UTC 2022
- Infrastructure: self-hosted
- Vespa version: 8.320.68
@vekterli maybe we should consider better guidance on the consequences of high distribution keys (ref the Slack conversation). The distribution algorithm slows down a lot with these, so a configurable upper bound may be the better option, with a good error message at deploy time.
Yes, this particular use case (encoding host name patterns in the distribution keys) has been a recurring theme throughout the years. Doing so makes sense from an application modelling perspective, but makes the distribution algorithm give off blue smoke from burning CPU on generating pseudo-random numbers, and should therefore be discouraged.
As a start, we should certainly never allow deployments to pass validation when specifying distribution keys that exceed the internal type limits. Distribution keys are 16-bit integers internally, with UINT16_MAX treated as a special sentinel, so the valid distribution key range is [0, UINT16_MAX - 1].
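The range check described above can be sketched as a small deploy-time validation. This is a hypothetical illustration, not Vespa's actual config-model code; the class and method names are invented, but the range [0, 65534] follows directly from the 16-bit limit and the UINT16_MAX sentinel:

```java
import java.util.List;

// Hypothetical deploy-time validation sketch (not Vespa's actual code).
// Distribution keys must fit in an unsigned 16-bit integer, and
// UINT16_MAX (65535) is reserved as a sentinel, so valid keys are [0, 65534].
public class DistributionKeyValidator {
    static final int MAX_VALID_KEY = 0xFFFF - 1; // 65534

    static void validate(List<Integer> keys) {
        for (int key : keys) {
            if (key < 0 || key > MAX_VALID_KEY) {
                throw new IllegalArgumentException(
                        "Distribution key " + key
                        + " outside valid range [0, " + MAX_VALID_KEY + "]");
            }
        }
    }

    public static void main(String[] args) {
        validate(List.of(0, 124));           // the original config: passes
        try {
            validate(List.of(124001));       // the key from this issue: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing at deploy time with a message like this is what prevents the fleet controller from ever seeing a node index it cannot represent.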
It would be fairly trivial to create a new version of the distribution algorithm that is O(|configured nodes|) rather than O(highest configured distribution key), but doing so in a backwards compatible manner is Complicated™️ at the best of times, which is the reason why it hasn't been done yet...
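A simplified illustration of why the cost scales with the highest key rather than the node count: if scores are drawn from a single sequential PRNG, one draw per key index, then every index up to the highest configured key must be consumed to keep each node's score deterministic. This is a toy model, not Vespa's actual distribution algorithm:

```java
import java.util.Random;
import java.util.Set;

// Toy model (not Vespa's actual implementation) showing why a distribution
// scheme that draws one pseudo-random score per key index is
// O(highest configured distribution key), not O(number of configured nodes).
public class DistributionCost {
    static int idealNode(long bucketId, Set<Integer> configuredKeys, int highestKey) {
        Random rng = new Random(bucketId);
        int best = -1;
        double bestScore = -1;
        long draws = 0;
        // Every index must be drawn, even for keys that are not configured,
        // because each key's score depends on its position in the sequence.
        for (int key = 0; key <= highestKey; key++) {
            double score = rng.nextDouble();
            draws++;
            if (configuredKeys.contains(key) && score > bestScore) {
                bestScore = score;
                best = key;
            }
        }
        System.out.println(draws + " PRNG draws for " + configuredKeys.size() + " nodes");
        return best;
    }

    public static void main(String[] args) {
        // Two nodes, but keys 124001/124002 force 124003 draws per bucket.
        idealNode(42L, Set.of(124001, 124002), 124002);
    }
}
```

An O(|configured nodes|) version would instead seed a score per configured node independently, which is exactly the backwards-compatibility problem: changing how scores are derived reshuffles where existing buckets live.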
Two enhancements have been made to the application deployment logic to address this:
- Deployment will fail if any node has a distribution key outside the allowed range. This prevents the cluster from falling over due to broken invariants.
- For a cluster with valid distribution keys that are substantially higher than the number of configured nodes, a deployment performance warning is logged with a pointer to the distribution-key documentation.
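The second enhancement amounts to a heuristic comparing the highest key against the node count. The sketch below is hypothetical; the actual threshold and message wording in Vespa's config model may differ:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the deployment performance warning (the real
// threshold and wording in Vespa's config model may differ): warn when the
// highest distribution key greatly exceeds the number of configured nodes.
public class DistributionKeyWarning {
    static final int WARN_FACTOR = 100; // assumed threshold, for illustration

    static String check(List<Integer> keys) {
        int highest = Collections.max(keys);
        if (highest > (long) keys.size() * WARN_FACTOR) {
            return "Warning: highest distribution key " + highest
                    + " is much larger than the node count " + keys.size()
                    + "; see the distribution-key documentation";
        }
        return "OK";
    }

    public static void main(String[] args) {
        System.out.println(check(List.of(0, 1, 2)));        // dense keys: fine
        System.out.println(check(List.of(124001, 124002))); // sparse keys: warns
    }
}
```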
Consequently, I'm marking this issue as closed.