elasticsearch
elasticsearch copied to clipboard
Fixing remote master stability request when there has never been an elected master
This fixes an edge case in the master stability polling code from #89014. If there has not been an elected master node for the entire life of a non-master-eligible node, then clusterChanged()
will have never been called on that node, so beginPollingRemoteMasterStabilityDiagnostic()
will have never been called. And even though the node might know of some master-eligible nodes, it will never have requested diagnostic information from them. This PR adds a call to beginPollingRemoteMasterStabilityDiagnostic
in CoordinationDiagnosticsService
's constructor to cover this edge case. In almost all cases, clusterChanged()
will be called within 10 seconds so the polling will never occur. However if there is no master node then there will be no cluster changed events, and clusterChanged()
will not be called, and the results of the polling will likely be useful.
This PR has several possibly controversial pieces of code. I'm listing them here with some discussion:
- Because there is now a call to
beginPollingRemoteMasterStabilityDiagnostic()
in the constructor,beginPollingRemoteMasterStabilityDiagnostic()
is no longer solely called from the cluster change thread. However, this call happens before the object is registered as a cluster service listener, so there is no new thread safety concern. - Because there is now a call to
beginPollingRemoteMasterStabilityDiagnostic()
in the constructor, we have to explicitly switch to the system context so that the various transport requests work in secure mode. - When we're in the constructor, we don't actually know yet whether we're a master eligible node or not, so we kick off
beginPollingRemoteMasterStabilityDiagnostic()
for all node types, including master-eligible nodes. This will be fairly harmless for master eligible nodes though. In the worst case, they'll retrieve some information that they'll never use. This explains whyclusterChanged()
now cancels polling even if we are on a master eligible node. - It is now possible that we use
clusterService.state()
before it is ready when we're trying to get the list of master-eligible peers. In production mode this method returns null, so we can check that before using it. If assertions are enabled in the JVM, just calling that method throws anAssertionError
. I'm currently catching that with the assumption that it is harmless because there does not seem to be a way around it (without even further complicating code). - It is now possible that we call
transportService.sendRequest()
before the transport service is ready. This happens if the server is initializing unusually slowly (i.e. it takes more than 10 seconds to complete theNode
constructor) and if assertions are enabled. I don't see a way around this without further complicating the code, so I'm catchingAssertionError
and moving on, with the assumption that it will work 10 seconds later when it runs again. I'm also catching and storingException
, which I think I should have been doing before anyway.
Pinging @elastic/es-data-management (Team:Data Management)