elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

Fixing remote master stability request when there has never been an elected master

Open masseyke opened this issue 2 years ago • 1 comments

This fixes an edge case in the master stability polling code from #89014. If there has not been an elected master node for the entire life of a non-master-eligible node, then clusterChanged() will have never been called on that node, so beginPollingRemoteMasterStabilityDiagnostic() will have never been called. And even though the node might know of some master-eligible nodes, it will never have requested diagnostic information from them. This PR adds a call to beginPollingRemoteMasterStabilityDiagnostic in CoordinationDiagnosticsService's constructor to cover this edge case. In almost all cases, clusterChanged() will be called within 10 seconds so the polling will never occur. However if there is no master node then there will be no cluster changed events, and clusterChanged() will not be called, and the results of the polling will likely be useful. This PR has several possibly controversial pieces of code. I'm listing them here with some discussion:

  1. Because there is now a call to beginPollingRemoteMasterStabilityDiagnostic() in the constructor, beginPollingRemoteMasterStabilityDiagnostic() is no longer solely called from the cluster change thread. However, this call happens before the object is registered as a cluster service listener, so there is no new thread safety concern.
  2. Because there is now a call to beginPollingRemoteMasterStabilityDiagnostic() in the constructor, we have to explicitly switch to the system context so that the various transport requests work in secure mode.
  3. When we're in the constructor, we don't actually know yet whether we're a master eligible node or not, so we kick off beginPollingRemoteMasterStabilityDiagnostic() for all node types, including master-eligible nodes. This will be fairly harmless for master eligible nodes though. In the worst case, they'll retrieve some information that they'll never use. This explains why clusterChanged() now cancels polling even if we are on a master eligible node.
  4. It is now possible that we use clusterService.state() before it is ready when we're trying to get the list of master-eligible peers. In production mode this method returns null, so we can check that before using it. If assertions are enabled in the JVM, just calling that method throws an AssertionError. I'm currently catching that with the assumption that it is harmless because there does not seem to be a way around it (without even further complicating code).
  5. It is now possible that we call transportService.sendRequest() before the transport service is ready. This happens if the server is initializing unusually slowly (i.e. it takes more than 10 seconds to complete the Node constructor) and if assertions are enabled. I don't see a way around this without further complicating the code, so I'm catching AssertionError and moving on, with the assumption that it will work 10 seconds later when it runs again. I'm also catching and storing Exception, which I think I should have been doing before anyway.

masseyke avatar Aug 09 '22 15:08 masseyke

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine avatar Aug 09 '22 15:08 elasticsearchmachine