elasticsearch
Users are confused by disconnected remote clusters
If a local cluster is connected to a remote cluster, and the remote cluster goes down and then recovers, the local cluster's Remote cluster info API shows the remote cluster as disconnected until:
- Either a node disconnects and some other node is available at the time of disconnection
- Or there are no connected nodes and a cross-cluster request is sent from the local cluster
It seems users expect to use the Kibana Remote Clusters UI as an overview of the status of remote clusters, which is an unmet expectation due to this behavior. We can attempt to clarify/address the behavior in the UI (https://github.com/elastic/kibana/issues/116961), but I wonder if there's a way to address this on a fundamental level? Can a local cluster and remote cluster "reconnect" as they are brought offline and back online without requiring the above events?
CC @DaveCTurner
Pinging @elastic/es-distributed (Team:Distributed)
We briefly took a look at this with @tlrx. We saw there is a `cluster.remote.<cluster_alias>.transport.ping_schedule` setting that defaults to `-1` because TCP keep-alives are used instead. But even this setting might not detect when a remote cluster becomes connectable again after some time. We may need to investigate whether we can implement some periodic connection retry mechanism.
I'm also seeing this issue among our users. Progress here and on the linked Kibana issue would be welcome!
> We may need to investigate if we can implement some periodic connection retry method.
I wonder if we should call `RemoteClusterConnection#ensureConnected` during the execution of `GET _remote/info`. It'd need to time out reasonably quickly (30s? 10s? 5s?) in case the remote cluster is still unresponsive, but that would make the response less surprising in the case that the remote cluster is available again. Something like this perhaps?
```java
public void getRemoteConnectionInfos(ActionListener<List<RemoteConnectionInfo>> listener, TimeValue timeout) {
    final ListenableFuture<Void> completionListener = new ListenableFuture<>();
    final Runnable onTimeout;
    try (var refs = new RefCountingRunnable(() -> completionListener.onResponse(null))) {
        final List<ActionListener<Void>> connectedListeners = new ArrayList<>(remoteClusters.size());
        final List<RemoteConnectionInfo> results = Collections.synchronizedList(new ArrayList<>(remoteClusters.size()));
        completionListener.addListener(listener.map(ignored -> List.copyOf(results)));
        for (final var remoteClusterConnection : remoteClusters.values()) {
            final var ref = refs.acquire();
            final var connectedListener = ActionListener.notifyOnce(ActionListener.<Void>running(() -> {
                try (ref) {
                    results.add(remoteClusterConnection.getConnectionInfo());
                }
            }));
            connectedListeners.add(connectedListener);
            remoteClusterConnection.ensureConnected(connectedListener);
        }
        onTimeout = () -> ActionListener.onResponse(connectedListeners, null);
    }
    if (completionListener.isDone() == false) {
        ActionListener.run(completionListener, l -> {
            final var cancellable = transportService.getThreadPool().schedule(onTimeout, timeout, ThreadPool.Names.SAME);
            l.addListener(ActionListener.running(cancellable::cancel));
        });
    }
}
```
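The Java sketch above races a scheduled timeout against the per-cluster `ensureConnected` callbacks: each cluster's listener fires at most once, and when the timer goes off any still-pending listeners are completed so the API responds within the deadline. The same pattern can be sketched generically in Python (all names here are hypothetical; this is an illustration of the concurrency pattern, not Elasticsearch code):

```python
import threading

def gather_with_timeout(tasks, timeout_s):
    """Run each task (a callable taking a `done` callback) and collect its
    result, but force completion after timeout_s so an unresponsive task
    cannot block the overall response."""
    results = {}
    lock = threading.Lock()
    pending = set(range(len(tasks)))
    finished = threading.Event()

    def complete(i, value):
        with lock:
            if i not in pending:
                return            # notify-once: ignore late or duplicate completions
            pending.discard(i)
            results[i] = value
            if not pending:
                finished.set()

    for i, task in enumerate(tasks):
        task(lambda value, i=i: complete(i, value))

    if not finished.wait(timeout_s):   # the timer fired first: flush the stragglers
        with lock:
            for i in list(pending):
                results[i] = "timed out"
            pending.clear()
            finished.set()
    return [results[i] for i in range(len(tasks))]

# Demo: one "cluster" answers immediately, one never answers.
statuses = gather_with_timeout(
    [lambda done: done("connected"), lambda done: None],
    timeout_s=0.05,
)
```

The "notify-once" guard mirrors `ActionListener.notifyOnce` in the Java version: a listener that is completed by the timeout must ignore a later completion from a slow `ensureConnected` call.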
A customer mentioned this today in a call: "Is there a CCS keep-alive built into the product? For instance, we ran into issues where the networking changed but CCS status still showed healthy. Which indicates the status isn't an active status. So two things...
- Monitor TCP/IP keep-alives
- Build in an application-level poll mechanism to ensure status is accurate"
> the networking changed but CCS status still showed healthy
Not really sure what this means. Elasticsearch already sends (and therefore monitors) TCP keepalives by default, so it will detect a lost connection reasonably quickly, and it will also auto-reconnect on a disconnect, so we would expect it to notice a change in the network and remain healthy.
There is also an application-level polling mechanism, but the facilities provided by the OS (especially TCP keepalives and proper retransmission config) are always going to do a better job than anything we can do in userspace.
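For reference, the OS-level TCP keepalives mentioned here are ordinary socket options. A minimal Python sketch of enabling them on a socket (the per-socket probe-timing constants are Linux-specific, hence the guard; the timing values are illustrative, not Elasticsearch defaults):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Turn on TCP keepalive probes for this connection.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# On Linux the probe timing can be tuned per socket; these constants
# may be absent on other platforms, hence the hasattr guard.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before the first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before the peer is declared dead
```

With these in place the kernel itself detects a dead peer and fails the connection, which is what lets Elasticsearch notice a lost remote connection without any application-level traffic.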