elasticsearch Users are confused by disconnected remote clusters

If a local cluster is connected to a remote cluster, and the remote cluster goes down and is then recovered, the local cluster's remote cluster info API shows the remote cluster as disconnected until:

Either a node disconnects and some other node is available at the time of disconnection
Or there are no connected nodes and a cross-cluster request is sent from the local cluster

It seems users expect to use the Kibana Remote Clusters UI as an overview of the status of remote clusters, which is an unmet expectation due to this behavior. We can attempt to clarify/address the behavior in the UI (https://github.com/elastic/kibana/issues/116961), but I wonder if there's a way to address this on a fundamental level? Can a local cluster and remote cluster "reconnect" as they are brought offline and back online without requiring the above events?

CC @DaveCTurner

Nov 01 '21 15:11 cjcenizal

Pinging @elastic/es-distributed (Team:Distributed)

Nov 01 '21 15:11 elasticmachine

We briefly took a look at this with @tlrx. We saw there is a cluster.remote.<cluster_alias>.transport.ping_schedule setting that defaults to -1 (disabled) because TCP keep-alives are used instead. But even with this setting enabled, the local cluster might not detect that a remote cluster has become connectable again after some time. We may need to investigate whether we can implement some periodic connection retry method.
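
To make the idea of a periodic connection retry concrete, here is a minimal plain-Java sketch. It is not Elasticsearch code: PeriodicReconnector, tryConnect and the retry delay are hypothetical names chosen for illustration, and a real implementation would presumably hook into the existing remote cluster connection logic instead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch only: none of these names are Elasticsearch APIs.
final class PeriodicReconnector implements AutoCloseable {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final CountDownLatch connectedLatch = new CountDownLatch(1);
    private final BooleanSupplier tryConnect; // returns true once the remote is reachable

    PeriodicReconnector(BooleanSupplier tryConnect) {
        this.tryConnect = tryConnect;
    }

    // Make the first attempt immediately, then retry every delayMillis until one succeeds.
    void start(long delayMillis) {
        schedule(0, delayMillis);
    }

    private void schedule(long initialDelayMillis, long retryDelayMillis) {
        try {
            scheduler.schedule(() -> attempt(retryDelayMillis), initialDelayMillis, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException e) {
            // The reconnector was closed; stop retrying.
        }
    }

    private void attempt(long retryDelayMillis) {
        boolean ok;
        try {
            ok = tryConnect.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false; // treat a failed attempt as "still disconnected"
        }
        if (ok) {
            connectedLatch.countDown();
        } else {
            schedule(retryDelayMillis, retryDelayMillis);
        }
    }

    boolean isConnected() {
        return connectedLatch.getCount() == 0;
    }

    // Block for up to timeoutMillis waiting for a successful attempt.
    boolean awaitConnected(long timeoutMillis) {
        try {
            return connectedLatch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The sketch stops retrying after the first success. Detecting a later disconnect and restarting the loop is left out, since the existing disconnect handling would be the natural trigger for that.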

Aug 09 '22 09:08 kingherc

I'm also seeing this issue among our users. Progress here and on the linked Kibana issue would be welcome!

Sep 09 '22 20:09 mbarretta

We may need to investigate if we can implement some periodic connection retry method.

I wonder if we should call RemoteClusterConnection#ensureConnected during the execution of GET _remote/info. It'd need to time out reasonably quickly (30s? 10s? 5s?) in case the remote cluster is still unresponsive, but that would make the response less surprising in the case that the remote cluster is available again. Something like this perhaps?

public void getRemoteConnectionInfos(ActionListener<List<RemoteConnectionInfo>> listener, TimeValue timeout) {
    // Completes once every ref acquired below has been released.
    final ListenableFuture<Void> completionListener = new ListenableFuture<>();
    final Runnable onTimeout;
    try (var refs = new RefCountingRunnable(() -> completionListener.onResponse(null))) {
        final List<ActionListener<Void>> connectedListeners = new ArrayList<>(remoteClusters.size());
        final List<RemoteConnectionInfo> results = Collections.synchronizedList(new ArrayList<>(remoteClusters.size()));
        completionListener.addListener(listener.map(ignored -> List.copyOf(results)));
        for (final var remoteClusterConnection : remoteClusters.values()) {
            final var ref = refs.acquire();
            // notifyOnce: record each cluster's info exactly once, whether the connection
            // attempt finishes (successfully or not) or the timeout fires first.
            final var connectedListener = ActionListener.notifyOnce(ActionListener.<Void>running(() -> {
                try (ref) {
                    results.add(remoteClusterConnection.getConnectionInfo());
                }
            }));
            connectedListeners.add(connectedListener);
            remoteClusterConnection.ensureConnected(connectedListener);
        }
        // On timeout, complete any listeners that are still waiting so the response reports the current state.
        onTimeout = () -> ActionListener.onResponse(connectedListeners, null);
    }
    // Only schedule the timeout if some connection attempts are still pending, and cancel it once they all complete.
    if (completionListener.isDone() == false) {
        ActionListener.run(completionListener, l -> {
            final var cancellable = transportService.getThreadPool().schedule(onTimeout, timeout, ThreadPool.Names.SAME);
            l.addListener(ActionListener.running(cancellable::cancel));
        });
    }
}

Mar 11 '23 10:03 DaveCTurner

A customer mentioned this today in a call: "Is there a CCS keep-alive built into the product? For instance, we ran into issues where the networking changed but CCS status still showed healthy. Which indicates the status isn't an active status. So two things...

Monitor TCP/IP keep-alives
Build in a application level poll mechanism to ensure status is accurate"

Aug 31 '23 18:08 tylerperk

the networking changed but CCS status still showed healthy

Not really sure what this means. Elasticsearch already sends (and therefore monitors) TCP keepalives by default, so it will detect a lost connection reasonably quickly, and it will also auto-reconnect on a disconnect, so we would expect it to notice a change in the network and remain healthy.

There is also an application-level polling mechanism but the facilities provided by the OS (especially TCP keepalives and proper retransmission config) are always going to do a better job than anything we can do in userspace.
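
For readers unfamiliar with what "TCP keepalives" means at the socket level, the following is a small plain-Java sketch, not Elasticsearch's own transport code. KeepAliveExample and the 60-second idle value are assumptions for illustration. SO_KEEPALIVE asks the OS to probe an idle connection, and the TCP_KEEPIDLE option (Java 11+, only on platforms that support it) controls how long the connection must be idle before probing starts.

```java
import java.io.IOException;
import java.net.Socket;
import jdk.net.ExtendedSocketOptions;

// Hypothetical helper for illustration; not part of Elasticsearch.
final class KeepAliveExample {
    // Returns an unconnected socket with OS-level keepalive probing enabled.
    static Socket newKeepAliveSocket() throws IOException {
        Socket socket = new Socket();
        socket.setKeepAlive(true); // SO_KEEPALIVE
        // The idle time before the first probe is platform-specific, so only set it when supported.
        if (socket.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPIDLE)) {
            socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 60); // seconds; assumed value
        }
        return socket;
    }
}
```

With these options the kernel, rather than application code, notices a dead peer, which is the division of labour described above.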

Aug 31 '23 20:08 DaveCTurner
