
Users are confused by disconnected remote clusters

Open cjcenizal opened this issue 3 years ago • 2 comments

If a local cluster is connected to a remote cluster, and the remote cluster goes down and then recovers, the local cluster's Remote cluster info API keeps showing the remote cluster as disconnected until:

  • Either a node disconnects and some other node is available at the time of disconnection
  • Or there are no connected nodes and a cross-cluster request is sent from the local cluster

It seems users expect to use the Kibana Remote Clusters UI as an overview of the status of remote clusters, which is an unmet expectation due to this behavior. We can attempt to clarify/address the behavior in the UI (https://github.com/elastic/kibana/issues/116961), but I wonder if there's a way to address this on a fundamental level? Can a local cluster and remote cluster "reconnect" as they are brought offline and back online without requiring the above events?

CC @DaveCTurner

cjcenizal avatar Nov 01 '21 15:11 cjcenizal

Pinging @elastic/es-distributed (Team:Distributed)

elasticmachine avatar Nov 01 '21 15:11 elasticmachine

We briefly took a look at this with @tlrx. We saw there is a cluster.remote.<cluster_alias>.transport.ping_schedule setting that defaults to -1 because of TCP keep-alives. But even this setting might not detect when a remote cluster becomes connectable again after some time. We may need to investigate whether we can implement some kind of periodic connection retry.
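As a rough illustration of what a periodic retry could look like, here is a minimal standalone sketch. It is not Elasticsearch code: `PeriodicReconnector`, `runOnce`, `markDisconnected` and the `tryConnect` callback are all hypothetical names, and the real transport layer would need to handle per-node connections, backoff and shutdown more carefully.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of Elasticsearch): retries a connection attempt on a schedule. */
final class PeriodicReconnector {
    private final BooleanSupplier tryConnect; // assumed callback: returns true if the connection succeeded
    private final AtomicBoolean connected = new AtomicBoolean(false);

    PeriodicReconnector(BooleanSupplier tryConnect) {
        this.tryConnect = tryConnect;
    }

    /** One retry step: only attempts to connect while we believe we are disconnected. */
    void runOnce() {
        if (connected.get() == false) {
            try {
                connected.set(tryConnect.getAsBoolean());
            } catch (RuntimeException e) {
                // Treat a failed attempt as "still disconnected" and try again on the next tick.
            }
        }
    }

    /** Called when an existing connection is observed to have dropped. */
    void markDisconnected() {
        connected.set(false);
    }

    boolean isConnected() {
        return connected.get();
    }

    /** Runs runOnce() repeatedly, waiting intervalMillis between the end of one run and the next. */
    ScheduledFuture<?> schedule(ScheduledExecutorService scheduler, long intervalMillis) {
        return scheduler.scheduleWithFixedDelay(this::runOnce, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }
}
```

With something along these lines, the reported status would converge on the real state within one retry interval after the remote cluster comes back, without waiting for a cross-cluster request to trigger the reconnect.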

kingherc avatar Aug 09 '22 09:08 kingherc

I'm also seeing this issue among our users. Progress here and on the linked Kibana issue would be welcome!

mbarretta avatar Sep 09 '22 20:09 mbarretta

> We may need to investigate if we can implement some periodic connection retry method.

I wonder if we should call RemoteClusterConnection#ensureConnected during the execution of GET _remote/info. It'd need to time out reasonably quickly (30s? 10s? 5s?) in case the remote cluster is still unresponsive, but that would make the response less surprising in the case that the remote cluster is available again. Something like this perhaps?

public void getRemoteConnectionInfos(ActionListener<List<RemoteConnectionInfo>> listener, TimeValue timeout) {
    // Completes once every remote cluster has reported its connection info.
    final ListenableFuture<Void> completionListener = new ListenableFuture<>();
    final Runnable onTimeout;
    try (var refs = new RefCountingRunnable(() -> completionListener.onResponse(null))) {
        final List<ActionListener<Void>> connectedListeners = new ArrayList<>(remoteClusters.size());
        final List<RemoteConnectionInfo> results = Collections.synchronizedList(new ArrayList<>(remoteClusters.size()));
        completionListener.addListener(listener.map(ignored -> List.copyOf(results)));
        for (final var remoteClusterConnection : remoteClusters.values()) {
            final var ref = refs.acquire();
            // Record the connection info whether ensureConnected succeeds or fails, and do so at most once.
            final var connectedListener = ActionListener.notifyOnce(ActionListener.<Void>running(() -> {
                try (ref) {
                    results.add(remoteClusterConnection.getConnectionInfo());
                }
            }));
            connectedListeners.add(connectedListener);
            remoteClusterConnection.ensureConnected(connectedListener);
        }
        // On timeout, complete any listeners that are still waiting so they report the current state.
        onTimeout = () -> ActionListener.onResponse(connectedListeners, null);
    }
    if (completionListener.isDone() == false) {
        // Some connection attempts are still in flight: schedule the timeout and cancel it on completion.
        ActionListener.run(completionListener, l -> {
            final var cancellable = transportService.getThreadPool().schedule(onTimeout, timeout, ThreadPool.Names.SAME);
            l.addListener(ActionListener.running(cancellable::cancel));
        });
    }
}
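For readers less familiar with the Elasticsearch listener utilities, the same idea (wait a bounded time for the connection attempts, then report whatever state each remote is in) can be sketched with plain JDK classes. This is a blocking simplification rather than the asynchronous approach above, and `BoundedConnectionCheck`, `collectStatus` and the alias-to-future map are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical standalone illustration, not Elasticsearch code. */
final class BoundedConnectionCheck {
    /**
     * Waits up to timeoutMillis for all pending connection attempts, then reports
     * per alias whether the attempt succeeded. Attempts that failed or are still
     * pending when the timeout expires are reported as not connected.
     */
    static Map<String, Boolean> collectStatus(Map<String, CompletableFuture<Void>> attempts, long timeoutMillis) {
        CompletableFuture<Void> all = CompletableFuture.allOf(attempts.values().toArray(new CompletableFuture[0]));
        try {
            all.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // Timed out or at least one attempt failed: fall through and report the current state.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        Map<String, Boolean> status = new LinkedHashMap<>();
        attempts.forEach((alias, attempt) -> status.put(alias, attempt.isDone() && !attempt.isCompletedExceptionally()));
        return status;
    }
}
```

The key property is the same as in the proposal: a remote that is still unresponsive only delays the response by the timeout, while a remote that has come back is reported as connected.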

DaveCTurner avatar Mar 11 '23 10:03 DaveCTurner

A customer mentioned this today in a call: "Is there a CCS keep-alive built into the product? For instance, we ran into issues where the networking changed but CCS status still showed healthy. Which indicates the status isn't an active status. So two things...

  1. Monitor TCP/IP keep-alives
  2. Build in an application-level poll mechanism to ensure status is accurate"

tylerperk avatar Aug 31 '23 18:08 tylerperk

> the networking changed but CCS status still showed healthy

Not really sure what this means. Elasticsearch already sends (and therefore monitors) TCP keepalives by default, so it will detect a lost connection reasonably quickly, and it will also auto-reconnect on a disconnect, so we would expect it to notice a change in the network and remain healthy.

There is also an application-level polling mechanism but the facilities provided by the OS (especially TCP keepalives and proper retransmission config) are always going to do a better job than anything we can do in userspace.
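To make the OS-level mechanism concrete, this is roughly how any JVM application asks the kernel to send TCP keepalives on a socket. It is a generic sketch using `java.net.Socket`, not the code Elasticsearch uses to configure its transport connections, and `KeepAliveExample` is a hypothetical name. How often probes are sent is then governed by the operating system's keepalive settings (for example `net.ipv4.tcp_keepalive_time` on Linux).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

/** Hypothetical illustration of enabling SO_KEEPALIVE; not taken from Elasticsearch. */
final class KeepAliveExample {
    /** Enables SO_KEEPALIVE on a fresh, unconnected socket and reads the option back. */
    static boolean keepAliveEnabledOnNewSocket() {
        try (Socket socket = new Socket()) {
            socket.setKeepAlive(true); // ask the OS to probe idle connections
            return socket.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```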

DaveCTurner avatar Aug 31 '23 20:08 DaveCTurner