
discovery cluster connection with multiple DNS entry does not work if first node is down

Open innocentiv opened this issue 9 months ago • 7 comments

πŸ› Current behavior

I am using the latest version of the KurrentDB Node client (1.0.1) to connect to a cluster of 3 EventStoreDB machines (v23.10).

I resolve the cluster using this connection string:

kurrentdb+discover://user:password@01.com:2113,02.com:2113,03.com:2113

If 02 or 03 is down, the client connects perfectly fine. If 01 is down, I cannot connect and get this error:

[Error [UnavailableError]: UnavailableError] { metadata: {} }

This is caused by the Node.js client using only the first host for discovery:

https://github.com/kurrent-io/KurrentDB-Client-NodeJS/blob/master/packages/db-client/src/Client/index.ts

    if (options.dnsDiscover) {
      const [discover] = options.hosts;

      if (options.hosts.length > 1) {
        debug.connection(
          `More than one address provided for discovery. Using first: ${discover.address}:${discover.port}.`
        );
      }

      return new Client(
        rustClient,
        {
          discover,
          nodePreference: options.nodePreference,
          discoveryInterval: options.discoveryInterval,
          gossipTimeout: options.gossipTimeout,
          maxDiscoverAttempts: options.maxDiscoverAttempts,
          throwOnAppendFailure: options.throwOnAppendFailure,
          keepAliveInterval: options.keepAliveInterval,
          keepAliveTimeout: options.keepAliveTimeout,
          defaultDeadline: options.defaultDeadline,
          connectionName: options.connectionName,
        },
        channelCredentials,
        options.defaultCredentials
      );
    }

According to a discussion in the Kurrent Discord general channel, this is a bug in the client. I would like to better understand the correct approach to maintaining availability in a 2n+1 cluster when the first node is down. This is also necessary to enable incremental upgrades of a running cluster.
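The behavior being asked for can be sketched as a loop over all seeds instead of taking only the first one. This is a minimal, self-contained sketch, not the client's actual code; `discoverViaSeeds` and `fetchGossip` are hypothetical names, and the gossip call is injected so the example stands on its own:

```typescript
type Endpoint = { address: string; port: number };

// Hypothetical multi-seed discovery: try each seed in order and return the
// cluster members reported by the first seed that answers. `fetchGossip`
// stands in for the real gossip request so the sketch is self-contained.
async function discoverViaSeeds(
  seeds: Endpoint[],
  fetchGossip: (seed: Endpoint) => Promise<Endpoint[]>
): Promise<Endpoint[]> {
  let lastError: unknown;
  for (const seed of seeds) {
    try {
      return await fetchGossip(seed); // first reachable seed wins
    } catch (err) {
      lastError = err; // this seed is down; fall through to the next one
    }
  }
  throw new Error(`All discovery seeds failed: ${String(lastError)}`);
}
```

With this shape, bringing down 01.com would only cost one failed gossip attempt instead of failing the whole connection.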

🔍 Steps to reproduce

  • set up a cluster of 3 nodes
  • add DNS entries for each node
  • set up a client to connect to the 3 nodes via discovery
  • bring down the first node
  • the client is unable to connect to the cluster

Reproducible link

https://codesandbox.io/p/devbox/w5v9md

💭 Expected behavior

When the first node in the connection string is down and the other 2 nodes are up, I can still connect to the cluster.

Package version

KurrentDB-NodeJS-Client 1.0.1

KurrentDB Version

EventStoreDB 23.10

Connection string

kurrentdb+discover://user:password@01.com:2113,02.com:2113,03.com:2113

☁️ Deployment Environment

Multi-node cluster (Cloud)

Other Deployment Details

No response

Operating system

No response

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

innocentiv avatar May 28 '25 12:05 innocentiv

Hey @innocentiv, we will remediate this as soon as we can. The code always assumed that a cluster DNS name was used, so it picked the first item in the host list. It should throw an error instead of just logging a debug message. However, we are currently working on improving the developer experience, which will probably involve a breaking change to clarify and better document this behavior.
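The stricter behavior described here (throwing instead of debug-logging) could look roughly like this; a sketch only, with a hypothetical helper name, not the actual patch:

```typescript
type Host = { address: string; port: number };

// Hypothetical guard sketching the stricter behavior: with +discover the
// client expects one cluster DNS entry, so multiple hosts become a hard
// error instead of a debug log that silently drops all but the first.
function assertSingleDiscoveryHost(hosts: Host[]): Host {
  if (hosts.length !== 1) {
    throw new Error(
      `+discover expects a single cluster DNS entry, got ${hosts.length} hosts`
    );
  }
  return hosts[0];
}
```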

w1am avatar May 30 '25 13:05 w1am

@w1am I do not think that is actually a valid workaround. What I am trying to solve is having the client able to connect to a cluster even if one of the nodes is down. Currently only the first node of the string is used for discovery, and if I use a round-robin DNS configuration then only one of the nodes will be randomly used for discovery. This is indeed better, but not a good solution.

It is an acceptable workaround for cluster upgrade and maintenance: I can remove the node from DNS, wait for DNS propagation and cache expiry, then replace the node that is down and restore the DNS configuration.

It is not acceptable for availability, as it means my client may still be unable to connect to the cluster.

Am I mistaken in my assessment?

innocentiv avatar Jun 02 '25 10:06 innocentiv

FWIW, here is a similar discussion (https://github.com/kurrent-io/KurrentDB-Client-Java/issues/273) and resolution (https://github.com/kurrent-io/KurrentDB-Client-Java/pull/288) in the context of the Java client, which may also be relevant here.

lbodor avatar Jun 02 '25 23:06 lbodor

@innocentiv With your current setup, using +discover alongside multiple gossip seeds is ineffective. In the current implementation, it's best to remove +discover from your connection string. When multiple gRPC targets are set, the client will query each node's Gossip API to get cluster info, then pick a node based on the URI's node preference.
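Concretely, for the setup in this issue, the suggestion amounts to switching from the first form below to the second (credentials are placeholders, and `cluster.example.com` is a stand-in for a round-robin DNS name, not a host from the issue):

```
# discovery via a single cluster DNS name (what +discover is designed for)
kurrentdb+discover://user:password@cluster.example.com:2113

# multiple gossip seeds, no +discover (works when any one seed is up)
kurrentdb://user:password@01.com:2113,02.com:2113,03.com:2113
```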

Efforts are underway to harmonize the behavior of discovery and seed configs.

w1am avatar Jun 03 '25 15:06 w1am

@w1am I followed the documentation for Kurrent Client:

The KurrentDB connection string supports two schemas: kurrentdb:// for connecting to a single-node server, and kurrentdb+discover:// for connecting to a multi-node cluster. The difference between the two schemas is that when using kurrentdb://, the client will connect directly to the node; with kurrentdb+discover:// schema the client will use the gossip protocol to retrieve the cluster information and choose the right node to connect to. Since version 22.10, ESDB supports gossip on single-node deployments, so kurrentdb+discover:// schema can be used for connecting to any topology.

But if what you say is true (and the tests I have done so far confirm it), then for reliability in a cluster environment it is actually better to use kurrentdb:// with multiple gossip seeds than kurrentdb+discover:// with a round-robin DNS cluster configuration. Is that correct?

innocentiv avatar Jun 05 '25 12:06 innocentiv

@innocentiv That's right. This is a problem and it's causing confusion. That's why we're fixing it.

w1am avatar Jun 05 '25 12:06 w1am