
No nodes were attempted, this can happen when a node predicate does not match any nodes

Open espjak opened this issue 2 years ago • 4 comments

NEST/Elasticsearch.Net version: 6.8.9

Elasticsearch version: 6.8.23

Description of the problem including expected versus actual behavior: We are having an issue with the following exception being thrown:

Elasticsearch.Net.UnexpectedElasticsearchClientException: No nodes were attempted, this can happen when a node predicate does not match any nodes ---> Elasticsearch.Net.ElasticsearchClientException: No nodes were attempted, this can happen when a node predicate does not match any nodes
--- End of inner exception stack trace ---

What I have gathered:

  • It happens for multiple customers. Some experience it multiple times a week.
  • It seems to happen more for customers with heavy traffic.
  • It recovers by itself.
  • The issue has occurred across multiple 6.8.x versions (both NEST and Elasticsearch).
  • The exception is thrown from multiple code paths. Once it has recovered, the same code path with the same query will succeed.
  • It happens across clusters.

I recently suspected the issue could be that we were creating an excessive number of ElasticClient instances, but to my surprise the issue worsened when I moved to a singleton pattern.

We are using SniffingConnectionPool and most clients connect to 3 nodes (2 data nodes and one arbiter). The Elasticsearch servers run on Windows in Azure; the client also runs in Azure.

ElasticClient creation:

private ElasticClient CreateElasticClient(string connectionString, ElasticOptions elasticOptions)
{
    var useElastic = elasticOptions?.Use ?? false;
    var debugElastic = elasticOptions?.Debug ?? false;
    if (!useElastic || string.IsNullOrWhiteSpace(connectionString))
        return null;

    // The connection string holds one or more node URIs separated by ';', ',' or '|'.
    var connectionStrings = connectionString.SplitList(';', ',', '|');
    var nodes = connectionStrings.Select(conn => new Uri(conn)).ToArray();
    var connection = new HttpConnection();
    var connectionPool = new SniffingConnectionPool(nodes);
    var connectionSettings = new ConnectionSettings(connectionPool, connection, sourceSerializer: (builtin, settings) =>
        new ElasticSearchJsonSerializer(builtin, settings));

    // Pull basic-auth credentials from the first node URI that carries user info.
    var nodeWithUserInfo = nodes.FirstOrDefault(x => !string.IsNullOrWhiteSpace(x.UserInfo));
    if (nodeWithUserInfo != null)
    {
        var userInfo = nodeWithUserInfo.UserInfo.Split(':');
        connectionSettings.BasicAuthentication(userInfo.First(), userInfo.Last());
    }

    connectionSettings.DefaultFieldNameInferrer(p => p); // Prevents lowercasing of property names
    if (debugElastic)
    {
        connectionSettings.EnableDebugMode(details =>
        {
            _logger.LogDebug(details.DebugInformation);
        });
    }

    return new ElasticClient(connectionSettings);
}

Basically, I need help figuring out what could be causing this exception to be thrown.

Steps to reproduce: Unable to reproduce in a controlled environment, but it seems to happen when a large number of queries are in flight.

Provide ConnectionSettings (if relevant): Example connection string (not accessible outside Azure): "connectionString": "http://search01a-noe:9200;http://search01b-noe:9200"

espjak avatar Feb 22 '22 14:02 espjak

Hi, @espjak.

There could be several root causes for this exception, but based on your description, I think the most likely cause is nodes occasionally being unresponsive or taking too long to respond, particularly under heavy load. This isn't necessarily due to a node being offline; transient network issues between the client and the server could also explain the interruption.

The client in your configuration will sniff the cluster state on startup to identify healthy nodes to include in its pool; those nodes are then attempted for outbound requests. It will also track nodes that appear unresponsive and take them out of the pool, sniffing again later to bring healthy nodes back into the pool.

Based on the transient nature of the exceptions you are seeing, coupled with the recovery, this sounds like the pooling is behaving in the way we'd expect if some of the nodes were occasionally unresponsive. It may be that none of the nodes responded or that the max retries or timeout limits were reached.

The DebugInformation you are logging should include an audit trail of the requests, which should provide more information to help understand the possible cause. Could you capture that from your logs when failures occur and provide some samples, please? You may also need the audit trail from prior requests, which could show nodes failing to respond and being marked as dead as a result.
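As a minimal sketch (MyDocument and the logger are placeholders, and the try/catch assumes ThrowExceptions is enabled so failures surface as exceptions, as they appear to in your setup), capturing the debug information for failed calls only might look something like this:

// Sketch only: log the debug information (which includes the audit trail)
// for failed calls. OnRequestCompleted fires after every request/response cycle.
connectionSettings
    .EnableDebugMode()
    .OnRequestCompleted(callDetails =>
    {
        if (!callDetails.Success)
            _logger.LogWarning(callDetails.DebugInformation);
    });

// The same information is also available on the exception itself:
try
{
    var response = client.Search<MyDocument>(s => s.MatchAll());
}
catch (ElasticsearchClientException ex) // base type of UnexpectedElasticsearchClientException
{
    _logger.LogError(ex.DebugInformation);
}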

Given your cluster contains only a few nodes, you could try switching to StaticConnectionPool, which opts out of sniffing behaviour.
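A minimal sketch of that change, reusing the nodes array from your factory method:

// Sketch: StaticConnectionPool keeps the fixed set of seed nodes and never
// sniffs the cluster, though it still marks individual nodes dead/alive.
var connectionPool = new StaticConnectionPool(nodes);
var connectionSettings = new ConnectionSettings(connectionPool, connection, sourceSerializer: (builtin, settings) =>
    new ElasticSearchJsonSerializer(builtin, settings));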

You could also try configuring the ConnectionSettings to increase MaximumRetries for requests and experimenting with the various timeout settings. For example, you could configure a shorter MaxDeadTimeout so that nodes marked as dead are reattempted more quickly.
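A hedged sketch of those settings; the values below are illustrative starting points to experiment with, not recommendations:

// Sketch: tune retry counts and dead-node timings. All values are placeholders.
connectionSettings
    .MaximumRetries(5)                          // attempt a request on up to 5 nodes before failing
    .DeadTimeout(TimeSpan.FromSeconds(30))      // initial period a failing node is marked dead
    .MaxDeadTimeout(TimeSpan.FromMinutes(2))    // upper bound on the dead period as failures repeat
    .MaxRetryTimeout(TimeSpan.FromSeconds(60)); // overall time budget across all retries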

We always recommend reusing the same instance of ElasticClient across your application, as this benefits from pooled HTTP connections and shared node-pool state.
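For example, with Microsoft.Extensions.DependencyInjection, a sketch of registering one shared client for the application's lifetime (CreateElasticClient and the options lookup here are illustrative, standing in for your existing factory method and configuration):

// Sketch: a single IElasticClient instance shared across the application.
services.AddSingleton<IElasticClient>(sp =>
{
    var options = sp.GetRequiredService<IOptions<ElasticOptions>>().Value; // illustrative options lookup
    return CreateElasticClient(connectionString, options);                 // your existing factory method
});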

stevejgordon avatar Feb 23 '22 08:02 stevejgordon

Thanks for the feedback. I will try to get debug information when it happens. The challenge is running debug logging over a long period and catching it in the act.

Usually when there are networking issues, the server is down, or the Elasticsearch server is overloaded, we get more descriptive exceptions, such as timeouts or host unreachable. Also (I forgot to mention this in the initial post), we run multiple customers on the same clusters (usually grouped by region), and while one customer is getting "no nodes were attempted", the other customers have no issues, which should rule out server load or server problems. It could be transient networking issues, but it's all within Azure and should be pretty stable.

I am going to try tweaking the settings you mentioned. I tried to find the default value for MaxDeadTimeout, but it seems to vary depending on which pool is used? Do you have any insight into what the value might be? I don't even know whether we are talking seconds, minutes, or hours here.

When it comes to MaximumRetries, I am having a hard time relating it to this issue. From my understanding, "No nodes were attempted" implies the client did not even try to make the call, because the internal sniffing has identified all nodes as unhealthy. Say you have 5 retries: wouldn't those retries (if retries even happen when all nodes are unhealthy in the sniffing pool) complete so quickly (perhaps with some incremental back-off policy) that they are effectively void? What you are really waiting for is the sniffing to re-check the nodes and mark them healthy again.

If the latter is somewhat true: when all nodes are unhealthy according to the internal sniffing, is the client waiting for some sort of refresh interval, or does it immediately re-check node health (when everything is down)? To me it seems stuck in an unhealthy state for far too long. The Elasticsearch server is fine (all other customers running on it work), the application server itself is fine (we have multiple customers on the same server, in different processes), and for the same customer the MongoDB and Redis connections are working fine.

espjak avatar Feb 24 '22 07:02 espjak

Hi,

I have not been able to get debug logging while it is happening. But could you take a look at what I asked regarding MaxDeadTimeout, MaximumRetries, and how the internal sniffing logic works?

Just an update, which I can now confirm after several tries: using the singleton pattern / caching the ElasticClient makes the error appear MORE frequently.

We are in the process of deploying a new version that lets us capture DebugInformation / audit trails on failed requests. Hopefully I can provide more logging information soon.

espjak avatar May 30 '22 06:05 espjak

We have now run with debug mode on for a while for a customer that experienced this issue multiple times a day. Unfortunately, debug mode did not provide any more information, which I reckon is because no response is actually coming back from the nodes. We logged audit trails as well, and these showed healthy nodes the whole time. However, enabling debug mode significantly reduced how often the issue occurred. Could it be related to direct streaming being turned off when debug mode is on? We also changed to StaticConnectionPool, which had a small impact as well, but not as much as debug mode. We tried StaticConnectionPool without debug mode to check them both in isolation. If direct streaming is the factor, I assume it could be tested in isolation with something like the sketch below.
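A minimal sketch, assuming DisableDirectStreaming is the part of debug mode that matters here:

// Sketch: buffer request/response bytes in memory (as debug mode does)
// without the rest of the debug-mode overhead.
connectionSettings.DisableDirectStreaming();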

Quick summary:

  • Debug mode did not provide any more information
  • Debug mode itself significantly reduced the occurrences of the error
  • StaticConnectionPool helped, but not as much as Debug mode

espjak avatar Jul 05 '22 07:07 espjak

I'm going to close this as it relates to 6.x, which is out of support. If it's reproducible on 7.x we can revisit it.

stevejgordon avatar Feb 13 '23 15:02 stevejgordon