
ReceiveBody taking extremely long and causing timeouts

Open jonas-green opened this issue 2 years ago • 9 comments

Elasticsearch version: cluster version 7.6.2

.NET runtime version: .NET 5.0; reproduced with NEST 7.10.1, 7.11 and 7.14.1

Operating system version: MacOS 11.5.2 and Debian Linux 10

Description of the problem including expected versus actual behavior: Some requests take an extremely long time. The issue appears inside the client's SearchAsync method.

The delay is inside the internal ReceiveBody stage of CallElasticsearch (Elasticsearch.Net.HttpConnection), which always takes the longest: typically 2-4 seconds (which is fine), but sometimes 20+ or even 50+ seconds (which is not), even though the same requests otherwise take ~2s. By "same requests" I mean that when load testing with the same two requests, each looking up ~1000 objects by ID, some of them inevitably take an unexplainably long time. Beyond roughly the 99th percentile, request time starts to increase exponentially (latency screenshot: Screenshot 2021-08-31 at 11 21 57). In this example the timeout is set to 10s.

I have no way to dig deeper to find where the issue is, since the slow part is an internal method. I should mention that this only affects a small share of requests, usually around 0.5-1%, but at 40-50 requests per second it adds up. This suggests the issue is not deserialization - that would slow every request - but something internal (memory management? thread locks?). I am attaching a trace showing that deserialization usually takes milliseconds (trace screenshot: ReceiveBody).

I discussed this with a colleague who has worked on quite a few ES clusters; he was surprised, because the configuration and indexing looked fine. The issue does not appear when sending the same requests straight to the ES cluster, so it is not a matter of the objects being too big or a cluster problem - it has to be somewhere around the internal ReceiveBody stage. Here is a query profile from dev tools (screenshot: 1000objects).

I thought the issue might be caused by tracing, but removing OpenTracing did not change the number of timeouts. Since this service runs in a cluster, I tried quadrupling the number of service instances and saw absolutely no change in performance. I even added pagination so that requests are split into small batches of 50, and the issue was still there: the smaller requests failed less often, but there were more of them. It was reproduced with both the high-level and low-level clients, and I created a sample solution that only uses ES and was able to replicate the issue there.

Here is a sample query; the only difference in production is that providerIdsLookup can contain up to 1000 IDs (a sketch of the corresponding NEST call follows the JSON):

{
    "from": 0,
    "query": {
        "bool": {
            "filter": [
                {
                    "term": {
                        "status": {
                            "value": "active"
                        }
                    }
                },
                {
                    "term": {
                        "providers": {
                            "value": "TestProvider"
                        }
                    }
                },
                {
                    "terms": {
                        "providerIdsLookup": [
                            "TestProvider-city-8533"
                        ]
                    }
                }
            ]
        }
    },
    "size": 1,
    "sort": [
        {
            "_score": { }
        },
        {
            "id": { }
        }
    ],
    "_source": {
        "includes": [
            "tags",
            "id",
            "status",
            "providerIds"
        ]
    },
    "timeout": "9900ms"
}
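
In code, the call is issued roughly like this (a simplified sketch: Place, client and providerIds stand in for the actual service types, and the field names are taken from the JSON above):

// Sketch of the call producing the query above; Place is a simplified stand-in POCO.
var providerIds = new[] { "TestProvider-city-8533" }; // up to 1000 IDs in production

var response = await client.SearchAsync<Place>(s => s
    .From(0)
    .Size(1)
    .Query(q => q
        .Bool(b => b
            .Filter(
                f => f.Term(t => t.Field("status").Value("active")),
                f => f.Term(t => t.Field("providers").Value("TestProvider")),
                f => f.Terms(t => t.Field("providerIdsLookup").Terms(providerIds)))))
    .Sort(ss => ss.Descending(SortSpecialField.Score).Ascending("id"))
    .Source(src => src.Includes(i => i
        .Field("tags").Field("id").Field("status").Field("providerIds")))
    .Timeout("9900ms"));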

And a sample result:

{
   "took": 6,
   "timed_out": false,
   "_shards": {
       "total": 3,
       "successful": 3,
       "skipped": 0,
       "failed": 0
   },
   "hits": {
       "total": {
           "value": 1,
           "relation": "eq"
       },
       "max_score": 0.0,
       "hits": [
           {
               "_index": "placesXXXXXXXXXX",
               "_type": "_doc",
               "_id": "123456",
               "_score": 0.0,
               "_source": {
                   "providerIds": [
                       {
                           "provider": "TestProvider",
                           "providerId": "city-8533",
                           "isPrimary": true
                       },
                       {
                           "provider": "SampleProvider",
                           "providerId": "Copenhagen",
                           "isPrimary": true
                       }
                   ],
                   "id": 123456,
                   "tags": [
                       {
                           "tag": "continentcode",
                           "value": "EU",
                           "valueCommaSeparated": [
                               "EU"
                           ]
                       },
                       {
                           "tag": "countrycode",
                           "value": "DK",
                           "valueCommaSeparated": [
                               "DK"
                           ]
                       },
                       {
                           "tag": "featurecode",
                           "value": "PPLC",
                           "valueCommaSeparated": [
                               "PPLC"
                           ]
                       },
                       {
                           "tag": "iata",
                           "value": "CPH",
                           "valueCommaSeparated": [
                               "CPH"
                           ]
                       },
                       {
                           "tag": "population",
                           "value": "1153615",
                           "valueCommaSeparated": [
                               "1153615"
                           ]
                       },
                       {
                           "tag": "regioncode",
                           "value": "84",
                           "valueCommaSeparated": [
                               "84"
                           ]
                       },
                       {
                           "tag": "timezone",
                           "value": "Europe/Copenhagen",
                           "valueCommaSeparated": [
                               "Europe/Copenhagen"
                           ]
                       },
                       {
                           "tag": "placefilename",
                           "value": "Copenhagen",
                           "valueCommaSeparated": []
                       },
                       {
                           "tag": "type",
                           "value": "city",
                           "valueCommaSeparated": []
                       },
                       {
                           "tag": "id",
                           "value": "8533",
                           "valueCommaSeparated": []
                       }
                   ],
                   "status": "active"
               }
           }
       ]
   }
}

Steps to reproduce:

  1. Create a service that uses Elasticsearch and call SearchAsync.
  2. Run 20-50 requests per second that look up 800-1000 objects by ID.
  3. Observe timeouts.

Expected behavior: no timeouts for small requests.

I will be happy to provide any further information needed. Thank you!

jonas-green avatar Aug 31 '21 10:08 jonas-green

Hi @jonas-green. My apologies for the delay in responding to this issue. It's hard to say what the cause will be but it could be contention on various resources such as sockets or even the ThreadPool at this scale. So I can understand the scenario, is this an ASP.NET (Core) application accepting 40-50 requests per second where the Action then calls out to Elasticsearch?

To start with, can you please confirm which version of the .NET runtime you are using? Older versions had some issues with HTTP performance on Linux. Could you also please share your ElasticClient configuration code, including the ConnectionPool type and the number of nodes you are using?

It may also be worthwhile to collect some TCP and ThreadPool statistics so we can see what those look like when under load.

Did you profile to understand where the delay is occurring in the code? ReceiveBody is not actually a method, but a diagnostic event NEST emits before reading the response stream. At this point the code is calling down into the built-in HttpClient code and delays would likely be network or port contention. It should already have the headers of the response at this stage though. At scale, it may be that you're bumping into the connection limit so requests are queuing. You could try configuring the client to increase the ConnectionLimit setting, which defaults to 80 connections, and see if that alleviates the delays. On older .NET runtimes it may be even more restrictive to work around issues with the underlying .NET HTTP implementation.
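
For example, something along these lines (a sketch; the node URL is a placeholder and the limit should be tuned to your workload):

using System;
using Elasticsearch.Net;
using Nest;

// Hypothetical setup: a single node behind a placeholder URL.
var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var settings = new ConnectionSettings(pool)
    // Defaults to 80 concurrent connections per endpoint; raise it for heavier fan-out.
    .ConnectionLimit(200);
var client = new ElasticClient(settings);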

stevejgordon avatar Sep 06 '21 09:09 stevejgordon

Thank you for the response and the tips! Increasing the ConnectionLimit helped a bit, but the issue is still there.

So I can understand the scenario, is this an ASP.NET (Core) application accepting 40-50 requests per second where the Action then calls out to Elasticsearch?

Yes, although the service actually accepts many more requests overall; the issues only start once SearchAsync reaches around 40-50 requests per second. The service consists of six Docker nodes and handles 500+ req/s in total. Increasing the ConnectionLimit actually allowed many more SearchAsync requests to be processed - it seems 40-50 per second was the previous ceiling, because no more could get through. The number of errors dropped a bit, but they still occur and some requests still perform extremely badly. I set the timeout to 30s and some requests still failed, even though the same 1000 objects were being searched for.

Screenshot 2021-09-08 at 10 56 22

Could you also please share your configuration code for the ElasticClient with any configuration you have provided and the ConnectionPool + number of nodes you are using.

Originally, the whole configuration was:

var connectionPool = new SingleNodeConnectionPool(new Uri(this.ConnectionString));
var connectionSettings = new ConnectionSettings(connectionPool);
var client = new ElasticClient(connectionSettings);

But now I have added .ConnectionLimit(-1) just for testing purposes. The Elastic cluster itself has 14 nodes: 3 master-eligible nodes (one of which is the elected master), 8 data nodes and 3 coordinating nodes. It sits behind a load balancer, so only a single URL is needed.

Can you please confirm which version of the .NET runtime you are using?

The service is using .NET 5.0.

Did you profile to understand where the delay is occurring in the code?

As I mentioned, it's inside the client's internal SearchAsync path. I can't dig any deeper, since it's internal code (it isn't even rendered properly and shows up greyed out); all I know is what I can see from the stack trace. In the Elasticsearch.Net namespace, HttpConnection.cs references ReceiveBody at lines 103 and 170, so it might not be that exact spot but something else within that part of the stack trace (it could be any of the ReadAsStreamAsync().GetAwaiter().GetResult() calls).

What is weird to me is that, from the stack trace, deserialising appears to start right after the ReceiveBody span begins, but then never finishes. The connection has already been established at this point, since the headers have been sent and received, so I am not sure the ConnectionLimit is to blame.

Collect some TCP and ThreadPool statistics

Here are some stats from one of the requests that took a disproportionately long time under high load:

# TCP states:
  Established: 18
  TimeWait: 22

# ThreadPool statistics:
  Worker: 
    Busy: 1
    Free: 32766
    Min: 80
    Max: 32767
  IOCP: 
    Busy: 0
    Free: 1000
    Min: 80
    Max: 1000

And from a request where even sending and receiving the headers took 8+ seconds:

# TCP states:
  Established: 87
  TimeWait: 81
  FinWait1: 2

# ThreadPool statistics:
  Worker: 
    Busy: 20
    Free: 32747
    Min: 80
    Max: 32767
  IOCP: 
    Busy: 0
    Free: 1000
    Min: 80
    Max: 1000
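
I gathered these with a small helper along these lines (a simplified sketch, not the exact service code):

using System;
using System.Linq;
using System.Net.NetworkInformation;
using System.Threading;

static void DumpStats()
{
    // TCP states: count active connections per state.
    var states = IPGlobalProperties.GetIPGlobalProperties()
        .GetActiveTcpConnections()
        .GroupBy(c => c.State);
    Console.WriteLine("# TCP states:");
    foreach (var group in states)
        Console.WriteLine($"  {group.Key}: {group.Count()}");

    // ThreadPool statistics: busy = max - available.
    ThreadPool.GetAvailableThreads(out var workerFree, out var iocpFree);
    ThreadPool.GetMinThreads(out var workerMin, out var iocpMin);
    ThreadPool.GetMaxThreads(out var workerMax, out var iocpMax);
    Console.WriteLine("# ThreadPool statistics:");
    Console.WriteLine($"  Worker: Busy: {workerMax - workerFree}, Free: {workerFree}, Min: {workerMin}, Max: {workerMax}");
    Console.WriteLine($"  IOCP: Busy: {iocpMax - iocpFree}, Free: {iocpFree}, Min: {iocpMin}, Max: {iocpMax}");
}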

If you need any more information, I am happy to provide it. Thank you, and I hope we can find a solution!

jonas-green avatar Sep 07 '21 13:09 jonas-green

Thanks for the extra information @jonas-green.

Indeed, by the time the ReceiveBody diagnostic is emitted, we should have the initial response from the server and its headers. After that, we're just waiting for the body to stream down before deserialising it. The fact that it's delayed there points to a possible network/transport issue. You mentioned that quadrupling the number of instances of this service didn't help, which also tends to suggest it may not be client-side resource contention but something between the client and server. Do you happen to have access to any logs from the proxy in front of Elasticsearch to see if those show anything useful?

When you tested directly with ES, were you achieving the same combined load over the network and to the nodes as the client application reaches? You could enable the slow log on the server nodes and review those to validate whether any queries appear there.

Something you could also try is attempting to repro this without using NEST and instead issue a search via HttpClient directly. It would help isolate whether this is within the client itself or perhaps the .NET stack or the network.
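
For example, something like this (a minimal sketch; the URL and index name are placeholders):

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class RawSearch
{
    // Reuse a single HttpClient so connection pooling behaves like the NEST setup.
    private static readonly HttpClient Http = new HttpClient { BaseAddress = new Uri("http://localhost:9200") };

    public static async Task<string> SearchAsync(string index, string queryJson)
    {
        using var request = new HttpRequestMessage(HttpMethod.Post, $"/{index}/_search")
        {
            Content = new StringContent(queryJson, Encoding.UTF8, "application/json")
        };

        // ResponseHeadersRead returns as soon as the headers arrive, so timing the
        // body read below corresponds roughly to the client's ReceiveBody stage.
        using var response = await Http.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}

If the tail latencies show up with that as well, it points away from NEST itself.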

You mentioned a stack trace, do you have one you could share please? Also, could you please capture the contents of the DebugInformation property on some of the slower responses? When you're testing this and capturing stack traces etc., are you also running on Linux?

stevejgordon avatar Sep 09 '21 09:09 stevejgordon

Hello again,

After recreating the same test using only HttpClient, the issue was still present... so it is not specific to the NEST library and looks more like a network problem.

When load testing with, for example, 100 threads, I get 100 connections established to the Elastic cluster. After a few minutes the connections start to close and new ones are created. It seems like most of the slow queries are the first ones to hit a new connection.

I re-checked the load test that went directly to the ES cluster, and it turned out the queries were not exactly the same. After fixing that, the same graph appeared: a small number of requests took much longer (not as long, but still >7s).

Slow logging did not show anything useful. The DebugInformation for the slower queries is already included in my earlier comment. I can reproduce the issue on macOS 11.5; the Docker nodes run Debian Linux 10.

As I understand it, EnableTcpKeepAlive only helps when requests stop for a while; it does not stop active connections from being closed.
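
For reference, this is the setting I mean (a sketch continuing from the configuration above; the intervals are arbitrary):

// Continuing from the earlier configuration; connectionPool is as before.
var connectionSettings = new ConnectionSettings(connectionPool)
    // Sends TCP keep-alive probes on idle sockets (here: after 2 min, every 30 s).
    // As noted, this does not stop active connections from being recycled.
    .EnableTcpKeepAlive(TimeSpan.FromMinutes(2), TimeSpan.FromSeconds(30));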

The fields I'm searching on (status, providers and providerIdsLookup) are all mapped as keyword. eager_global_ordinals does not seem like a solution, since providerIdsLookup has around 2-3 million unique values, and I doubt memory caching would help anyway, since hitting the same 8 queries reproduces the issue.

Is there any way to fix the slow first request on a new connection, or a setting that keeps connections alive?

jonas-green avatar Sep 21 '21 08:09 jonas-green

@jonas-green did you ever find the cause of this? I am seeing some similar issues with nest - also in a docker setting

alextraul avatar Nov 15 '21 16:11 alextraul

Something you could also try is attempting to repro this without using NEST

That's what we are doing, sadly. We are working on recreating our most-used requests with HttpClient :) I think it could be an issue with HttpClient not closing connections, but since we can't do much inside the Nest package, we are working on our own to find the root cause...

jonas-green avatar Nov 16 '21 14:11 jonas-green

I've recently been testing against an ES server running in a local VM, and when I changed the connection string from localhost:9200 to the IP:9200 I saw a noticeable performance boost. Maybe the problem is in the mechanism NEST uses to connect to the server. Hope this helps!

jiman14 avatar Jan 18 '22 23:01 jiman14

@jiman14 That's interesting extra info, although somewhat surprising I must say. In terms of connections, that's all handled in the .NET HTTP stack so there's little we can influence at the client level in that regard. If you have a repeatable reproduction it would be interesting to see it.

@jonas-green If the issue is still present with pure HttpClient then this might be something we can raise on the .NET repo. For the debug info, I was interested in seeing the audit trail for the slow requests, which appears at the beginning of the debug info, e.g.:

Valid NEST response built from a successful (200) low level call on POST: /test/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
 - [1] ProductCheckOnStartup: Took: 00:00:03.8575995
 - [2] ProductCheckSuccess: Node: https://localhost:9600/ Took: 00:00:03.3856732
 - [3] HealthyResponse: Node: https://localhost:9600/ Took: 00:00:00.6122736
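
If it helps, here's a rough sketch of capturing that only for slow calls (the helper name and the 5-second threshold are arbitrary):

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Nest;

public static class SlowCallLogging
{
    // Wraps a search and prints the debug info (which starts with the audit trail)
    // only when the call is slow.
    public static async Task<ISearchResponse<TDocument>> SearchWithAuditAsync<TDocument>(
        IElasticClient client, Func<SearchDescriptor<TDocument>, ISearchRequest> selector)
        where TDocument : class
    {
        var stopwatch = Stopwatch.StartNew();
        var response = await client.SearchAsync(selector);
        stopwatch.Stop();

        if (stopwatch.Elapsed > TimeSpan.FromSeconds(5))
            Console.WriteLine($"Slow call ({stopwatch.Elapsed}):{Environment.NewLine}{response.DebugInformation}");

        return response;
    }
}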

As to why the connections are closing, I'm not sure. As far as I know, they should remain open. Are you able to capture any Wireshark packet captures? I'm interested to see if we can identify whether it's the client or server actively closing the connection. As for first request time on new connections, it will be slower during the handshake, but shouldn't be as significant as you're seeing.

stevejgordon avatar Feb 01 '22 16:02 stevejgordon

Having the same issue with a setup that's quite similar to OP. Was the issue raised on the .NET repo?

MoazAlkharfan avatar Apr 07 '22 19:04 MoazAlkharfan

I'm closing this as stale. If we can gather more info, such as packet captures, we may be able to assist further and reopen.

stevejgordon avatar Apr 11 '23 09:04 stevejgordon