
[BUG] Cluster Health API call can get tripped by circuit breaker

Open Bukhtawar opened this issue 4 years ago • 12 comments

Describe the bug When JVM memory pressure is high, calls to cluster health might fail with a 429 and a circuit_breaking_exception:

[2021-04-05T17:37:46,637][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 1
[2021-04-05T17:37:46,631][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 0
[2021-04-05T17:37:44,838][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 0
[2021-04-05T17:37:44,838][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 0
{
    "error": {
        "root_cause": [
            {
                "type": "circuit_breaking_exception",
                "reason": "[parent] Data too large, data for [<http_request>] would be [2029039272/1.8gb], which is larger than the limit of [2023548518/1.8gb], real usage: [2029039272/1.8gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=5285/5.1kb, in_flight_requests=0/0b, accounting=50225284/47.8mb]",
                "bytes_wanted": 2029039272,
                "bytes_limit": 2023548518,
                "durability": "PERMANENT"
            }
        ],
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [2029039272/1.8gb], which is larger than the limit of [2023548518/1.8gb], real usage: [2029039272/1.8gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=5285/5.1kb, in_flight_requests=0/0b, accounting=50225284/47.8mb]",
        "bytes_wanted": 2029039272,
        "bytes_limit": 2023548518,
        "durability": "PERMANENT"
    },
    "status": 429
}
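To make the numbers in the error body above easier to read, here is a minimal Python sketch that parses the `reason` string and compares real heap usage with the parent breaker limit. The function name `summarize_breaker_error` and the result keys are hypothetical, chosen only for illustration; the JSON is a trimmed copy of the response in this report. It suggests that the request itself reserved no new bytes, and that the breaker tripped on real heap usage, most of which is not accounted for by the child breakers.

```python
import json
import re

# Trimmed copy of the error body from this report (root_cause omitted).
ERROR_BODY = """
{
  "error": {
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [2029039272/1.8gb], which is larger than the limit of [2023548518/1.8gb], real usage: [2029039272/1.8gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=5285/5.1kb, in_flight_requests=0/0b, accounting=50225284/47.8mb]",
    "bytes_wanted": 2029039272,
    "bytes_limit": 2023548518,
    "durability": "PERMANENT"
  },
  "status": 429
}
"""


def summarize_breaker_error(body: str) -> dict:
    """Extract the byte counts from a circuit_breaking_exception response.

    Hypothetical helper for illustration; it only understands the message
    layout shown in this report.
    """
    err = json.loads(body)["error"]
    reason = err["reason"]

    # "real usage: [<bytes>/<human>]" and "new bytes reserved: [<bytes>/<human>]"
    real_usage = int(re.search(r"real usage: \[(\d+)/", reason).group(1))
    reserved = int(re.search(r"new bytes reserved: \[(\d+)/", reason).group(1))

    # Child breaker usages look like "name=<bytes>/<human>".
    children = {name: int(val) for name, val in re.findall(r"(\w+)=(\d+)/", reason)}
    child_total = sum(children.values())

    return {
        "real_usage": real_usage,
        "limit": err["bytes_limit"],
        "over_limit_by": err["bytes_wanted"] - err["bytes_limit"],
        "new_bytes_reserved": reserved,
        "children": children,
        "child_total": child_total,
        # Heap in use that no child breaker is tracking.
        "untracked_heap": real_usage - child_total,
    }


summary = summarize_breaker_error(ERROR_BODY)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With these figures the child breakers together account for roughly 48 MB, so nearly all of the 1.8 GB counted against the limit is general heap usage rather than memory reserved by any tracked request.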

Expected behavior Cluster health calls shouldn't be tripped by the circuit breaker, as they are important and informative and represent the state of the system.

Plugins Please list all plugins currently enabled.

Screenshots If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Bukhtawar avatar Apr 28 '21 15:04 Bukhtawar

Hi @Bukhtawar,

Could you explain more about how to reproduce the issue? It looks like this was fixed in Elasticsearch 5.0 (https://github.com/elastic/elasticsearch/commit/f32b70047241fe319cb37047cc2a47d1b56da6e1). In addition, requests to / are also whitelisted from the circuit breaking exception in Elasticsearch 6.5 (https://github.com/elastic/elasticsearch/commit/027a22abf9684897a81e6ca2216dd38214fb8021).

During my own testing, I didn't find that the "Cluster Health API" call was tripped by the circuit breaker. My steps:

  1. Start OpenSearch beta1 on Ubuntu with default settings.
  2. Set the parent circuit breaker to a low limit: curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent" : {"indices.breaker.total.limit" : "5%"}}'
  3. Check the heap usage: curl "localhost:9200/_cat/nodes?h=heap*&v". This returned "circuit_breaking_exception" in the response.
  4. Check the cluster health: curl "localhost:9200/_cluster/health?pretty". This returned the desired response without error.
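For anyone who wants to repeat steps 2 to 4 above, here is a rough Python sketch using only the standard library. It assumes an unsecured test node at `http://localhost:9200` (as in the curl commands); the helper names `build_requests`, `reset_request`, `send`, and `run_all` are hypothetical. The reset request is an addition that is not part of the steps above; it sets the limit to `null` to restore the default afterwards.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local test node, no TLS or auth


def build_requests(base_url: str = BASE_URL, limit: str = "5%"):
    """Build the three requests from steps 2 to 4, without sending them."""
    settings = {"persistent": {"indices.breaker.total.limit": limit}}
    return [
        # Step 2: lower the parent circuit breaker limit.
        urllib.request.Request(
            f"{base_url}/_cluster/settings",
            data=json.dumps(settings).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        ),
        # Step 3: a regular API call, which may be rejected by the breaker.
        urllib.request.Request(f"{base_url}/_cat/nodes?h=heap*&v", method="GET"),
        # Step 4: cluster health, which is expected to bypass the breaker.
        urllib.request.Request(f"{base_url}/_cluster/health?pretty", method="GET"),
    ]


def reset_request(base_url: str = BASE_URL):
    """Not part of the original steps: restore the default limit afterwards."""
    settings = {"persistent": {"indices.breaker.total.limit": None}}
    return urllib.request.Request(
        f"{base_url}/_cluster/settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(req: urllib.request.Request):
    """Send one request and return (status, body), including for 4xx/5xx."""
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # A tripped breaker is reported as HTTP 429 with a JSON error body.
        return err.code, err.read().decode("utf-8")


def run_all(base_url: str = BASE_URL):
    """Run steps 2 to 4 against a live node and print each status code."""
    for req in build_requests(base_url):
        status, body = send(req)
        print(req.get_method(), req.full_url, "->", status)
        print(body)
```

Calling `run_all()` against a disposable node prints the status of each step; follow it with `send(reset_request())` so the 5% limit does not stay in the persistent settings.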

tlfeng avatar May 04 '21 00:05 tlfeng

Looking into reproducing this issue. Will update.

anshul291995 avatar May 05 '21 04:05 anshul291995

@anshul291995 @Bukhtawar any updates here, what should we do with this?

dblock avatar Jul 16 '21 18:07 dblock

We'll need to try to repro this. I'll see if I can pick this up; any help from community members would be great too.

Bukhtawar avatar Jul 16 '21 18:07 Bukhtawar

@Bukhtawar @dblock would you mind if I try to reproduce and (hopefully) fix it? thanks

reta avatar Aug 18 '21 15:08 reta

So far I can confirm @tlfeng's findings: this is not reproducible for /_cluster/health. The health checks are configured to bypass all circuit breakers, and this applies to both REST and transport actions. More details would certainly help:

  • OpenSearch version?
  • Installed plugins?
  • Where are the logs coming from? (They do not look like OpenSearch server logs.)

reta avatar Aug 18 '21 20:08 reta

> @Bukhtawar @dblock would you mind if I try to reproduce and (hopefully) fix it? thanks

No need to ask for permission! Thank you for contributing.

dblock avatar Aug 31 '21 18:08 dblock

@Bukhtawar could you please help with the details that @reta is asking for? Thanks

minalsha avatar Sep 07 '21 18:09 minalsha

I'll try to see if I can repro.

Bukhtawar avatar Sep 07 '21 19:09 Bukhtawar

Closing this issue. @Bukhtawar, please feel free to reopen in case you are able to reproduce it.

anasalkouz avatar Nov 16 '21 01:11 anasalkouz

Reopening as this is an issue that needs to be fixed.

rramachand21 avatar May 02 '24 12:05 rramachand21

[Triage - attendees 1 2 3 4] @rramachand21 Do you have any additional information about reproducing this? The findings above suggest that this API should be configured to bypass all circuit breakers.

andrross avatar May 08 '24 16:05 andrross

From what I have seen, the underlying issue may also prevent a node from joining the cluster, since the node join call also gets tripped by a circuit breaking exception (CBE), which leads to persistent node drops.

ashking94 avatar Sep 25 '24 06:09 ashking94