[BUG] Negative memory causes _cat APIs to fail

Open anandpatel9998 opened this issue 3 years ago • 4 comments

Describe the bug

The _cat/indices API is failing with the error below:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Values less than -1 bytes are not supported: -9223372036853731881b"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Values less than -1 bytes are not supported: -9223372036853731881b"
  },
  "status" : 400
}

Found a similar issue in the Elasticsearch repo: https://github.com/elastic/elasticsearch/issues/55434

To Reproduce

Don't have a clean way to reproduce this.

Expected behavior

_cat APIs should not fail with the error described.

anandpatel9998, Sep 09 '22 21:09

The root cause is the calculation here: https://github.com/opensearch-project/OpenSearch/blob/10bff0c9f5b9ca78b3dc50f5c704dabf41b9d535/server/src/main/java/org/opensearch/indices/IndicesQueryCache.java#L117-L139

In the edge case where shardStats is an empty map, the stats map is also empty (stats.size() == 0). This case also leaves totalSize at zero, so the weight is infinite, and the infinite weight rounds to the max long value (Long.MAX_VALUE) for the shard's additionalRamBytesUsed.
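To make that arithmetic concrete, here is a minimal stand-alone sketch. It is not the OpenSearch source: the class name and method signature are hypothetical, and it assumes the weight falls back to 1 / stats.size() when totalSize is zero, as the linked lines appear to do.

```java
// Hypothetical stand-alone sketch; not the actual IndicesQueryCache code.
class InfiniteWeightSketch {

    // Assumed shape of the weight calculation: when totalSize is zero the
    // shared RAM is split evenly across the entries of the stats map.
    static long additionalRamBytesUsed(int statsSize, long totalSize,
                                       long shardCacheSize, long sharedRamBytesUsed) {
        final double weight = totalSize == 0
            ? 1d / statsSize                          // statsSize == 0 yields +Infinity
            : ((double) shardCacheSize) / totalSize;
        // Math.round(double) saturates: +Infinity becomes Long.MAX_VALUE.
        return Math.round(weight * sharedRamBytesUsed);
    }

    public static void main(String[] args) {
        // Empty stats map with some shared RAM in use: saturates to Long.MAX_VALUE.
        System.out.println(additionalRamBytesUsed(0, 0L, 0L, 1024L) == Long.MAX_VALUE); // true
        // Two entries, no cached items: the shared RAM is split evenly.
        System.out.println(additionalRamBytesUsed(2, 0L, 0L, 1024L)); // 512
    }
}
```

Note that under this assumption the bad value only appears when sharedRamBytesUsed is positive: with zero shared RAM, Infinity * 0 is NaN, and Math.round(NaN) is 0.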

Later, when total memory is calculated, the max long value (line 509) is added to the other fields, and the sum overflows: https://github.com/opensearch-project/OpenSearch/blob/ad1c8038b01d6d82e5393d73bcbf28a43bb97bc2/server/src/main/java/org/opensearch/action/admin/indices/stats/CommonStats.java#L503-L516
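The overflow itself is plain Java long arithmetic and can be reproduced in isolation. In the sketch below the class and method names are hypothetical, and 1,043,928 is not a measured value: it is simply the amount that, added to Long.MAX_VALUE, wraps around to the number in the error message.

```java
// Hypothetical stand-alone sketch of the wrap-around; not the CommonStats code.
class TotalMemoryOverflowSketch {

    // Sums per-component byte counts with ordinary long addition,
    // which silently wraps on overflow.
    static long totalMemory(long... parts) {
        long sum = 0L;
        for (long part : parts) {
            sum += part;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Long.MAX_VALUE from the query cache plus an assumed 1,043,928 bytes elsewhere.
        System.out.println(totalMemory(Long.MAX_VALUE, 1_043_928L)); // -9223372036853731881

        // Math.addExact would surface the overflow instead of wrapping.
        try {
            Math.addExact(Long.MAX_VALUE, 1_043_928L);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```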

Based on the comment about distributing shared ram usage, I suspect that the value of additionalRamBytesUsed should be zero in the case the stats map is empty.
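As a rough illustration of that suggestion only (not a proposed patch, and not necessarily the right fix), a guard for the empty map could look like the following. The class name and signature are hypothetical, and the weight formula is the same assumed one as in the linked lines.

```java
// Hypothetical sketch of the "zero when the stats map is empty" idea.
class EmptyStatsGuardSketch {

    static long additionalRamBytesUsed(int statsSize, long totalSize,
                                       long shardCacheSize, long sharedRamBytesUsed) {
        if (statsSize == 0) {
            // No shards are tracked, so there is nothing to distribute shared RAM to.
            return 0L;
        }
        final double weight = totalSize == 0
            ? 1d / statsSize
            : ((double) shardCacheSize) / totalSize;
        return Math.round(weight * sharedRamBytesUsed);
    }

    public static void main(String[] args) {
        System.out.println(additionalRamBytesUsed(0, 0L, 0L, 1024L));   // 0
        System.out.println(additionalRamBytesUsed(4, 0L, 0L, 1024L));   // 256
        System.out.println(additionalRamBytesUsed(2, 10L, 5L, 1000L));  // 500
    }
}
```

This only masks the symptom in the stats calculation; it does not explain why the map is empty while shared RAM is still accounted for.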

dbwiddis, Sep 11 '22 22:09

Good debugging @dbwiddis :) Let's see a fix?

dblock, Sep 13 '22 19:09

let’s see a fix?

Alas, while I can backtrace call hierarchies like a pro, I'm not at all familiar with this section of the code. Happy to submit a PR if someone can tell me the best resolution. Always 0? Max out the weight as 1? Some other solution? Is the empty map an indication of a deeper bug elsewhere?

dbwiddis, Sep 14 '22 20:09

@dbwiddis I think your last note, about a deeper bug elsewhere, is likely the cause. From what I can tell, in this particular flow there are shards present (hence the call to getStats(ShardId shard)); from the perspective of the shardStats cache, however, there seem to be no shards at all (it is empty).

reta, Sep 16 '22 07:09

While fixing this I believe I isolated the root cause: https://github.com/opensearch-project/OpenSearch/blob/10bff0c9f5b9ca78b3dc50f5c704dabf41b9d535/server/src/main/java/org/opensearch/indices/IndicesQueryCache.java#L198-L204

We are in a situation where all documents have been removed from the shard but the cache still retains cached filters from previous queries. This comment (or a variant) appears in clearIndex() and close(), which prompt clearing the cache. But if a user calls remove() and brings the cache to an empty state, some memory is still assigned. To release that memory and clear the (empty) cache, one of those two methods should be called.

dbwiddis, Mar 31 '23 18:03

To Reproduce Don't have a clean way to reproduce this.

A reproducing test case already existed in o.o.i.IndicesQueryCacheTests.testBasics(), but total memory was not being tested. It has a value of Long.MAX_VALUE after the last portion (onClose(), before close()).

dbwiddis, Apr 02 '23 15:04

Just encountered this on a cluster running three OpenSearch 1.3.11 nodes and one 1.3.8 node (in the middle of a failover to the newer version); clearing caches fixed the error.

mhoffm-aiven, Aug 03 '23 11:08