OpenSearch
[BUG] Negative memory causes _cat APIs to fail
Describe the bug: The _cat/indices API fails with the error below:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Values less than -1 bytes are not supported: -9223372036853731881b"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Values less than -1 bytes are not supported: -9223372036853731881b"
  },
  "status" : 400
}
A similar issue was reported in the Elasticsearch repo: https://github.com/elastic/elasticsearch/issues/55434
To Reproduce: There is no clean way to reproduce this.
Expected behavior: _cat APIs should not fail with the error described.
The root cause is the calculation here: https://github.com/opensearch-project/OpenSearch/blob/10bff0c9f5b9ca78b3dc50f5c704dabf41b9d535/server/src/main/java/org/opensearch/indices/IndicesQueryCache.java#L117-L139
In the edge case where shardStats is an empty map, the stats map is also empty (stats.size() == 0). That case also yields a zero value for totalSize, so the weight calculation divides by zero, producing an infinite weight that rounds to the max long value for the shard's additionalRamBytesUsed.
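A minimal sketch (not the OpenSearch source; the method name and parameters are stand-ins) of how the empty-map case saturates: dividing 1d by a zero stats size gives positive infinity, and Math.round of an infinite double is defined to return Long.MAX_VALUE.

```java
public class InfiniteWeightDemo {
    // Mirrors the weight branch taken when totalSize == 0:
    // weight = 1d / stats.size(), then Math.round(weight * sharedRamBytesUsed).
    public static long additionalRamBytesUsed(int statsSize, long sharedRamBytesUsed) {
        double weight = 1d / statsSize;              // statsSize == 0 -> Infinity
        return Math.round(weight * sharedRamBytesUsed); // Infinity rounds to Long.MAX_VALUE
    }

    public static void main(String[] args) {
        // Empty stats map: the shard is charged the max long value.
        System.out.println(additionalRamBytesUsed(0, 1024)); // 9223372036854775807
    }
}
```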
Then, when total memory is calculated later, the max long value (line 509) is added to other fields and the sum overflows into a negative number: https://github.com/opensearch-project/OpenSearch/blob/ad1c8038b01d6d82e5393d73bcbf28a43bb97bc2/server/src/main/java/org/opensearch/action/admin/indices/stats/CommonStats.java#L503-L516
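The overflow itself is ordinary two's-complement wraparound: adding any positive value to Long.MAX_VALUE wraps past Long.MIN_VALUE, which is why the reported byte count is a huge negative number. A sketch (addMemoryFields is a hypothetical stand-in for the summation in CommonStats):

```java
public class OverflowDemo {
    // Stand-in for the field-by-field memory summation; long addition
    // silently overflows rather than throwing.
    public static long addMemoryFields(long a, long b) {
        return a + b;
    }

    public static void main(String[] args) {
        long total = addMemoryFields(Long.MAX_VALUE, 742);
        // Wraps to a large negative value, matching the shape of the
        // -9223372036853731881b seen in the error response.
        System.out.println(total);
    }
}
```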
Based on the comment about distributing shared RAM usage, I suspect that additionalRamBytesUsed should be zero in the case where the stats map is empty.
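A hedged sketch of that proposal, assuming the fix is an early-out guard (the method and parameter names here are illustrative, not the actual OpenSearch signatures): when there are no tracked shards, attribute no shared RAM instead of computing an infinite weight.

```java
public class EmptyStatsFix {
    // Proposed guard: skip the weight calculation entirely when the stats
    // map is empty, so no shard is charged Long.MAX_VALUE.
    public static long additionalRamBytesUsed(int statsSize, long totalSize, long shardSize, long sharedRamBytesUsed) {
        if (statsSize == 0) {
            return 0L; // no shards to distribute shared RAM across
        }
        // Unchanged weight logic: even split when totalSize == 0,
        // proportional to shard cache size otherwise.
        double weight = totalSize == 0 ? 1d / statsSize : (double) shardSize / totalSize;
        return Math.round(weight * sharedRamBytesUsed);
    }

    public static void main(String[] args) {
        System.out.println(additionalRamBytesUsed(0, 0, 0, 1024)); // 0 instead of Long.MAX_VALUE
        System.out.println(additionalRamBytesUsed(2, 0, 0, 1024)); // 512: even split across two shards
    }
}
```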
Good debugging @dbwiddis :) let’s see a fix?
Alas, while I can backtrace call hierarchies like a pro, I'm not at all familiar with this section of the code. Happy to submit a PR if someone can tell me the best resolution. Always return 0? Cap the weight at 1? Some other solution? Is the empty map an indication of a deeper bug elsewhere?
@dbwiddis I think your last note about a deeper bug elsewhere is likely the cause. From what I can tell, in this particular flow there are shards present (hence the call to getStats(ShardId shard)), yet from the shardStats cache's perspective there are no shards at all (the map is empty).
While fixing this I believe I isolated the root cause: https://github.com/opensearch-project/OpenSearch/blob/10bff0c9f5b9ca78b3dc50f5c704dabf41b9d535/server/src/main/java/org/opensearch/indices/IndicesQueryCache.java#L198-L204
We are in a situation where all documents have been removed from the shard, but the cache still retains cached filters from previous queries. This comment (or a variant of it) appears in clearIndex() and close(), both of which clear the cache; but if a user empties the cache via remove(), memory is still attributed to it. To release that memory and clear the (now-empty) cache, one of those two methods must be called.
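A toy model of that asymmetry (these are assumptions for illustration, not OpenSearch code): remove() deletes the per-shard entry but leaves the shared RAM counter untouched, while only an explicit clear()-style call releases both.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedRamLeakDemo {
    final Map<String, Long> shardStats = new HashMap<>();
    long sharedRamBytesUsed;

    // Caching a filter records bytes both per shard and in the shared counter.
    void onCached(String shard, long bytes) {
        shardStats.merge(shard, bytes, Long::sum);
        sharedRamBytesUsed += bytes;
    }

    // Removing the last entry empties the map but does NOT release shared RAM.
    void remove(String shard) {
        shardStats.remove(shard);
    }

    // Only an explicit clear (as in clearIndex()/close()) resets both.
    void clear() {
        shardStats.clear();
        sharedRamBytesUsed = 0;
    }

    public static void main(String[] args) {
        SharedRamLeakDemo cache = new SharedRamLeakDemo();
        cache.onCached("shard0", 256);
        cache.remove("shard0");
        // Empty map, yet memory is still attributed: the bug's trigger state.
        System.out.println(cache.shardStats.isEmpty() + " " + cache.sharedRamBytesUsed); // true 256
        cache.clear();
        System.out.println(cache.sharedRamBytesUsed); // 0
    }
}
```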
> To Reproduce: There is no clean way to reproduce this.
A reproducing test case already existed in o.o.i.IndicesQueryCacheTests.testBasics(), but total memory was not being asserted: it holds Long.MAX_VALUE after the last portion of the test (after onClose(), before close()).
Just encountered this on a cluster running three OpenSearch 1.3.11 nodes and one 1.3.8 node (in the middle of a failover to the newer version); clearing caches fixed the error.