
[BUG] Lowering of disk watermark should trigger index allocation

Open kkewwei opened this issue 2 years ago • 2 comments

Describe the bug In the cluster, the disk usage is higher than the high watermark, and a new index (index1) is created; as expected, none of index1's shards are assigned. The disk usage then decreases for some reason, so it is now lower than the high watermark but still higher than the low watermark, yet the primary shards of index1 are still not assigned, which is not normal. The result of the API _cluster/allocation/explain is as follows:

{
   "index": "index1",
   "shard": 0,
   "primary": true,
   "current_state": "unassigned",
   "unassigned_info": {
      "reason": "INDEX_CREATED",
      "at": "2023-09-08T03:52:14.851Z",
      "last_allocation_status": "no"
   },
   "can_allocate": "yes",
   "allocate_explanation": "Elasticsearch can allocate the shard.",
   "target_node": {
      "id": "VoMSBNvQTyuxjSndDoG0qg",
      "name": "node2",
      "transport_address": "127.0.0.1:9302",
      "attributes": {
         "xpack.installed": "true"
      }
   },
   "node_allocation_decisions": [
      {
         "node_id": "VoMSBNvQTyuxjSndDoG0qg",
         "node_name": "node2",
         "transport_address": "127.0.0.1:9302",
         "node_attributes": {
            "xpack.installed": "true"
         },
         "node_decision": "yes",
         "weight_ranking": 1
      }

The explain output shows that "Elasticsearch can allocate the shard", yet the shard never actually gets assigned.

To Reproduce Steps to reproduce the behavior (the current disk usage is 50%):

  1. Create file1 with a size of 10GB.
  2. Set cluster.routing.allocation.disk.watermark.high to 49.999%.
  3. Create index1.
  4. Delete file1; the disk usage drops below 49.999%.

The shards of index1 remain unassigned (steps 2 and 3 are sketched in code below).
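
For reference, steps 2 and 3 can be driven through the Java high-level REST client. This is only a sketch under assumptions: the HTTP endpoint 127.0.0.1:9200 is assumed, the low watermark is also lowered so the low <= high <= flood_stage ordering validation still passes, and the 7.10.2 OSS distribution uses the equivalent org.elasticsearch.* classes instead of org.opensearch.*.

import org.apache.http.HttpHost;
import org.opensearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.opensearch.client.RequestOptions;
import org.opensearch.client.RestClient;
import org.opensearch.client.RestHighLevelClient;
import org.opensearch.client.indices.CreateIndexRequest;
import org.opensearch.common.settings.Settings;

public class WatermarkRepro {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")))) {

            // Step 2: lower the high watermark below the current disk usage.
            // The low watermark is also lowered here (assumption, not in the
            // original steps) so the watermark ordering validation still passes.
            ClusterUpdateSettingsRequest settings = new ClusterUpdateSettingsRequest();
            settings.transientSettings(Settings.builder()
                    .put("cluster.routing.allocation.disk.watermark.low", "49.99%")
                    .put("cluster.routing.allocation.disk.watermark.high", "49.999%")
                    .build());
            client.cluster().putSettings(settings, RequestOptions.DEFAULT);

            // Step 3: create index1; all of its shards stay unassigned because
            // every node is currently above the high watermark.
            client.indices().create(new CreateIndexRequest("index1"), RequestOptions.DEFAULT);
        }
    }
}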

Expected behavior Currently we only call reroute when the disk usage drops below the low watermark; we should also call reroute when the disk usage drops below the high watermark while still being above the low watermark, since primaries of newly created indices become allocatable at that point. https://github.com/opensearch-project/OpenSearch/blob/c99ba635c8a7652cc6618b71e15ddde335008b8b/server/src/main/java/org/opensearch/cluster/routing/allocation/DiskThresholdMonitor.java#L231
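
A minimal sketch of the direction such a fix could take; the class, field, and method names here are illustrative assumptions, not a verbatim excerpt of the linked DiskThresholdMonitor code.

import java.util.HashSet;
import java.util.Set;

// Sketch of the proposed monitor behaviour: also request a reroute when a node
// drops from above the high watermark into the band between the two watermarks.
public class RerouteOnHighWatermarkCleared {

    private final Set<String> nodesOverLowThreshold = new HashSet<>();
    private final Set<String> nodesOverHighThreshold = new HashSet<>();

    /** Returns true if a reroute should be requested for this node's new disk usage. */
    boolean checkNode(String nodeId, long freeBytes, long freeBytesThresholdLow, long freeBytesThresholdHigh) {
        boolean reroute = false;
        if (freeBytes < freeBytesThresholdHigh) {
            // Case 1: over the high watermark; shards are moved away and a reroute runs.
            nodesOverLowThreshold.add(nodeId);
            nodesOverHighThreshold.add(nodeId);
            reroute = true;
        } else if (freeBytes < freeBytesThresholdLow) {
            // Case 2: between the watermarks. Today no reroute is requested on this
            // transition; the proposal is to reroute when the node was previously over
            // the high watermark, so unassigned primaries (e.g. of index1) can be allocated.
            if (nodesOverHighThreshold.remove(nodeId)) {
                reroute = true; // proposed change
            }
            nodesOverLowThreshold.add(nodeId);
        } else {
            // Case 3: back under the low watermark; this transition already triggers a reroute.
            if (nodesOverLowThreshold.remove(nodeId)) {
                reroute = true;
            }
            nodesOverHighThreshold.remove(nodeId);
        }
        return reroute;
    }
}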

Screenshots

[2023-09-03T13:56:40,738+0800][INFO][o.e.c.r.a.AllocationService] [master1] Cluster health status changed from [YELLOW] to [RED] (reason: [index [vecompass-log-time-2023.09.03-000014] created]).
[2023-09-03T13:56:40,773+0800][INFO][c.a.o.i.i.ManagedIndexCoordinator] [master1] Index [vecompass-log-time-2023.09.03-000014] will be managed by policy [vestack-log-ism]
[2023-09-03T13:57:12,303+0800][WARN][o.e.c.r.a.DiskThresholdMonitor] [master1] high disk watermark [90%] exceeded on [uslvEQqiSc248VPAPQclEg][master0][/es-data/nodes/0] free: 8.4gb[8.5%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete
......
[2023-09-03T14:01:12,507+0800][WARN][o.e.c.r.a.DiskThresholdMonitor] [master1] high disk watermark [90%] exceeded on [m0Sid36tSF-tD_5mC0X8eQ][master2][/es-data/nodes/0] free: 5gb[5.1%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete
[2023-09-03T14:01:12,507+0800][WARN][o.e.c.r.a.DiskThresholdMonitor] [master1] high disk watermark [90%] exceeded on [NTM1PxiiQByNGHa13jvJTw][master1][/es-data/nodes/0] free: 7.4gb[7.6%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete
[2023-09-03T14:02:12,565+0800][INFO][o.e.c.r.a.DiskThresholdMonitor] [master1] high disk watermark [90%] no longer exceeded on [uslvEQqiSc248VPAPQclEg][master0][/es-data/nodes/0] free: 14.5gb[14.8%], but low disk watermark [85%] is still exceeded
[2023-09-03T14:02:12,565+0800][INFO][o.e.c.r.a.DiskThresholdMonitor] [master1] high disk watermark [90%] no longer exceeded on [m0Sid36tSF-tD_5mC0X8eQ][master2][/es-data/nodes/0] free: 11.4gb[11.7%], but low disk watermark [85%] is still exceeded
[2023-09-03T14:02:12,565+0800][INFO][o.e.c.r.a.DiskThresholdMonitor] [master1] high disk watermark [90%] no longer exceeded on [NTM1PxiiQByNGHa13jvJTw][master1][/es-data/nodes/0] free: 13.7gb[14%], but low disk watermark [85%] is still exceeded
......
[2023-09-03T15:44:39,081+0800][INFO][o.e.c.s.ClusterSettings] [master1] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [8] to [9]
[2023-09-03T15:44:39,259+0800][INFO][o.e.c.r.a.AllocationService] [master1] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[vecompass-log-time-2023.09.03-000014][7]]]).
......
[2023-09-03T15:46:48,354+0800][INFO][o.e.c.r.a.DiskThresholdMonitor] [master1] low disk watermark [85%] no longer exceeded on [uslvEQqiSc248VPAPQclEg][master0][/es-data/nodes/0] free: 15.8gb[16.1%]
[2023-09-03T15:46:48,354+0800][INFO][o.e.c.r.a.DiskThresholdMonitor] [master1] low disk watermark [85%] no longer exceeded on [NTM1PxiiQByNGHa13jvJTw][master1][/es-data/nodes/0] free: 14.6gb[15%]
[2023-09-03T15:47:18,416+0800][INFO][o.e.c.r.a.DiskThresholdMonitor] [master1] low disk watermark [85%] exceeded on [NTM1PxiiQByNGHa13jvJTw][master1][/es-data/nodes/0] free: 14.3gb[14.6%], replicas will not be assigned to this node

To get the shard assigned, I changed cluster.routing.allocation.node_concurrent_incoming_recoveries, which triggered a reroute.

Host/Environment (please complete the following information):

  • OS: mac
  • Version: 7.10.2

kkewwei avatar Sep 08 '23 05:09 kkewwei

@anasalkouz @kotwanikunal Can you help confirm this? If it's a bug, I'd be glad to fix it.

kkewwei avatar Mar 27 '24 12:03 kkewwei

@kkewwei Can you help with the initial investigation on this and determine whether it is by design? Feel free to use the thread to discuss/brainstorm ideas and own the issue. Based on your initial findings we can decide whether a PR with a fix is needed.

Looking forward to your contributions on this.

rwali-aws avatar May 08 '24 15:05 rwali-aws

@rwali-aws, thank you for your reply.

It seems that it's not by design. Using diskFreeBytes to denote a node's free disk space:

  1. If diskFreeBytes < freeBytesThresholdHigh, neither primaries nor replicas will be allocated to the node.
  2. If freeBytesThresholdHigh < diskFreeBytes < freeBytesThresholdLow (note that the high watermark leaves less free space than the low one), primaries can be allocated to the node, but replicas will not be.
  3. If diskFreeBytes > freeBytesThresholdLow, both primaries and replicas can be allocated to the node.

Currently we call reroute in cases 1 and 3, but not in case 2, which seems unreasonable; the mapping is sketched below. https://github.com/opensearch-project/OpenSearch/blob/c99ba635c8a7652cc6618b71e15ddde335008b8b/server/src/main/java/org/opensearch/cluster/routing/allocation/DiskThresholdMonitor.java#L212
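
To make the three bands concrete, here is a small illustrative mapping. It is a sketch only: the class and method names are made up for this example, it is not the actual DiskThresholdDecider/DiskThresholdMonitor code, and the real thresholds can also be configured as usage percentages rather than free bytes.

public class WatermarkBands {
    // The high watermark leaves LESS free space than the low one,
    // so freeBytesThresholdHigh < freeBytesThresholdLow.
    static String describe(long diskFreeBytes, long freeBytesThresholdLow, long freeBytesThresholdHigh) {
        if (diskFreeBytes < freeBytesThresholdHigh) {
            return "case 1: over high watermark - no primaries, no replicas (reroute is called)";
        } else if (diskFreeBytes < freeBytesThresholdLow) {
            return "case 2: between watermarks - new primaries allowed, replicas blocked (no reroute today)";
        } else {
            return "case 3: under low watermark - primaries and replicas allowed (reroute is called)";
        }
    }
}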

kkewwei avatar Jun 23 '24 10:06 kkewwei