OpenSearch
[Remote Store] _cat/recovery API provides inconsistent results
Describe the bug
- When compared with the total initializing shards in the `_cluster/health` API, `_cat/recovery?active_only` shows an inconsistent count of recoveries in progress.
- The translog download step doesn't populate the translog recovery stats.
Related component
Storage:Remote
Expected behavior
Consistent API results
[Triage - attendees 1 2 3 4 5 6 7 8] @Bukhtawar Thanks for filing this issue, this is a very confusing experience and it would be good to address.
Hi, I would like to take this ticket. Is there a way to reproduce the bug?
While conducting my investigation, I welcome your insights and recommendations on specific areas to focus on.
Yes, I believe this should be reproducible on a multi-node setup hosting shards of a few hundred MB, where we exclude the IP of one node and trigger relocation of the shards on the excluded node.
Then compare the output of `_cluster/health` (the "initializing/relocating" counts) and `_cat/recovery?active_only` to see the discrepancy in count.
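For illustration, here is a minimal sketch in Python of the comparison being described. The field names follow the real `_cluster/health` and `_cat/recovery?format=json` responses, but the sample payloads are invented, not captured from a cluster:

```python
# Sketch of the comparison described above: the number of moving shards
# according to _cluster/health vs. the number of rows returned by
# _cat/recovery?active_only=true (mocked data, not real cluster responses).

def in_progress_from_health(health):
    """Shards that _cluster/health still considers moving."""
    return health["initializing_shards"] + health["relocating_shards"]

def in_progress_from_cat_recovery(rows):
    """Rows returned by _cat/recovery?active_only=true&format=json."""
    return len(rows)

# Invented sample payloads mimicking the two APIs mid-relocation.
health = {"initializing_shards": 1, "relocating_shards": 2}
rows = [
    {"index": "idx", "shard": "0", "stage": "index"},
    {"index": "idx", "shard": "1", "stage": "translog"},
]

# Per the bug report, these two numbers can disagree:
print(in_progress_from_health(health), in_progress_from_cat_recovery(rows))
```

If the bug is present, the two printed numbers will differ even though both APIs are supposed to describe the same set of in-flight recoveries.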
@Bukhtawar Thanks! Could you elaborate a bit on "exclude IP of one node"? Do you mean excluding the node from shard allocation?
Would the following scenario be a good candidate?
Imagine a two node cluster, Node 1 having three primary shards, Node 2 being empty.
```mermaid
flowchart TB
Primary_A
Primary_B
Primary_C
subgraph "Node 1"
Primary_A
Primary_B
Primary_C
end
subgraph "Node 2"
end
```
Next, we exclude the Node 1 from a shard allocation:
```
PUT _cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.exclude._ip" : "_Node 1 IP_"
  }
}
```
This should (if I am not mistaken) trigger relocation of all shards from Node 1 to Node 2.
```mermaid
flowchart TB
Primary_A--"Replicating"-->Replica_A
Primary_B--"Replicating"-->Replica_B
Primary_C--"Replicating"-->Replica_C
subgraph "Node 1"
Primary_A
Primary_B
Primary_C
end
subgraph "Node 2"
Replica_A
Replica_B
Replica_C
end
```
Now, while shards are being replicated, we can request _cluster/health and _cat/recovery?active_only (as discussed previously) and that should give us inconsistent counts, correct?
I assume we need the shards to be of a "larger size" only to make sure the relocation activity takes some time (enough time for us to request the counts and compare them). How about if we instead throttle the amount of data transferred during recovery? The shards could then be quite small but would still take some time to move. Do you think this would also reproduce the issue?
The point is that if throttling works, we should be able to implement a regular unit test.
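For reference, recovery bandwidth can be throttled with the standard dynamic cluster setting `indices.recovery.max_bytes_per_sec` (the value below is an arbitrarily low example, not a recommendation):

```
PUT _cluster/settings
{
  "transient" : {
    "indices.recovery.max_bytes_per_sec" : "40kb"
  }
}
```

With a deliberately low rate, even a small shard stays in recovery long enough to query both APIs while it is in flight.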
Hi @Bukhtawar
I was looking at this and I found that the following integration test is already testing something very similar:
./server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexRecoveryIT.java
For example it has a test called testRerouteRecovery() that uses the following scenario:
- It starts a cluster with a single node (A)
- It creates a new index with a single shard and no replicas
- Then it adds a new node (B) to the cluster
- Then it slows down recoveries (using `RecoverySettings.INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING`)
- Then it forces relocation of the shard using an "admin.cluster.reroute" request (a slightly different strategy than discussed above, but it still triggers the recovery process)
- It checks the count of active (i.e. stage != DONE) shard recoveries
- ...
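The "stage != DONE" filter from that test can be mimicked outside the Java test framework. A hedged sketch, where the rows are shaped like `_cat/recovery?format=json` output but the sample data is invented:

```python
# Sketch of the IT's "stage != DONE" filter over _cat/recovery-style rows.
# Sample rows are invented; field names mirror _cat/recovery?format=json.

def active_recoveries(rows):
    """Keep only recoveries that have not reached the DONE stage."""
    return [r for r in rows if r["stage"].lower() != "done"]

rows = [
    {"index": "idx", "shard": "0", "stage": "done"},
    {"index": "idx", "shard": "1", "stage": "index"},
    {"index": "idx", "shard": "2", "stage": "translog"},
]

print(len(active_recoveries(rows)))
```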
I was experimenting: I modified some tests and added an "admin.cluster.health" request to obtain the initializing and relocating shard counts, and so far I was not able to replicate the count discrepancy.
Do you think it could be because the size of the index in the test is quite small (just a couple of hundred KB)? Though the test explicitly makes sure the counts are obtained while the recovery process is throttled and the shard recovery stage is not DONE (in other words, the counts are compared while the recovery is still running).
However, there is still another question I wanted to ask. Did you have anything specific in mind when you said:
The translog download step doesn't populate the translog recovery stats
Can you elaborate on this please?
I will push modification of the test tomorrow so that you can see what I mean.
@Bukhtawar
Please see https://github.com/opensearch-project/OpenSearch/pull/12792. I believe this is a fairly detailed attempt to reproduce the issue. Unfortunately, it does not reproduce it currently (the test passes, which means the issue does not materialize).
Can you think of any hints about what to change in order to recreate the issue?
For example, do you think the shard recovery stage has to be `Stage.TRANSLOG`? Notice that in the IT the stage is currently `Stage.INDEX`.
Adding @sachinpkale for his thoughts as well. Will take a look shortly