[Remote Store] _cat/recovery APIs provides inconsistent results

Open Bukhtawar opened this issue 1 year ago • 2 comments

Describe the bug

  1. When compared with the total number of initialising shards reported by the _cluster/health API, _cat/recovery?active_only shows an inconsistent count of in-progress recoveries (illustrated in the sketch below).
  2. The translog download step doesn't populate the translog recovery stats
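
For illustration only, the comparison in point 1 boils down to something like the following sketch (written against the Java admin client as it would be available in an integration test; this is not code from the repository and the exact accessors may differ):

// Rough sketch of the two views that point 1 says disagree. Assumes an
// OpenSearchIntegTestCase-style context where client() is available;
// ClusterHealthResponse, RecoveryResponse and RecoveryState come from org.opensearch.* packages.
ClusterHealthResponse health = client().admin().cluster().prepareHealth().get();
int initializing = health.getInitializingShards();
int relocating = health.getRelocatingShards();

// active_only returns only ongoing (not yet DONE) recoveries, like _cat/recovery?active_only
RecoveryResponse recoveries = client().admin().indices().prepareRecoveries().setActiveOnly(true).get();
long activeRecoveries = recoveries.shardRecoveryStates().values().stream()
        .mapToLong(java.util.List::size)
        .sum();

// Reported problem: while shards are initialising/relocating, activeRecoveries
// does not line up with the initializing/relocating counts above.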

Related component

Storage:Remote

Expected behavior

Consistent API results

Bukhtawar avatar Jan 27 '24 06:01 Bukhtawar

[Triage] @Bukhtawar Thanks for filing this issue; this is a very confusing experience and it would be good to address it.

peternied avatar Jan 31 '24 16:01 peternied

Hi, I would like to take this ticket. Is there a way to reproduce the bug?

While conducting my investigation, I welcome your insights and recommendations on specific areas to focus on.

lukas-vlcek avatar Feb 16 '24 12:02 lukas-vlcek

Yes, I believe this should be reproducible on a multi-node setup hosting shards of a few hundred MBs: exclude the IP of one node to trigger relocation of the shards off the excluded node, then compare the "initialising/relocating" counts from _cluster/health with the output of _cat/recovery?active_only to see the discrepancy.

Bukhtawar avatar Feb 21 '24 16:02 Bukhtawar

@Bukhtawar Thanks! Could you elaborate a bit on "exclude IP of one node"? Do you mean excluding the node from shard allocation?

Would the following scenario be a good candidate?

Imagine a two-node cluster, with Node 1 holding three primary shards and Node 2 being empty.

flowchart TB
    Primary_A
    Primary_B
    Primary_C
    subgraph "Node 1"
    Primary_A
    Primary_B
    Primary_C
    end
    subgraph "Node 2"
    end

Next, we exclude Node 1 from shard allocation:

PUT _cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.exclude._ip" : "_Node 1 IP_"
  }
}

This should (if I am not mistaken) trigger replication of all shards from Node 1 to Node 2.

flowchart TB
    Primary_A--"Replicating"-->Replica_A
    Primary_B--"Replicating"-->Replica_B
    Primary_C--"Replicating"-->Replica_C
    subgraph "Node 1"
    Primary_A
    Primary_B
    Primary_C
    end
    subgraph "Node 2"
    Replica_A
    Replica_B
    Replica_C
    end

Now, while shards are being replicated, we can request _cluster/health and _cat/recovery?active_only (as discussed previously) and that should give us inconsistent counts, correct?

I assume we need the shards to be of a "larger size" only so that the replication activity takes some time (enough for us to request the counts and compare them). What if we instead throttle the recovery bandwidth? The shards could then be quite small but would still take some time to copy. Do you think this would also reproduce the issue?

The point is that if throttling is possible, we should be able to implement a regular unit test.
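
To make the idea concrete, I am thinking of something along these lines (just a sketch from a test context; the 50kb value is arbitrary):

// Sketch: throttle recovery bandwidth so that even small shards take a while to copy.
// Assumes an integration-test context where client() is available; RecoverySettings
// is org.opensearch.indices.recovery.RecoverySettings.
client().admin().cluster().prepareUpdateSettings()
        .setTransientSettings(Settings.builder()
                .put(RecoverySettings.INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING.getKey(), "50kb"))
        .get();

// ...index a small amount of data, trigger the relocation, then compare the
// _cluster/health counts with the active recoveries before removing the throttle.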

lukas-vlcek avatar Feb 23 '24 13:02 lukas-vlcek

Hi @Bukhtawar

I was looking at this and found that the following integration test already covers something very similar:

./server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexRecoveryIT.java

For example, it has a test called testRerouteRecovery() that uses the following scenario:

  1. It starts a cluster with a single node (A)
  2. It creates a new index with a single shard and no replicas
  3. Then it adds a new node (B) to the cluster
  4. Then it slows down recoveries (using RecoverySettings.INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING)
  5. Then it forces relocation of the shard using an "admin.cluster.reroute" request (a slightly different strategy than discussed above, but it still triggers the recovery process; see the sketch after this list)
  6. It checks the count of active (i.e. stage != DONE) shard recoveries, etc.
  7. ...
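
For reference, the reroute step in (5) looks roughly like this (a sketch only; the index name, shard id and node names are placeholders):

// Sketch of step 5: move shard 0 of the index from node A to node B, which starts
// a recovery on node B. MoveAllocationCommand lives in
// org.opensearch.cluster.routing.allocation.command.
client().admin().cluster().prepareReroute()
        .add(new MoveAllocationCommand("test-index", 0, "node_A", "node_B"))
        .get();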

I experimented with modifying some of these tests, adding an "admin.cluster.health" request to obtain the initializing and relocating shard counts, but so far I have not been able to spot/replicate the count discrepancy.

Do you think this could be because the size of the index in the test is quite small (just a couple of hundred KBs)? Note that the test explicitly makes sure the counts are obtained while the recovery process is throttled and the shard recovery stage is not DONE (in other words, the counts are compared while the recovery is still running).

However, there is another question I wanted to ask. Did you have anything specific in mind when you said:

The translog download step doesn't populate the translog recovery stats

Can you elaborate on this please?
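
Just to make sure we are talking about the same thing: I assume this refers to the translog section of the per-shard recovery state, i.e. roughly the numbers below (a sketch using the usual RecoveryState accessors; "recoveries" would be a RecoveryResponse obtained as in the earlier comparison, and "test-index" is a placeholder):

// Sketch: the translog portion of each shard's recovery state, which (if I read
// point 2 of the issue correctly) is not populated during the remote-store
// translog download step. RecoveryState is org.opensearch.indices.recovery.RecoveryState.
for (RecoveryState state : recoveries.shardRecoveryStates().get("test-index")) {
    RecoveryState.Translog translog = state.getTranslog();
    System.out.println(state.getShardId() + " translog ops recovered: "
            + translog.recoveredOperations() + "/" + translog.totalOperations()
            + " (" + translog.recoveredPercent() + "%)");
}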

I will push a modification of the test tomorrow so that you can see what I mean.

lukas-vlcek avatar Mar 14 '24 19:03 lukas-vlcek

@Bukhtawar

Please see https://github.com/opensearch-project/OpenSearch/pull/12792. I believe it is a fairly detailed attempt to reproduce the issue. Unfortunately, it does not currently reproduce it (the test passes, which means the issue does not materialize).

Can you think of any hints about what to change in order to recreate the issue?

For example, do you think the shard recovery state stage has to be Stage.TRANSLOG? Notice that in the IT the stage is currently Stage.INDEX.
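
In code terms, the question is whether the discrepancy only shows up once some recovery reports the TRANSLOG stage, e.g. (sketch, reusing the RecoveryResponse from the earlier snippets):

// Sketch: does the mismatch only appear while some shard recovery is in the
// TRANSLOG stage rather than the INDEX stage observed in the IT?
boolean anyInTranslogStage = recoveries.shardRecoveryStates().values().stream()
        .flatMap(java.util.List::stream)
        .anyMatch(state -> state.getStage() == RecoveryState.Stage.TRANSLOG);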

lukas-vlcek avatar Mar 20 '24 16:03 lukas-vlcek

Adding @sachinpkale for his thoughts as well. Will take a look shortly

Bukhtawar avatar Mar 21 '24 06:03 Bukhtawar