cross-cluster-replication

[BUG] _remote/info API returns "connected": false while replication is running in proxy mode

Open skumarp7 opened this issue 1 year ago • 4 comments

What is the bug?

The _remote/info API returns false for the "connected" field even when replication is running seamlessly in proxy mode.

bash-4.4$ curl  -XGET "https://opensearch-1:9200/_remote/info?pretty" 
{
  "leader-site" : {
    "connected" : false,
    "mode" : "proxy",
    "proxy_address" : "localhost:9302",
    "server_name" : "",
    "num_proxy_sockets_connected" : 0,
    "max_proxy_socket_connections" : 18,
    "initial_connect_timeout" : "30s",
    "skip_unavailable" : false
  }
}

If there is more than one ingest node in the follower, the behaviour differs per node: when the API was called on ingest node 1, the result was false throughout, and when it was called on ingest node 2, the result was true throughout.

bash-4.4$ curl  -XGET "https://opensearch-2:9200/_remote/info?pretty" 
{
  "leader-site" : {
    "connected" : true,
    "mode" : "proxy",
    "proxy_address" : "localhost:9302",
    "server_name" : "",
    "num_proxy_sockets_connected" : 0,
    "max_proxy_socket_connections" : 18,
    "initial_connect_timeout" : "30s",
    "skip_unavailable" : false
  }
}
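To compare what each follower node reports in one place, a minimal sketch like the following queries _remote/info on every node and flags any remote alias whose "connected" value differs between nodes. It assumes the pretty-printed layout shown in the outputs above (two-space indentation) and omits TLS/authentication flags; the functions remote_connected, report_nodes and flag_disagreement are hypothetical helpers, not part of OpenSearch.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing _remote/info across follower nodes.
# Assumes the pretty-printed layout shown above and no auth/TLS flags.

# Read pretty-printed _remote/info JSON on stdin and print "<alias> <connected>"
# for every configured remote cluster.
remote_connected() {
  awk '
    /^  "[^"]+" : [{]/ { split($0, p, "\""); remote = p[2] }  # remote alias line
    /^    "connected" :/ {
      v = $0
      sub(/^[^:]*: */, "", v)   # drop the key and colon
      sub(/,.*$/, "", v)        # drop the trailing comma
      print remote, v
    }
  '
}

# Query _remote/info on each node URL passed as an argument and print
# "<node> <alias> connected=<value>" lines (nothing is printed for a node
# that cannot be reached, because curl -f fails silently).
report_nodes() {
  local node remote connected
  for node in "$@"; do
    curl -sf -XGET "${node}/_remote/info?pretty" \
      | remote_connected \
      | while read -r remote connected; do
          printf '%s %s connected=%s\n' "$node" "$remote" "$connected"
        done
  done
}

# Read report_nodes output on stdin; print a MISMATCH line and return 1 if
# any alias reports different "connected" values on different nodes.
flag_disagreement() {
  awk '
    { split($3, kv, "="); seen[$2] = seen[$2] " " kv[2] }
    END {
      rc = 0
      for (a in seen) {
        n = split(seen[a], vals, " ")
        for (i = 2; i <= n; i++) {
          if (vals[i] != vals[1]) { print "MISMATCH " a ":" seen[a]; rc = 1; break }
        }
      }
      exit rc
    }
  '
}
```

For example, `report_nodes https://opensearch-1:9200 https://opensearch-2:9200 | flag_disagreement` should, given the two responses above, print `MISMATCH leader-site: false true` and return a non-zero status, which makes the per-node inconsistency easy to script against.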

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Install leader and follower OpenSearch clusters (each with one ingest, one data and one master node) and create an autofollow replication rule on the follower.
  2. Ensure that the indices are getting replicated to the follower cluster.
  3. Check the connection status using the _remote/info API.
  4. The "connected" field returns false throughout, even though replication is happening.
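The reproduction setup in steps 1 and 2 can be sketched with curl against the follower. This is a minimal sketch under stated assumptions: it omits TLS and authentication flags (such as -k and -u), the helper names are hypothetical, and the rule name and index pattern you pass in are placeholders. The settings cluster.remote.<alias>.mode and cluster.remote.<alias>.proxy_address and the _plugins/_replication/_autofollow and _plugins/_replication/<index>/_status endpoints are the ones documented for OpenSearch; with the Security plugin enabled, the auto-follow request also needs a use_roles object, which is left out here.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction setup, run against the follower cluster.
# Assumptions: no TLS/auth flags; helper names and arguments are hypothetical.

# Step 1a: register the leader as a proxy-mode remote cluster.
configure_proxy_remote() {
  local follower="$1" remote_alias="$2" proxy_address="$3"
  curl -s -XPUT "${follower}/_cluster/settings?pretty" \
    -H 'Content-Type: application/json' \
    -d "{
  \"persistent\": {
    \"cluster.remote.${remote_alias}.mode\": \"proxy\",
    \"cluster.remote.${remote_alias}.proxy_address\": \"${proxy_address}\"
  }
}"
}

# Step 1b: create an auto-follow rule for leader indices matching a pattern.
# (use_roles is omitted; it is required when the Security plugin is enabled.)
create_autofollow_rule() {
  local follower="$1" remote_alias="$2" rule_name="$3" pattern="$4"
  curl -s -XPOST "${follower}/_plugins/_replication/_autofollow?pretty" \
    -H 'Content-Type: application/json' \
    -d "{
  \"leader_alias\": \"${remote_alias}\",
  \"name\": \"${rule_name}\",
  \"pattern\": \"${pattern}\"
}"
}

# Step 2: check the replication status of one follower index
# (for example SYNCING while replicating, or PAUSED).
replication_status() {
  local follower="$1" index="$2"
  curl -s -XGET "${follower}/_plugins/_replication/${index}/_status?pretty"
}
```

For example, `configure_proxy_remote https://opensearch-1:9200 leader-site localhost:9302` followed by `create_autofollow_rule https://opensearch-1:9200 leader-site my-rule 'logs-*'` would recreate a proxy-mode remote named leader-site like the one in the outputs above.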


What is the expected behavior? The "connected" field in _remote/info should consistently be "true" when the connection between the passive and active sites is successful and replication is performing seamlessly. Is this API valid in these scenarios?

skumarp7, Nov 06 '24 07:11

Hi @ankitkala,

Any views on this?

skumarp7, Nov 13 '24 06:11

Looks like there might be a bug. Can you take a stab at why this might be happening?

ankitkala, Nov 15 '24 07:11

[Catch All Triage - 1, 2, 3, 4, 5]

dblock, Nov 25 '24 17:11

Hi @ankitkala,

I can look into it, but I would like to know whether this might actually impact the system at a high level. Could you check and let us know if you have any information on this? This is observed frequently in our deployment.

skumarp7, Nov 29 '24 07:11

Hi @dblock, @ankitkala,

Any update on the above ticket?

This API is very unreliable. Please help get this issue resolved as soon as possible. Currently I don't see any way to validate that the follower is connected to the leader when all replicated indices are in the PAUSED state.

skumarp7, Aug 12 '25 08:08

Hi @skumarp7, there is no one dedicated to supporting this repository (unless it's a critical release blocker). Feel free to debug the issue. If you have any ideas or hypotheses you want to discuss, one of the maintainers would definitely be able to help here.

ankitkala, Aug 12 '25 09:08