
[BUG] connect_transport_exception - connect_timeout[30s]

Open bjo004 opened this issue 2 years ago • 9 comments

Describe the bug: I have 2 OpenSearch clusters running in Kubernetes clusters in different regions (i.e. one in North Europe and the other in West Europe). When they're in the same region, they synchronise; when they're in different regions, I get connection timeouts.

To Reproduce Steps to reproduce the behavior:

  1. Set up a leader OpenSearch instance in a Kubernetes cluster in one region (e.g. North Europe)
  2. Set up a follower OpenSearch instance in a Kubernetes cluster in another region (e.g. West Europe)
  3. On the follower cluster, create a leader alias
  4. On the leader cluster, create a leader index
  5. On the follower OpenSearch cluster, start Cross Cluster Replication (rough sketches of the calls behind steps 3 and 5 follow the error output below)
  6. After a while, this error is produced:

     {
       "error" : {
         "root_cause" : [
           {
             "type" : "connect_transport_exception",
             "reason" : "[opensearch-cluster-master-0][10.16.106.232:9300] connect_timeout[30s]"
           }
         ],
         "type" : "connect_transport_exception",
         "reason" : "[opensearch-cluster-master-0][10.16.106.232:9300] connect_timeout[30s]"
       },
       "status" : 500
     }

Expected behavior This should be the output:

{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "opensearch-leader-alias",
  "leader_index" : "opensearch-leader-index-01",
  "follower_index" : "opensearch-rep-01",
  "syncing_details" : {
    "leader_checkpoint" : -1,
    "follower_checkpoint" : -1,
    "seq_no" : -1
  }
}

Plugins opensearch-alerting opensearch-anomaly-detection opensearch-asynchronous-search opensearch-cross-cluster-replication opensearch-index-management opensearch-job-scheduler opensearch-knn opensearch-ml opensearch-observability opensearch-performance-analyzer opensearch-reports-scheduler opensearch-security opensearch-sql

Host/Environment (please complete the following information):

  • OS: [Kubernetes 1.21.9 & Helm 3.8.0]
  • OpenSearch Version [1.3.1]

Additional context: Is there a way of increasing the timeouts?
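A minimal sketch, assuming the 30s in the error corresponds to the node-level transport.connect_timeout setting (an assumption; the thread does not confirm which timeout this is). It is a static setting in opensearch.yml, and raising it is unlikely to help if the leader's address is simply unreachable:

# opensearch.yml on the follower nodes (sketch; applied on restart)
transport.connect_timeout: 120s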

Any thoughts anyone? @rohin @saikaranam-amazon

bjo004 · Apr 10 '22

Looks like the leader cluster isn't reachable from the follower cluster. Can you please verify that:

  1. the IP is correct.
  2. Networking is configured correctly so that the leader's IP is reachable from the follower cluster's nodes (a quick check from the follower side is sketched below).
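A quick way to verify both points from the follower side, reusing the leader address from the error above (a sketch; a reachable transport port replies with "This is not an HTTP port", as shown later in this thread):

# From a follower node/pod: can we open a TCP connection to the leader's transport address?
curl --connect-timeout 5 10.16.106.232:9300

# On the follower cluster: does OpenSearch consider the remote connection established?
curl 'http://localhost:9200/_remote/info?pretty'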

ankitkala · Jul 08 '22

Hi @ankitkala ,

Thanks for your response.

  1. The IP is correct
  2. The leader and follower can reach each other over the network (i.e. I can telnet from either side).

Kind regards,

Bankole.

bjo004 · Jul 11 '22

Hello,

I've got a very similar problem (just within the same region), and it seems to me that OpenSearch is translating what is in the configuration into the pod IP, which is wrong because the follower cluster doesn't know the pod's internal IP.

On the follower I ran:

curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/settings?pretty' -d "
{
  \"persistent\": {
    \"cluster\": {
      \"remote\": {
        \"my-connection-alias\": {
          \"seeds\": [\"${_leader_ip}\"]
        }
      }
    }
  }
}"

where ${_leader_ip} is 10.135.0.4:30093:

  • 10.135.0.4 is my k8s internal node IP
  • 30093 is my nodePort svc configuration

So when I curl this address, everything is OK:

curl 10.135.0.4:30093
This is not an HTTP port

But when I start the sync, I get this:

curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/_plugins/_replication/follower-01/_start?pretty' -d '
{
  "leader_alias": "my-connection-alias",
  "leader_index": "leader-cluster-01",
  "use_roles": {
    "leader_cluster_role": "all_access",
    "follower_cluster_role": "all_access"
  }
}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "connect_transport_exception",
        "reason" : "[opensearch-cluster-master-0][10.244.0.53:9300] connect_timeout[30s]"
      }
    ],
    "type" : "connect_transport_exception",
    "reason" : "[opensearch-cluster-master-0][10.244.0.53:9300] connect_timeout[30s]"
  },
  "status" : 500
}

As you can see, the error message does not show what I configured earlier (10.135.0.4:30093); instead it has the leader cluster's internal pod IP, 10.244.0.53:9300 (the port changed as well), and the follower is not able to connect to that IP.

So it seems that:

  • the connection between the 2 clusters is working
  • but there is some strange translation of the IP address after connecting (a possible workaround is sketched below)

Are you able to help with that?
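This behaviour matches the default sniff mode for remote clusters: the follower uses the seed address only to discover the leader nodes' publish addresses, which in Kubernetes default to pod IPs. One possible workaround (a sketch only, not verified in this thread) is to make each leader node advertise an address the follower can actually reach, reusing the node IP and NodePort from above:

# opensearch.yml on the leader node (sketch): advertise the externally reachable address and port
network.publish_host: 10.135.0.4
transport.publish_port: 30093

The proxy-mode approach described further down in the thread is an alternative that avoids per-node publish addresses.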

mariusz-gomse-centra · Feb 09 '23

Hi, I have the same problem. Can anyone tell me how to configure this? I am using image version 1.3.8 and Helm chart version 2.10.0.

tu170591 · Feb 23 '23

Ping @mariusz-gomse-centra. I don't know if you have solved the above problem; can you guide me? Thank you.

tu170591 · Mar 07 '23

@tu170591 Unfortunately not, I didn't solve it. For now I have tested it only on DigitalOcean (you can see the description of my test above). I plan to test it on some other cloud provider (or locally with kind) to be 100% sure that the problem is somewhere in OpenSearch.

However, I'm now 95% sure that the problem is with the address translation. I tried to find it in the OpenSearch code, but without success so far :(

It would be great if anybody could support us here 😄

mariusz-gomse-centra · Mar 07 '23

Hi, I also set this up with a k8s model like this:

a follower cluster on GCP -> public IP of HAProxy -> leader on-premise behind a NodePort service. Same error as above:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "connect_transport_exception",
        "reason" : "[opensearch-leader-master-0][10.42.35.216:9300] connect_timeout[30s]"
      }
    ],
    "type" : "connect_transport_exception",
    "reason" : "[opensearch-leader-master-0][10.42.35.216:9300] connect_timeout[30s]"
  },
  "status" : 500
}

But I also tried running the leader cluster with discovery.type: single-node, and that succeeded. I think my network is OK; the problem may lie in the config values. I am using Helm chart version 2.2.0 and OpenSearch 1.3.8.

tu170591 · Mar 07 '23

@mariusz-gomse-centra @tu170591 Not sure if this is still relevant, but when a remote connection is configured, it uses sniff mode by default, which requires the cluster to be able to reach every seed node in the remote cluster; that is when it transforms the URL. If, when registering the connection, you set mode: proxy and specify proxy_address instead of seeds, it won't require connectivity to all of the nodes. As always, the OpenSearch documentation was very helpful, and I had to find it in the Elasticsearch docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/remote-clusters.html#proxy-mode
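A minimal sketch of what that proxy-mode registration could look like on the follower (alias name and proxy address reused from the earlier comments; not verified against this exact setup, and any seeds previously set for the alias may need to be removed first):

# On the follower: register the leader in proxy mode instead of the default sniff mode
curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/settings?pretty' -d '
{
  "persistent": {
    "cluster": {
      "remote": {
        "my-connection-alias": {
          "mode": "proxy",
          "proxy_address": "10.135.0.4:30093"
        }
      }
    }
  }
}'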

danskiyq · Sep 12 '23

@mariusz-gomse-centra I have exactly the same setup as you and am running into the same problem. I have been trying to resolve this for the last few days but am still unable to get past this "connect_transport_exception". If you were able to debug this, it would be great if you could provide any leads here.

syedsaadahmed · Mar 27 '24