cross-cluster-replication
[BUG] connect_transport_exception - connect_timeout[30s]
Describe the bug
I have 2 OpenSearch clusters running in Kubernetes clusters in different regions (i.e. one in North Europe and the other in West Europe). When they're in the same region, they synchronise. When they're in different regions, I get connection timeouts.
To Reproduce
Steps to reproduce the behavior:
- Set up a leader OpenSearch instance in a Kubernetes cluster in one region (e.g. North Europe)
- Set up a follower OpenSearch instance in a Kubernetes cluster in another region (e.g. West Europe)
- On the follower cluster, create a leader alias
- On the leader cluster, create a leader index
- On the follower OpenSearch cluster, start Cross Cluster Replication (example calls for steps 3 to 5 are sketched after the error below)
- After a while, this error is produced:
- { "error" : { "root_cause" : [ { "type" : "connect_transport_exception", "reason" : "[opensearch-cluster-master-0][10.16.106.232:9300] connect_timeout[30s]" } ], "type" : "connect_transport_exception", "reason" : "[opensearch-cluster-master-0][10.16.106.232:9300] connect_timeout[30s]" }, "status" : 500
Expected behavior
This should be the output:
{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "opensearch-leader-alias",
  "leader_index" : "opensearch-leader-index-01",
  "follower_index" : "opensearch-rep-01",
  "syncing_details" : {
    "leader_checkpoint" : -1,
    "follower_checkpoint" : -1,
    "seq_no" : -1
  }
}
Plugins
opensearch-alerting, opensearch-anomaly-detection, opensearch-asynchronous-search, opensearch-cross-cluster-replication, opensearch-index-management, opensearch-job-scheduler, opensearch-knn, opensearch-ml, opensearch-observability, opensearch-performance-analyzer, opensearch-reports-scheduler, opensearch-security, opensearch-sql
Host/Environment (please complete the following information):
- OS: [Kubernetes 1.21.9 & Helm 3.8.0]
- OpenSearch Version [1.3.1]
Additional context
Is there a way of increasing the timeouts?
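The only related setting I've found so far is transport.connect_timeout (the 30s in the error looks like its default), and as far as I can tell it is a static setting, so it can't be changed through the cluster settings API. A sketch of what I mean (untested; the path assumes the default OpenSearch image layout):

# Untested: raise the transport connect timeout on the follower nodes via opensearch.yml
# (e.g. through the Helm chart's config section); this is not a dynamic setting, so it
# cannot go through the _cluster/settings API.
echo 'transport.connect_timeout: 120s' >> /usr/share/opensearch/config/opensearch.yml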
Any thoughts anyone? @rohin @saikaranam-amazon
Looks like the leader cluster isn't reachable from the follower cluster. Can you please verify that
- the IP is correct.
- networking is configured correctly so that the leader's IP is reachable from the follower cluster's nodes (a quick check like the one sketched below).
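For example, from one of the follower's nodes or pods (IP and port taken from the error message above):

# Check TCP connectivity to the leader's transport endpoint.
nc -vz 10.16.106.232 9300
# Or, if only curl is available in the container:
curl -v telnet://10.16.106.232:9300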
Hi @ankitkala ,
Thanks for your response.
- The IP is correct
- The network between leader and follower is reachable (i.e. I can telnet from either side).
Kind regards,
Bankole.
Hello,
I've got a very similar problem (just in the same region), and it seems to me that OpenSearch is translating what is in the configuration into the pod IP, which is wrong because the follower cluster doesn't know the pod's internal IP.
On the follower I ran:
curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/settings?pretty' -d "
{
  \"persistent\": {
    \"cluster\": {
      \"remote\": {
        \"my-connection-alias\": {
          \"seeds\": [\"${_leader_ip}\"]
        }
      }
    }
  }
}"
where ${_leader_ip} is 10.135.0.4:30093:
- 10.135.0.4 is my k8s internal node IP
- 30093 is my NodePort svc configuration
So when I curl this address, everything is OK:
curl 10.135.0.4:30093
This is not an HTTP port
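The state of the remote connection can also be checked from the follower (just the call here, I'm omitting my output):

# On the follower: list the configured remote clusters, their seeds, and whether
# the follower currently considers them connected.
curl 'http://localhost:9200/_remote/info?pretty'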
But when I start the sync, I get this:
curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/_plugins/_replication/follower-01/_start?pretty' -d '
{
  "leader_alias": "my-connection-alias",
  "leader_index": "leader-cluster-01",
  "use_roles": {
    "leader_cluster_role": "all_access",
    "follower_cluster_role": "all_access"
  }
}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "connect_transport_exception",
        "reason" : "[opensearch-cluster-master-0][10.244.0.53:9300] connect_timeout[30s]"
      }
    ],
    "type" : "connect_transport_exception",
    "reason" : "[opensearch-cluster-master-0][10.244.0.53:9300] connect_timeout[30s]"
  },
  "status" : 500
}
As you can see in the error message, it is not using what I configured earlier (10.135.0.4:30093) but the leader cluster pod's internal IP, 10.244.0.53:9300 (the port changed as well), and the follower is not able to connect to that IP.
So it seems that:
- the connection between the 2 clusters is working
- but there is some strange translation of the IP address after connecting (one idea I want to try is sketched below)
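One untested idea (the values are just the node IP and NodePort from above, and it may also affect how the node advertises itself inside its own cluster): make the leader nodes publish an address the follower can actually reach, since the follower seems to use whatever transport address the leader nodes advertise.

# Untested: have the leader nodes advertise the externally reachable address
# (k8s node IP + transport NodePort) instead of the pod IP. These are static settings,
# so they go into opensearch.yml (e.g. via the Helm chart's config section);
# the path assumes the default OpenSearch image layout.
cat >> /usr/share/opensearch/config/opensearch.yml <<'EOF'
transport.publish_host: 10.135.0.4
transport.publish_port: 30093
EOF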
Are you able to help with that?
Hi, I have the same problem. Can anyone tell me how to configure this? I am using image version 1.3.8 and Helm chart version 2.10.0.
Ping @mariusz-gomse-centra, I don't know if you have solved the above problem; can you guide me? Thank you.
@tu170591 Unfortunately not, I didn't solve it. For now I've tested it only on DigitalOcean (you can see the description of my test above). I plan to test it on some other cloud provider (or locally with kind) to be 100% sure that the problem is somewhere in OpenSearch.
However, I'm now 95% sure that the problem is with the address translation. I tried to find it in the OpenSearch code, but without success so far :(
It would be great if anybody could support us here 😄
Hi, I also set this up in Kubernetes like this:
a follower cluster on GCP -> public IP of HAProxy -> leader on an on-premise NodePort service, and I get the same error as above:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "connect_transport_exception",
        "reason" : "[opensearch-leader-master-0][10.42.35.216:9300] connect_timeout[30s]"
      }
    ],
    "type" : "connect_transport_exception",
    "reason" : "[opensearch-leader-master-0][10.42.35.216:9300] connect_timeout[30s]"
  },
  "status" : 500
}
But I also tried the leader cluster in discovery.type: single-node mode, and that succeeds. I think my network is OK; the problem may lie in the config values. I am using Helm chart version 2.2.0 and OpenSearch 1.3.8.
@mariusz-gomse-centra @tu170591 Not sure if it's still relevant, but when a remote connection is configured it uses sniff mode by default, which requires the cluster to be able to reach every seed node in the remote cluster, and that's when it transforms the address. But if, when registering the remote, you set mode: proxy and specify proxy_address instead of seeds, it won't require connectivity to all of the nodes. As always the OpenSearch documentation was very helpful and I had to find it in the Elasticsearch one: https://www.elastic.co/guide/en/elasticsearch/reference/current/remote-clusters.html#proxy-mode
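Roughly, that means registering the remote on the follower like this (a sketch; the alias is the one from the earlier example, and the proxy address should be a transport endpoint the follower can actually reach, e.g. the node IP + NodePort from above):

# On the follower: register the leader in proxy mode instead of the default sniff mode.
# With proxy mode the follower only needs to reach proxy_address, not every leader
# node's published transport address.
curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/settings?pretty' -d '
{
  "persistent": {
    "cluster": {
      "remote": {
        "my-connection-alias": {
          "mode": "proxy",
          "proxy_address": "10.135.0.4:30093"
        }
      }
    }
  }
}'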
@mariusz-gomse-centra I have exactly the same setup as you and am running into the same problem. I have been trying to resolve this for the last few days but am still unable to get past this "connect_transport_exception". If you have managed to debug this, please share any leads here.