
Weird behavior of TcpServerConnection

Open tessob opened this issue 3 years ago • 13 comments

Hello guys!

I need a clarification rather than help. Currently I'm trying to run an application in K8s and everything is good so far, but the logs look a little weird:

2021-04-08 13:51:59.890  INFO 1 --- [.IO.thread-in-0] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=10, /10.154.24.57:5701->/10.154.1.219:15106, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:51:59.890  INFO 1 --- [.IO.thread-in-1] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=11, /10.154.24.57:5701->/10.154.47.197:32686, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:51:59.890  INFO 1 --- [.IO.thread-in-2] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=12, /10.154.24.57:5701->/10.154.5.157:10454, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:00.911  INFO 1 --- [ration.thread-0] c.h.internal.cluster.ClusterService      : [10.154.24.57]:5701 [WORKFLOW] [4.2] 

Members {size:2, ver:2} [
	Member [10.154.24.57]:5701 - 94f6d56b-a72d-4d0a-82ee-a33e47053f78 this
	Member [10.154.92.185]:5701 - 1f66b061-f995-46c0-b886-dd0ee1a49dc3
]

2021-04-08 13:52:04.335  INFO 1 --- [.IO.thread-in-0] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=13, /10.154.24.57:5701->/10.154.6.102:16924, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:04.335  INFO 1 --- [.IO.thread-in-1] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=14, /10.154.24.57:5701->/10.154.19.176:5548, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:04.336  INFO 1 --- [.IO.thread-in-2] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=15, /10.154.24.57:5701->/10.154.5.124:47016, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:04.336  INFO 1 --- [.IO.thread-in-0] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=16, /10.154.24.57:5701->/10.154.1.219:52814, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:09.889  INFO 1 --- [.IO.thread-in-2] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=18, /10.154.24.57:5701->/10.154.19.176:32638, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:09.889  INFO 1 --- [.IO.thread-in-1] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=17, /10.154.24.57:5701->/10.154.29.156:1518, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side
2021-04-08 13:52:09.889  INFO 1 --- [.IO.thread-in-0] c.h.i.server.tcp.TcpServerConnection     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Connection[id=19, /10.154.24.57:5701->/10.154.47.197:32712, qualifier=null, endpoint=null, alive=false, connectionType=NONE, planeIndex=-1] closed. Reason: Connection closed by the other side

It seems like the current node is constantly trying to establish connections with other pods, using a wide variety of seemingly irrelevant ports. Is this normal behavior or not? Is there a way to restrict it somehow?

My configs:

  @Bean
  public Config hazelcastConfig() {
    Config cfg = new Config();
    cfg.setClusterName("WORKFLOW");
    cfg.getNetworkConfig().setPort(5701).setPortAutoIncrement(false);
    cfg.setSerializationConfig(serializationConfig());
    cfg.setMapConfigs(getMapConfigs());
    cfg.setProperty("hazelcast.logging.type", "slf4j");

    if (isApplicationInsideKube()) {
      JoinConfig joinConfig = cfg.getNetworkConfig().getJoin();
      joinConfig.getAutoDetectionConfig().setEnabled(false);
      joinConfig.getTcpIpConfig().setEnabled(false);
      joinConfig.getMulticastConfig().setEnabled(false);
      joinConfig.getKubernetesConfig().setEnabled(true)
          .setProperty("service-name", "workflow-service")
          .setProperty("service-port", "5701");
      log.info(joinConfig.toString());
    }

    log.info(cfg.toString());
    return cfg;
  }
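As an aside on restricting ports: the changing port numbers in the close messages appear to be ephemeral source ports chosen by the connecting side, not listening ports. If constraining the ports this member uses for outgoing connections is desired, Hazelcast's `NetworkConfig` exposes an outbound-ports setting. A minimal sketch, assuming the 4.x API; the port range below is an arbitrary example value:

```java
import com.hazelcast.config.Config;

// Sketch: pin the local ports Hazelcast may use for outgoing connections.
// The range "33000-33100" is an example, not a recommendation.
Config cfg = new Config();
cfg.getNetworkConfig()
   .setPort(5701)
   .setPortAutoIncrement(false)
   .addOutboundPortDefinition("33000-33100");
```

Note this only constrains this member's side; remote peers still pick their own ephemeral source ports.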

Best regards & Thank you in advance,

tessob avatar Apr 08 '21 14:04 tessob

@tessob thanks for submitting the issue. Could you please tell us more about your k8s environment? Is there any proxy or service mesh setup?

alparslanavci avatar Apr 08 '21 14:04 alparslanavci

@alparslanavci no, we have no proxy/service mesh. Our Service manifest is similar to this one. There are just two pods in a separate namespace right now; the cluster itself is bigger, however.

Our JoinConfig on application start looks like this:

JoinConfig{multicastConfig=MulticastConfig [enabled=false, multicastGroup=224.2.2.3, multicastPort=54327, multicastTimeToLive=32, multicastTimeoutSeconds=2, trustedInterfaces=[], loopbackModeEnabled=true], tcpIpConfig=TcpIpConfig [enabled=false, connectionTimeoutSeconds=5, members=[], requiredMember=null], awsConfig=AliasedDiscoveryConfig{tag='aws', enabled=false, usePublicIp=false, properties={}}, gcpConfig=AliasedDiscoveryConfig{tag='gcp', enabled=false, usePublicIp=false, properties={}}, azureConfig=AliasedDiscoveryConfig{tag='azure', enabled=false, usePublicIp=false, properties={}}, kubernetesConfig=AliasedDiscoveryConfig{tag='kubernetes', enabled=true, usePublicIp=false, properties={service-name=workflow-service, service-port=5701}}, eurekaConfig=AliasedDiscoveryConfig{tag='eureka', enabled=false, usePublicIp=false, properties={}}, discoveryConfig=DiscoveryConfig{discoveryStrategyConfigs=[], discoveryServiceProvider=null, nodeFilter=null, nodeFilterClass='null'}, autoDetectionConfig=AutoDetectionConfig{enabled=false}}

and Discovery properties like this:

Kubernetes Discovery properties: { service-dns: null, service-dns-timeout: 5, service-name: workflow-service, service-port: 5701, service-label: null, service-label-value: true, namespace: XXXXX, pod-label: null, pod-label-value: null, resolve-not-ready-addresses: true, use-node-name-as-external-address: false, kubernetes-api-retries: 3, kubernetes-master: https://kubernetes.default.svc}

and there are a few additional messages that I don't really know how to interpret:

2021-04-08 13:51:46.207  INFO 1 --- [           main] c.h.s.d.integration.DiscoveryService     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Kubernetes Discovery activated with mode: KUBERNETES_API
2021-04-08 13:51:46.499  INFO 1 --- [           main] com.hazelcast.instance.impl.Node         : [10.154.24.57]:5701 [WORKFLOW] [4.2] Using Discovery SPI
2021-04-08 13:51:47.783  INFO 1 --- [           main] com.hazelcast.core.LifecycleService      : [10.154.24.57]:5701 [WORKFLOW] [4.2] [10.154.24.57]:5701 is STARTING
2021-04-08 13:51:48.603  INFO 1 --- [           main] c.h.s.d.integration.DiscoveryService     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Kubernetes plugin discovered availability zone: eu-XXXXXX
2021-04-08 13:51:48.771  INFO 1 --- [           main] c.h.s.d.integration.DiscoveryService     : [10.154.24.57]:5701 [WORKFLOW] [4.2] Kubernetes plugin discovered node name: ip-XXXXXXX.compute.internal
2021-04-08 13:51:48.990  WARN 1 --- [           main] c.hazelcast.kubernetes.KubernetesClient  : Cannot fetch public IPs of Hazelcast Member PODs, you won't be able to use Hazelcast Smart Client from outside of the Kubernetes network

I hope this is helpful. BR,

tessob avatar Apr 08 '21 15:04 tessob

Short update,

I checked this morning's logs, which show 20K attempts to find other nodes, and then took a look at the source code to understand where the client got this list of irrelevant endpoints.

I reproduced this API call manually with the following curl request:

kubectl proxy --port=0 &
curl http://localhost:36891/api/v1/namespaces/XXXXXXXXX/endpoints/workflow-service

and got this output:

{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "workflow-service",
    "namespace": "XXXXXXXXX",
    "selfLink": "/api/v1/namespaces/XXXXXXXXX/endpoints/workflow-service",
    "uid": "64ffb585-e68a-41bc-97bf-fb54e8f3fdc1",
    "resourceVersion": "113176905",
    "creationTimestamp": "2021-04-08T11:01:06Z",
    "labels": {
      "fluxcd.io/sync-gc-mark": "sha256.VJun3ud720pxL2mM3fjAbo3OoiEfk8bQDeZA8wZQx7c"
    }
  },
  "subsets": [
    {
      "addresses": [
        {
          "ip": "10.154.24.57",
          "nodeName": "XXXXXXXXX.compute.internal",
          "targetRef": {
            "kind": "Pod",
            "namespace": "XXXXXXXXX",
            "name": "workflow-tracker-service-5769dc77cf-9rnhp",
            "uid": "7c84a0c5-7d9b-4cb9-b1d8-d6bb0d08979e",
            "resourceVersion": "113176843"
          }
        },
        {
          "ip": "10.154.92.185",
          "nodeName": "XXXXXXXXX.compute.internal",
          "targetRef": {
            "kind": "Pod",
            "namespace": "XXXXXXXXX",
            "name": "workflow-tracker-service-5769dc77cf-zf9pd",
            "uid": "29737169-ff5e-49ba-902f-1756dc4e33db",
            "resourceVersion": "113176895"
          }
        }
      ],
      "ports": [
        {
          "name": "workflow-s",
          "port": 8080,
          "protocol": "TCP"
        },
        {
          "name": "hazelcast",
          "port": 5701,
          "protocol": "TCP"
        }
      ]
    }
  ]
}

So, as I said, I have just 2 pods relevant to the Hazelcast cluster. Hence I don't really understand where the IPs that TcpServerConnection is trying to connect to come from. Looks like a bug.

Best regards,

tessob avatar Apr 09 '21 07:04 tessob

It's hard to tell where Hazelcast tries to connect to. Some more questions:

  1. If you follow exactly the code in this guide, do you experience the same?
  2. If you deploy your application on some widely-used Kubernetes cluster (like Docker Desktop, or GKE, or EKS), do you experience the same?

leszko avatar Apr 09 '21 09:04 leszko

Hi @leszko,

  1. We tested it exactly as in the guide, in the same k8s cluster and namespace, and everything is OK. Logs:
########################################
# JAVA_OPTS=-Djava.net.preferIPv4Stack=true -Dhazelcast.logging.type=log4j2 -Dlog4j.configurationFile=/opt/hazelcast/log4j2.properties -XX:MaxRAMPercentage=80.0 -XX:MaxGCPauseMillis=5 --add-modules java.se --add-exports java.base/jdk.internal.ref=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED -Dhazelcast.config=/data/hazelcast/hazelcast.yaml -DserviceName=hazelcast -Dnamespace=XXXXXXXXX -Dhazelcast.shutdownhook.policy=GRACEFUL -Dhazelcast.shutdownhook.enabled=true -Dhazelcast.graceful.shutdown.max.wait=600 
# CLASSPATH=/opt/hazelcast/*:/opt/hazelcast/lib/*
# starting now....
########################################
+ exec java -server -Djava.net.preferIPv4Stack=true -Dhazelcast.logging.type=log4j2 -Dlog4j.configurationFile=/opt/hazelcast/log4j2.properties -XX:MaxRAMPercentage=80.0 -XX:MaxGCPauseMillis=5 --add-modules java.se --add-exports java.base/jdk.internal.ref=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED -Dhazelcast.config=/data/hazelcast/hazelcast.yaml -DserviceName=hazelcast -Dnamespace=XXXXXXXXX -Dhazelcast.shutdownhook.policy=GRACEFUL -Dhazelcast.shutdownhook.enabled=true -Dhazelcast.graceful.shutdown.max.wait=600 com.hazelcast.core.server.HazelcastMemberStarter
2021-04-09 10:39:31,214 [ INFO] [main] [c.h.i.c.AbstractConfigLocator]: Loading configuration '/data/hazelcast/hazelcast.yaml' from System property 'hazelcast.config'
2021-04-09 10:39:31,219 [ INFO] [main] [c.h.i.c.AbstractConfigLocator]: Using configuration file at /data/hazelcast/hazelcast.yaml
2021-04-09 10:39:31,360 [ INFO] [main] [c.h.system]: [10.154.60.11]:5701 [dev] [4.2] Hazelcast 4.2 (20210324 - 405cfd1) starting at [10.154.60.11]:5701
2021-04-09 10:39:31,749 [ INFO] [main] [c.h.s.d.i.DiscoveryService]: [10.154.60.11]:5701 [dev] [4.2] Kubernetes Discovery properties: { service-dns: null, service-dns-timeout: 5, service-name: hazelcast, service-port: 0, service-label: null, service-label-value: true, namespace: XXXXXXXXX, pod-label: null, pod-label-value: null, resolve-not-ready-addresses: true, use-node-name-as-external-address: false, kubernetes-api-retries: 3, kubernetes-master: https://kubernetes.default.svc}
2021-04-09 10:39:31,752 [ INFO] [main] [c.h.s.d.i.DiscoveryService]: [10.154.60.11]:5701 [dev] [4.2] Kubernetes Discovery activated with mode: KUBERNETES_API
2021-04-09 10:39:31,807 [ INFO] [main] [c.h.i.i.Node]: [10.154.60.11]:5701 [dev] [4.2] Using Discovery SPI
2021-04-09 10:39:31,810 [ WARN] [main] [c.h.c.CPSubsystem]: [10.154.60.11]:5701 [dev] [4.2] CP Subsystem is not enabled. CP data structures will operate in UNSAFE mode! Please note that UNSAFE mode will not provide strong consistency guarantees.
2021-04-09 10:39:32,024 [ INFO] [main] [c.h.i.d.Diagnostics]: [10.154.60.11]:5701 [dev] [4.2] Diagnostics disabled. To enable add -Dhazelcast.diagnostics.enabled=true to the JVM arguments.
2021-04-09 10:39:32,035 [ INFO] [main] [c.h.c.LifecycleService]: [10.154.60.11]:5701 [dev] [4.2] [10.154.60.11]:5701 is STARTING
2021-04-09 10:39:32,448 [ INFO] [main] [c.h.s.d.i.DiscoveryService]: [10.154.60.11]:5701 [dev] [4.2] Kubernetes plugin discovered availability zone: XXXXXXXXX
2021-04-09 10:39:32,500 [ INFO] [main] [c.h.s.d.i.DiscoveryService]: [10.154.60.11]:5701 [dev] [4.2] Kubernetes plugin discovered node name: XXXXXXXXX.compute.internal
2021-04-09 10:39:32,616 [ WARN] [main] [c.h.k.KubernetesClient]: Cannot fetch public IPs of Hazelcast Member PODs, you won't be able to use Hazelcast Smart Client from outside of the Kubernetes network
2021-04-09 10:39:37,600 [ INFO] [main] [c.h.i.c.ClusterService]: [10.154.60.11]:5701 [dev] [4.2] 

Members {size:1, ver:1} [
	Member [10.154.60.11]:5701 - 6e43ac32-e6bf-4cfb-81e0-a865d6f986b1 this
]

2021-04-09 10:39:37,611 [ INFO] [main] [c.h.c.LifecycleService]: [10.154.60.11]:5701 [dev] [4.2] [10.154.60.11]:5701 is STARTED
2021-04-09 10:40:22,454 [ INFO] [hz.boring_ganguly.IO.thread-in-1] [c.h.i.s.t.TcpServerConnection]: [10.154.60.11]:5701 [dev] [4.2] Initialized new cluster connection between /10.154.60.11:5701 and /10.154.107.219:57097
2021-04-09 10:40:28,455 [ INFO] [hz.boring_ganguly.priority-generic-operation.thread-0] [c.h.i.c.ClusterService]: [10.154.60.11]:5701 [dev] [4.2] 

Members {size:2, ver:2} [
	Member [10.154.60.11]:5701 - 6e43ac32-e6bf-4cfb-81e0-a865d6f986b1 this
	Member [10.154.107.219]:5701 - 3140ed16-8832-4e69-a875-d8da79b091eb
]

2021-04-09 10:40:56,850 [ INFO] [hz.boring_ganguly.IO.thread-in-0] [c.h.i.s.t.TcpServerConnection]: [10.154.60.11]:5701 [dev] [4.2] Initialized new cluster connection between /10.154.60.11:5701 and /10.154.101.83:44681
2021-04-09 10:41:02,834 [ INFO] [hz.boring_ganguly.priority-generic-operation.thread-0] [c.h.i.c.ClusterService]: [10.154.60.11]:5701 [dev] [4.2] 

Members {size:3, ver:3} [
	Member [10.154.60.11]:5701 - 6e43ac32-e6bf-4cfb-81e0-a865d6f986b1 this
	Member [10.154.107.219]:5701 - 3140ed16-8832-4e69-a875-d8da79b091eb
	Member [10.154.101.83]:5701 - 21669cbd-71fc-4a70-8727-b6538ddd3d23
]
  2. We are in AWS EKS

In my dependencies I have only these two that can be related to Hazelcast:

    <dependency>
      <artifactId>spring-boot-starter-cache</artifactId>
      <groupId>org.springframework.boot</groupId>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.hazelcast/hazelcast-all -->
    <dependency>
      <groupId>com.hazelcast</groupId>
      <artifactId>hazelcast-all</artifactId>
      <version>4.2</version>
    </dependency>

My config is in the very first message.

Thanks,

tessob avatar Apr 09 '21 11:04 tessob

Short update,

I have tested my application in minikube locally and everything is OK. So the root cause of this behavior is most likely in the K8s cluster and/or the deployment manifests. However, it's still unclear to me why Hazelcast tries to establish connections with nodes that are irrelevant to the service-name.

tessob avatar Apr 10 '21 11:04 tessob

It may be the phone home call. Could you try disabling phone home and check if it helps?
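Disabling phone home can be done via a Hazelcast property; a minimal sketch for the embedded setup, assuming the standard `hazelcast.phone.home.enabled` switch:

```java
import com.hazelcast.config.Config;

// Sketch: turn off the phone-home call in the embedded configuration
Config cfg = new Config();
cfg.setProperty("hazelcast.phone.home.enabled", "false");
// equivalently, as a JVM flag: -Dhazelcast.phone.home.enabled=false
```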

leszko avatar Apr 14 '21 09:04 leszko

After I disabled it, I still see the same behavior in the logs, so there must be another cause.

tessob avatar Apr 15 '21 07:04 tessob

Hi @tessob , were you able to find out the cause? I'm currently experiencing the same issue. I'm also in AWS EKS

fallthough avatar Nov 19 '21 07:11 fallthough

I'm also experiencing the same problem and it is also in AWS EKS.

fatih-iver avatar Dec 24 '22 17:12 fatih-iver

Hi @tessob , were you able to find out the cause? I'm currently experiencing the same issue. I'm also in AWS EKS

Sorry for the late reply )) I have migrated to Infinispan.

tessob avatar Dec 25 '22 09:12 tessob

Hey folks,

I am facing the same issue. We have a Java Spring app with embedded Hazelcast, running in Kubernetes. For member lookup we are using DNS Lookup discovery mode; member creation looks good, but Hazelcast tries to connect to some random IPs and gives the warning below.

java.lang.IllegalStateException: Unknown protocol: OPT
java.lang.IllegalStateException: TLS handshake header detected, but plain protocol header was expected.

@tessob Did you find the root cause of this? My issue is similar to what you posted.
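For reference, the DNS Lookup discovery mode mentioned here is configured through the `service-dns` property rather than `service-name`. A minimal sketch; the DNS name below is a placeholder, assuming a headless service in the target namespace:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;

// Sketch: Kubernetes discovery via DNS lookup instead of the Kubernetes API.
// "my-service.my-namespace.svc.cluster.local" is a placeholder for a
// headless service pointing at the Hazelcast member pods.
Config cfg = new Config();
JoinConfig join = cfg.getNetworkConfig().getJoin();
join.getMulticastConfig().setEnabled(false);
join.getKubernetesConfig().setEnabled(true)
    .setProperty("service-dns", "my-service.my-namespace.svc.cluster.local");
```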

AshishWat avatar Dec 13 '23 06:12 AshishWat

@AshishWat No, I just gave up on Hazelcast after looking through the code and moved to Infinispan. So far so good.

tessob avatar Jan 23 '24 17:01 tessob