Packet drops after restarting Temporal Cluster

Open andreclaro opened this issue 1 year ago • 5 comments

Expected Behavior

We have been observing network packet drops (cause: stale_or_unroutable_ip) whenever we restart our Temporal Clusters (on new deployments, releases, etc.).

Drops: image

The old IP addresses of the Temporal services are still in the cluster_membership table:

temporal=> SELECT * from cluster_membership;
 membership_partition |              host_id               |  rpc_address  | rpc_port | role |       session_start        |       last_heartbeat       |       record_expiry
----------------------+------------------------------------+---------------+----------+------+----------------------------+----------------------------+----------------------------
                    0 | \x27f0726cc4f511ee8c4a7e886be421e8 | 172.16.36.211 |     6933 |    1 | 2024-02-06 13:39:31.178854 | 2024-02-06 13:54:17.480887 | 2024-02-08 13:54:17.480887
                    0 | \x406ee616c4f511ee9c2c4eff8ed03845 | 172.16.34.153 |     6934 |    2 | 2024-02-06 13:40:12.256704 | 2024-02-06 13:54:18.532144 | 2024-02-08 13:54:18.532144
                    0 | \x9859b450c1a611eeb02a968626816f6d | 172.16.36.100 |     6939 |    4 | 2024-02-02 08:39:36.020235 | 2024-02-06 13:39:21.505523 | 2024-02-08 13:39:21.505523
                    0 | \x8e3abc89c1a611ee860fba5a10a03e3e | 172.16.34.175 |     6933 |    1 | 2024-02-02 08:39:19.077945 | 2024-02-06 13:39:21.516706 | 2024-02-08 13:39:21.516706
                    0 | \x2737ad64c4f511ee9e3dea8b31a7ca50 | 172.16.37.76  |     6935 |    3 | 2024-02-06 13:39:29.943083 | 2024-02-06 13:54:19.267353 | 2024-02-08 13:54:19.267353
                    0 | \x8bf5d329c1a611eeab2d1663815337d5 | 172.16.25.71  |     6934 |    2 | 2024-02-02 08:39:15.230349 | 2024-02-06 13:39:22.505874 | 2024-02-08 13:39:22.505874
                    0 | \x2767ad39c4f511eeb5b4e2f2f4ca461b | 172.16.33.254 |     6934 |    2 | 2024-02-06 13:39:30.276643 | 2024-02-06 13:54:19.615724 | 2024-02-08 13:54:19.615724
                    0 | \x8cf287a6c1a611eea8ee2630adf79dee | 172.16.25.155 |     6933 |    1 | 2024-02-02 08:39:16.920261 | 2024-02-06 13:39:24.229088 | 2024-02-08 13:39:24.229088
                    0 | \x8d2695a8c1a611eeaf43ba9bd9d72b37 | 172.16.34.249 |     6936 |    5 | 2024-02-02 08:39:17.234805 | 2024-02-06 13:39:24.928443 | 2024-02-08 13:39:24.928443
                    0 | \x9b785481c1a611ee8e1b626cb429a928 | 172.16.37.235 |     6933 |    1 | 2024-02-02 08:39:41.263783 | 2024-02-06 13:39:25.923658 | 2024-02-08 13:39:25.923658
                    0 | \x26343fa1c4f511eebb64d26e62cd671b | 172.16.23.217 |     6933 |    1 | 2024-02-06 13:39:28.256783 | 2024-02-06 13:54:19.640885 | 2024-02-08 13:54:19.640885
                    0 | \x2ec9f65fc4f511eebb0c42b810af6adf | 172.16.10.37  |     6933 |    1 | 2024-02-06 13:39:42.645453 | 2024-02-06 13:54:19.924845 | 2024-02-08 13:54:19.924845
                    0 | \x3546754ac4f511eeb0df5e406aef04b7 | 172.16.9.44   |     6939 |    4 | 2024-02-06 13:39:53.52417  | 2024-02-06 13:54:21.815246 | 2024-02-08 13:54:21.815246
                    0 | \x9b761baac1a611ee86861ee633e6ff6e | 172.16.37.57  |     6934 |    2 | 2024-02-02 08:39:41.249319 | 2024-02-06 13:40:03.892467 | 2024-02-08 13:40:03.892467
                    0 | \x2bd60bffc4f511ee8038f2a15c53c361 | 172.16.33.218 |     6933 |    1 | 2024-02-06 13:39:37.706043 | 2024-02-06 13:54:22.096942 | 2024-02-08 13:54:22.096942
                    0 | \x2e034424c4f511ee8b8422fe0b9db541 | 172.16.2.98   |     6939 |    4 | 2024-02-06 13:39:41.342732 | 2024-02-06 13:54:22.630096 | 2024-02-08 13:54:22.630096
                    0 | \x264e5763c4f511eeb6516a80b6635bad | 172.16.8.104  |     6936 |    5 | 2024-02-06 13:39:28.414856 | 2024-02-06 13:54:23.857224 | 2024-02-08 13:54:23.857224
                    0 | \x8d48423fc1a611eeb3267a0ba9253295 | 172.16.33.56  |     6936 |    5 | 2024-02-02 08:39:17.478502 | 2024-02-06 13:39:28.570702 | 2024-02-08 13:39:28.570702
                    0 | \x2a052748c4f511eea9d3be82b1fb36b9 | 172.16.23.57  |     6936 |    5 | 2024-02-06 13:39:34.658886 | 2024-02-06 13:54:25.05269  | 2024-02-08 13:54:25.05269
                    0 | \x30ba0b59c4f511ee9640622359ce980d | 172.16.36.75  |     6935 |    3 | 2024-02-06 13:39:45.89489  | 2024-02-06 13:54:25.22052  | 2024-02-08 13:54:25.22052
                    0 | \x2ef49a0cc4f511ee8241c67c007be2c5 | 172.16.11.190 |     6935 |    3 | 2024-02-06 13:39:42.933887 | 2024-02-06 13:54:25.330191 | 2024-02-08 13:54:25.330191
                    0 | \x44663929c4f511eea6d4c6614722b5d8 | 172.16.37.45  |     6934 |    2 | 2024-02-06 13:40:18.911727 | 2024-02-06 13:54:26.233305 | 2024-02-08 13:54:26.233305
                    0 | \x8fb19965c1a611ee8a3daac84ef7ce0e | 172.16.33.10  |     6935 |    3 | 2024-02-02 08:39:21.508819 | 2024-02-06 13:39:31.954545 | 2024-02-08 13:39:31.954545
                    0 | \x903b6e8bc1a611ee95c81e7a8687d124 | 172.16.25.182 |     6939 |    4 | 2024-02-02 08:39:22.394533 | 2024-02-06 13:39:36.146894 | 2024-02-08 13:39:36.146894
                    0 | \x9b7215a2c1a611ee95cb16a506174482 | 172.16.37.81  |     6939 |    4 | 2024-02-02 08:39:41.208757 | 2024-02-06 13:39:40.488594 | 2024-02-08 13:39:40.488594
                    0 | \x8f9e505bc1a611ee86e66e9c4e8dfa7b | 172.16.34.121 |     6935 |    3 | 2024-02-02 08:39:21.388292 | 2024-02-06 13:39:41.75921  | 2024-02-08 13:39:41.75921
                    0 | \x90305675c1a611eeaf185a86e9283977 | 172.16.25.153 |     6935 |    3 | 2024-02-02 08:39:22.317386 | 2024-02-06 13:39:42.811889 | 2024-02-08 13:39:42.811889
                    0 | \x8c09ae21c1a611ee90916e2ec790e4bc | 172.16.33.9   |     6934 |    2 | 2024-02-02 08:39:15.388081 | 2024-02-06 13:39:04.497511 | 2024-02-08 13:39:04.497511
                    0 | \x8c135f47c1a611ee8f015a97772859d3 | 172.16.33.97  |     6933 |    1 | 2024-02-02 08:39:15.460704 | 2024-02-06 13:39:13.370839 | 2024-02-08 13:39:13.370839
                    0 | \x8d28167fc1a611eeb2f7f225c06c088b | 172.16.34.92  |     6934 |    2 | 2024-02-02 08:39:17.259959 | 2024-02-06 13:39:46.189244 | 2024-02-08 13:39:46.189244
                    0 | \x2633938fc4f511eebef86e735020f6ec | 172.16.23.84  |     6939 |    4 | 2024-02-06 13:39:28.253228 | 2024-02-06 13:54:15.639278 | 2024-02-08 13:54:15.639278
                    0 | \x34a909b4c4f511eebde066f106d6cd31 | 172.16.25.55  |     6934 |    2 | 2024-02-06 13:39:52.50008  | 2024-02-06 13:54:15.744222 | 2024-02-08 13:54:15.744222
(32 rows)
temporal-prod-operator-6b9bf85f75-mp4h4:/$ tctl adm cl d
{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-php": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0",
    "temporal-typescript": "\u003c2.0.0",
    "temporal-ui": "\u003c3.0.0"
  },
  "serverVersion": "1.22.4",
  "membershipInfo": {
    "currentHost": {
      "identity": "172.16.10.37:7233"
    },
    "reachableMembers": [
      "172.16.25.55:6934",
      "172.16.8.104:6936",
      "172.16.37.76:6935",
      "172.16.9.44:6939",
      "172.16.23.57:6936",
      "172.16.37.45:6934",
      "172.16.2.98:6939",
      "172.16.36.75:6935",
      "172.16.10.37:6933",
      "172.16.23.217:6933",
      "172.16.33.218:6933",
      "172.16.34.153:6934",
      "172.16.36.211:6933",
      "172.16.11.190:6935",
      "172.16.23.84:6939",
      "172.16.33.254:6934"
    ],
    "rings": [
      {
        "role": "frontend",
        "memberCount": 4,
        "members": [
          {
            "identity": "172.16.23.217:7233"
          },
          {
            "identity": "172.16.36.211:7233"
          },
          {
            "identity": "172.16.33.218:7233"
          },
          {
            "identity": "172.16.10.37:7233"
          }
        ]
      },
      {
        "role": "internal-frontend",
        "memberCount": 2,
        "members": [
          {
            "identity": "172.16.8.104:7236"
          },
          {
            "identity": "172.16.23.57:7236"
          }
        ]
      },
      {
        "role": "history",
        "memberCount": 4,
        "members": [
          {
            "identity": "172.16.33.254:7234"
          },
          {
            "identity": "172.16.34.153:7234"
          },
          {
            "identity": "172.16.25.55:7234"
          },
          {
            "identity": "172.16.37.45:7234"
          }
        ]
      },
      {
        "role": "matching",
        "memberCount": 3,
        "members": [
          {
            "identity": "172.16.11.190:7235"
          },
          {
            "identity": "172.16.37.76:7235"
          },
          {
            "identity": "172.16.36.75:7235"
          }
        ]
      },
      {
        "role": "worker",
        "memberCount": 3,
        "members": [
          {
            "identity": "172.16.23.84:7239"
          },
          {
            "identity": "172.16.2.98:7239"
          },
          {
            "identity": "172.16.9.44:7239"
          }
        ]
      }
    ]
  },
  "clusterId": "25cb4f70-a15f-4ae8-81bd-ddb68242a8eb",
  "clusterName": "active",
  "historyShardCount": 8192,
  "persistenceStore": "postgres",
  "visibilityStore": "postgres",
  "failoverVersionIncrement": "10",
  "initialFailoverVersion": "1"
}
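
One way to read the two outputs above together is to split the cluster_membership rows into those whose address still appears in reachableMembers and those that do not. The following is a minimal Python sketch using a handful of rows copied from the output above; the helper name split_rows is hypothetical and not part of any Temporal tooling.

```python
# A few (rpc_address, rpc_port) pairs copied from the cluster_membership output.
membership_rows = [
    ("172.16.36.211", 6933),
    ("172.16.34.153", 6934),
    ("172.16.36.100", 6939),
    ("172.16.34.175", 6933),
    ("172.16.37.76", 6935),
    ("172.16.25.71", 6934),
]

# The matching subset of "reachableMembers" from `tctl adm cl d`.
reachable_members = {
    "172.16.36.211:6933",
    "172.16.34.153:6934",
    "172.16.37.76:6935",
}


def split_rows(rows, reachable):
    """Partition rows into (live, stale) lists of "ip:port" strings."""
    live, stale = [], []
    for ip, port in rows:
        addr = f"{ip}:{port}"
        if addr in reachable:
            live.append(addr)
        else:
            stale.append(addr)
    return live, stale


live, stale = split_rows(membership_rows, reachable_members)
print("still reachable:", live)
print("left over from before the restart:", stale)
```

In this sample, the rows that are missing from reachableMembers are the ones whose session_start is 2024-02-02, i.e. the pods that existed before the restart.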

Actual Behavior

I expect the old IPs to be removed from the cluster_membership table.

Steps to Reproduce the Problem

  1. Restart the Temporal services

Specifications

  • Version: 1.22.4
  • Platform: linux

andreclaro avatar Feb 06 '24 13:02 andreclaro

The only way to resolve this issue is by:

  • Scaling down the temporal services to zero replicas
  • Clearing cluster_membership table
  • Scaling temporal services back up

andreclaro avatar Feb 06 '24 14:02 andreclaro

Old IPs in cluster_membership are expected and harmless. They're only used to bootstrap ringpop, and after that ringpop itself is used for membership information. Only rows that were updated in the last 20 seconds are actually used; older rows are kept around for debugging and will be removed after 48 hours.
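
To illustrate the two time windows described here, this is a rough Python sketch applied to two rows from the table in the issue. It is not Temporal's actual code: the constant and function names are made up for illustration, and the "now" value is an assumption (the latest last_heartbeat visible in the posted output).

```python
from datetime import datetime, timedelta

BOOTSTRAP_WINDOW = timedelta(seconds=20)  # "updated in the last 20 seconds"
RETENTION = timedelta(hours=48)           # "removed after 48 hours"


def parse(ts):
    """Parse a timestamp in the format psql printed in the issue."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")


def is_bootstrap_candidate(last_heartbeat, now):
    """True if the row was heartbeated recently enough to be used for bootstrap."""
    return now - last_heartbeat <= BOOTSTRAP_WINDOW


# (rpc_address, last_heartbeat, record_expiry) copied from the posted table.
rows = [
    ("172.16.36.211", "2024-02-06 13:54:17.480887", "2024-02-08 13:54:17.480887"),
    ("172.16.36.100", "2024-02-06 13:39:21.505523", "2024-02-08 13:39:21.505523"),
]

# Assumption: the query ran at about the latest heartbeat shown in the output.
now = parse("2024-02-06 13:54:26.233305")

results = {}
for addr, heartbeat, expiry in rows:
    hb = parse(heartbeat)
    results[addr] = {
        "used_for_bootstrap": is_bootstrap_candidate(hb, now),
        "expiry_is_heartbeat_plus_48h": parse(expiry) - hb == RETENTION,
    }
    print(addr, results[addr])
```

For both rows record_expiry is exactly last_heartbeat plus 48 hours, which matches the retention described above; only the first row falls inside the 20-second window.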

dnr avatar Feb 07 '24 00:02 dnr

So why are the Temporal services still using the old IP addresses and consequently causing these packet drops?

image

andreclaro avatar Feb 07 '24 22:02 andreclaro

It seems these old IP addresses are also stored in memory by each service and are not removed when they are no longer reachable.

andreclaro avatar Feb 09 '24 12:02 andreclaro

There is a cache from IP address to gRPC connection in each service, and it's true that entries aren't removed from that cache, so it's possible gRPC is still trying to do some kind of heartbeat to them. This shouldn't cause any problems though.

dnr avatar Feb 09 '24 22:02 dnr

The drops are not going away... here is an example:

image

andreclaro avatar Feb 27 '24 17:02 andreclaro

Is there any observable effect on the operation of the cluster?

dnr avatar Feb 28 '24 03:02 dnr

I just think it doesn't make sense to be spamming the cluster with network connections that are dropped / not required.

The old IPs should be invalidated from the cache/db after a few minutes (not hours or days).
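
As a rough illustration of the behaviour I am asking for, here is a minimal Python sketch of an address-to-connection cache that drops entries not refreshed within a TTL. It is purely hypothetical: the class, its methods, and the 5-minute TTL are made up for this example and do not reflect how Temporal's cache is implemented.

```python
import time


class ExpiringConnectionCache:
    """Address -> connection cache that forgets entries not refreshed within a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # address -> (connection, last_seen)

    def touch(self, address, connection):
        """Insert or refresh an entry, e.g. whenever membership reports the host."""
        self._entries[address] = (connection, self._clock())

    def evict_stale(self):
        """Remove and return addresses that have not been refreshed within the TTL."""
        now = self._clock()
        stale = [a for a, (_, seen) in self._entries.items() if now - seen > self._ttl]
        for address in stale:
            del self._entries[address]
        return stale

    def addresses(self):
        return sorted(self._entries)


# Demo with a fake clock so the example does not need to sleep.
fake_now = [0.0]
cache = ExpiringConnectionCache(ttl_seconds=300, clock=lambda: fake_now[0])

cache.touch("172.16.36.100:6939", "conn-to-old-pod")   # pod from before the restart
fake_now[0] = 200.0
cache.touch("172.16.9.44:6939", "conn-to-new-pod")     # pod from after the restart
fake_now[0] = 400.0

evicted = cache.evict_stale()
print("evicted:", evicted)
print("remaining:", cache.addresses())
```

With something like this, a connection to an address that has left the membership ring would be dropped within minutes instead of lingering until the process restarts.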

andreclaro avatar Mar 04 '24 22:03 andreclaro