Packet drops after restarting Temporal Cluster
Actual Behavior
We have been observing network packet drops (with cause stale_or_unroutable_ip) whenever we restart our Temporal Clusters (on new deployments, releases, etc.).
Drops (screenshot of the drop metrics omitted):
The old IP addresses of the Temporal services are still in the cluster_membership table:
temporal=> SELECT * from cluster_membership;
membership_partition | host_id | rpc_address | rpc_port | role | session_start | last_heartbeat | record_expiry
----------------------+------------------------------------+---------------+----------+------+----------------------------+----------------------------+----------------------------
0 | \x27f0726cc4f511ee8c4a7e886be421e8 | 172.16.36.211 | 6933 | 1 | 2024-02-06 13:39:31.178854 | 2024-02-06 13:54:17.480887 | 2024-02-08 13:54:17.480887
0 | \x406ee616c4f511ee9c2c4eff8ed03845 | 172.16.34.153 | 6934 | 2 | 2024-02-06 13:40:12.256704 | 2024-02-06 13:54:18.532144 | 2024-02-08 13:54:18.532144
0 | \x9859b450c1a611eeb02a968626816f6d | 172.16.36.100 | 6939 | 4 | 2024-02-02 08:39:36.020235 | 2024-02-06 13:39:21.505523 | 2024-02-08 13:39:21.505523
0 | \x8e3abc89c1a611ee860fba5a10a03e3e | 172.16.34.175 | 6933 | 1 | 2024-02-02 08:39:19.077945 | 2024-02-06 13:39:21.516706 | 2024-02-08 13:39:21.516706
0 | \x2737ad64c4f511ee9e3dea8b31a7ca50 | 172.16.37.76 | 6935 | 3 | 2024-02-06 13:39:29.943083 | 2024-02-06 13:54:19.267353 | 2024-02-08 13:54:19.267353
0 | \x8bf5d329c1a611eeab2d1663815337d5 | 172.16.25.71 | 6934 | 2 | 2024-02-02 08:39:15.230349 | 2024-02-06 13:39:22.505874 | 2024-02-08 13:39:22.505874
0 | \x2767ad39c4f511eeb5b4e2f2f4ca461b | 172.16.33.254 | 6934 | 2 | 2024-02-06 13:39:30.276643 | 2024-02-06 13:54:19.615724 | 2024-02-08 13:54:19.615724
0 | \x8cf287a6c1a611eea8ee2630adf79dee | 172.16.25.155 | 6933 | 1 | 2024-02-02 08:39:16.920261 | 2024-02-06 13:39:24.229088 | 2024-02-08 13:39:24.229088
0 | \x8d2695a8c1a611eeaf43ba9bd9d72b37 | 172.16.34.249 | 6936 | 5 | 2024-02-02 08:39:17.234805 | 2024-02-06 13:39:24.928443 | 2024-02-08 13:39:24.928443
0 | \x9b785481c1a611ee8e1b626cb429a928 | 172.16.37.235 | 6933 | 1 | 2024-02-02 08:39:41.263783 | 2024-02-06 13:39:25.923658 | 2024-02-08 13:39:25.923658
0 | \x26343fa1c4f511eebb64d26e62cd671b | 172.16.23.217 | 6933 | 1 | 2024-02-06 13:39:28.256783 | 2024-02-06 13:54:19.640885 | 2024-02-08 13:54:19.640885
0 | \x2ec9f65fc4f511eebb0c42b810af6adf | 172.16.10.37 | 6933 | 1 | 2024-02-06 13:39:42.645453 | 2024-02-06 13:54:19.924845 | 2024-02-08 13:54:19.924845
0 | \x3546754ac4f511eeb0df5e406aef04b7 | 172.16.9.44 | 6939 | 4 | 2024-02-06 13:39:53.52417 | 2024-02-06 13:54:21.815246 | 2024-02-08 13:54:21.815246
0 | \x9b761baac1a611ee86861ee633e6ff6e | 172.16.37.57 | 6934 | 2 | 2024-02-02 08:39:41.249319 | 2024-02-06 13:40:03.892467 | 2024-02-08 13:40:03.892467
0 | \x2bd60bffc4f511ee8038f2a15c53c361 | 172.16.33.218 | 6933 | 1 | 2024-02-06 13:39:37.706043 | 2024-02-06 13:54:22.096942 | 2024-02-08 13:54:22.096942
0 | \x2e034424c4f511ee8b8422fe0b9db541 | 172.16.2.98 | 6939 | 4 | 2024-02-06 13:39:41.342732 | 2024-02-06 13:54:22.630096 | 2024-02-08 13:54:22.630096
0 | \x264e5763c4f511eeb6516a80b6635bad | 172.16.8.104 | 6936 | 5 | 2024-02-06 13:39:28.414856 | 2024-02-06 13:54:23.857224 | 2024-02-08 13:54:23.857224
0 | \x8d48423fc1a611eeb3267a0ba9253295 | 172.16.33.56 | 6936 | 5 | 2024-02-02 08:39:17.478502 | 2024-02-06 13:39:28.570702 | 2024-02-08 13:39:28.570702
0 | \x2a052748c4f511eea9d3be82b1fb36b9 | 172.16.23.57 | 6936 | 5 | 2024-02-06 13:39:34.658886 | 2024-02-06 13:54:25.05269 | 2024-02-08 13:54:25.05269
0 | \x30ba0b59c4f511ee9640622359ce980d | 172.16.36.75 | 6935 | 3 | 2024-02-06 13:39:45.89489 | 2024-02-06 13:54:25.22052 | 2024-02-08 13:54:25.22052
0 | \x2ef49a0cc4f511ee8241c67c007be2c5 | 172.16.11.190 | 6935 | 3 | 2024-02-06 13:39:42.933887 | 2024-02-06 13:54:25.330191 | 2024-02-08 13:54:25.330191
0 | \x44663929c4f511eea6d4c6614722b5d8 | 172.16.37.45 | 6934 | 2 | 2024-02-06 13:40:18.911727 | 2024-02-06 13:54:26.233305 | 2024-02-08 13:54:26.233305
0 | \x8fb19965c1a611ee8a3daac84ef7ce0e | 172.16.33.10 | 6935 | 3 | 2024-02-02 08:39:21.508819 | 2024-02-06 13:39:31.954545 | 2024-02-08 13:39:31.954545
0 | \x903b6e8bc1a611ee95c81e7a8687d124 | 172.16.25.182 | 6939 | 4 | 2024-02-02 08:39:22.394533 | 2024-02-06 13:39:36.146894 | 2024-02-08 13:39:36.146894
0 | \x9b7215a2c1a611ee95cb16a506174482 | 172.16.37.81 | 6939 | 4 | 2024-02-02 08:39:41.208757 | 2024-02-06 13:39:40.488594 | 2024-02-08 13:39:40.488594
0 | \x8f9e505bc1a611ee86e66e9c4e8dfa7b | 172.16.34.121 | 6935 | 3 | 2024-02-02 08:39:21.388292 | 2024-02-06 13:39:41.75921 | 2024-02-08 13:39:41.75921
0 | \x90305675c1a611eeaf185a86e9283977 | 172.16.25.153 | 6935 | 3 | 2024-02-02 08:39:22.317386 | 2024-02-06 13:39:42.811889 | 2024-02-08 13:39:42.811889
0 | \x8c09ae21c1a611ee90916e2ec790e4bc | 172.16.33.9 | 6934 | 2 | 2024-02-02 08:39:15.388081 | 2024-02-06 13:39:04.497511 | 2024-02-08 13:39:04.497511
0 | \x8c135f47c1a611ee8f015a97772859d3 | 172.16.33.97 | 6933 | 1 | 2024-02-02 08:39:15.460704 | 2024-02-06 13:39:13.370839 | 2024-02-08 13:39:13.370839
0 | \x8d28167fc1a611eeb2f7f225c06c088b | 172.16.34.92 | 6934 | 2 | 2024-02-02 08:39:17.259959 | 2024-02-06 13:39:46.189244 | 2024-02-08 13:39:46.189244
0 | \x2633938fc4f511eebef86e735020f6ec | 172.16.23.84 | 6939 | 4 | 2024-02-06 13:39:28.253228 | 2024-02-06 13:54:15.639278 | 2024-02-08 13:54:15.639278
0 | \x34a909b4c4f511eebde066f106d6cd31 | 172.16.25.55 | 6934 | 2 | 2024-02-06 13:39:52.50008 | 2024-02-06 13:54:15.744222 | 2024-02-08 13:54:15.744222
(32 rows)
temporal-prod-operator-6b9bf85f75-mp4h4:/$ tctl adm cl d
{
"supportedClients": {
"temporal-cli": "\u003c2.0.0",
"temporal-go": "\u003c2.0.0",
"temporal-java": "\u003c2.0.0",
"temporal-php": "\u003c2.0.0",
"temporal-server": "\u003c2.0.0",
"temporal-typescript": "\u003c2.0.0",
"temporal-ui": "\u003c3.0.0"
},
"serverVersion": "1.22.4",
"membershipInfo": {
"currentHost": {
"identity": "172.16.10.37:7233"
},
"reachableMembers": [
"172.16.25.55:6934",
"172.16.8.104:6936",
"172.16.37.76:6935",
"172.16.9.44:6939",
"172.16.23.57:6936",
"172.16.37.45:6934",
"172.16.2.98:6939",
"172.16.36.75:6935",
"172.16.10.37:6933",
"172.16.23.217:6933",
"172.16.33.218:6933",
"172.16.34.153:6934",
"172.16.36.211:6933",
"172.16.11.190:6935",
"172.16.23.84:6939",
"172.16.33.254:6934"
],
"rings": [
{
"role": "frontend",
"memberCount": 4,
"members": [
{
"identity": "172.16.23.217:7233"
},
{
"identity": "172.16.36.211:7233"
},
{
"identity": "172.16.33.218:7233"
},
{
"identity": "172.16.10.37:7233"
}
]
},
{
"role": "internal-frontend",
"memberCount": 2,
"members": [
{
"identity": "172.16.8.104:7236"
},
{
"identity": "172.16.23.57:7236"
}
]
},
{
"role": "history",
"memberCount": 4,
"members": [
{
"identity": "172.16.33.254:7234"
},
{
"identity": "172.16.34.153:7234"
},
{
"identity": "172.16.25.55:7234"
},
{
"identity": "172.16.37.45:7234"
}
]
},
{
"role": "matching",
"memberCount": 3,
"members": [
{
"identity": "172.16.11.190:7235"
},
{
"identity": "172.16.37.76:7235"
},
{
"identity": "172.16.36.75:7235"
}
]
},
{
"role": "worker",
"memberCount": 3,
"members": [
{
"identity": "172.16.23.84:7239"
},
{
"identity": "172.16.2.98:7239"
},
{
"identity": "172.16.9.44:7239"
}
]
}
]
},
"clusterId": "25cb4f70-a15f-4ae8-81bd-ddb68242a8eb",
"clusterName": "active",
"historyShardCount": 8192,
"persistenceStore": "postgres",
"visibilityStore": "postgres",
"failoverVersionIncrement": "10",
"initialFailoverVersion": "1"
}
Expected Behavior
I expect the old IPs to be removed from the cluster_membership table.
Steps to Reproduce the Problem
- Restart the Temporal services (an example restart is sketched below)
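A minimal sketch of such a restart, assuming the services run as Kubernetes Deployments (the Deployment names and namespace below are illustrative assumptions, not taken from the report):

# Rolling restart of the Temporal services; names are assumptions.
for d in temporal-frontend temporal-history temporal-matching temporal-worker; do
  kubectl -n temporal rollout restart deployment "$d"
done

# The replacement pods register with new IPs, while the rows for the old
# pods remain in cluster_membership until record_expiry (~48 hours later).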
Specifications
- Version: 1.22.4
- Platform: linux
The only way to resolve this issue is by:
- Scaling down the Temporal services to zero replicas
- Clearing the cluster_membership table
- Scaling the Temporal services back up (see the sketch below)
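A minimal sketch of that workaround, assuming the services run as Kubernetes Deployments and that the PostgreSQL persistence database shown above is reachable via psql (the Deployment names, namespace, replica counts, and $TEMPORAL_DB_URL connection string are assumptions):

# Hypothetical workaround sketch; names, namespace, replica counts, and
# $TEMPORAL_DB_URL are illustrative assumptions.
DEPLOYS="temporal-frontend temporal-internal-frontend temporal-history temporal-matching temporal-worker"

# 1. Scale the Temporal services down to zero replicas.
for d in $DEPLOYS; do
  kubectl -n temporal scale deployment "$d" --replicas=0
done

# 2. Once all pods have terminated, clear the membership table.
psql "$TEMPORAL_DB_URL" -c "DELETE FROM cluster_membership;"

# 3. Scale the services back up (adjust the replica count per role).
for d in $DEPLOYS; do
  kubectl -n temporal scale deployment "$d" --replicas=3
done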
Old IPs in cluster_membership are expected and harmless. They're only used to bootstrap ringpop, and then ringpop itself is used for membership information. Only rows that were updated in the last 20 seconds are actually used; older rows are kept around for debugging and will be removed after 48 hours.
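For reference, a query along the lines below (a sketch, using the 20-second window mentioned above; $TEMPORAL_DB_URL is an assumed connection string) separates the rows that would actually be used from the stale ones kept only for debugging:

# Show only members whose heartbeat falls inside the 20-second window;
# everything else is a stale row awaiting its record_expiry.
# (Assumes last_heartbeat and the server clock use the same timezone.)
psql "$TEMPORAL_DB_URL" -c "
  SELECT rpc_address, rpc_port, role, last_heartbeat
  FROM cluster_membership
  WHERE last_heartbeat > now() - interval '20 seconds'
  ORDER BY role, rpc_address;"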
So, why are the Temporal services using the old IP addresses and consequently causing these packet drops?
It seems these old IP addresses are also stored in memory by each service and not removed when they are no longer reachable.
There is a cache from IP address to gRPC connection in each service, and it's true that entries aren't removed from that cache, so it's possible gRPC is still trying to do some kind of heartbeat to them. This shouldn't cause any problems, though.
The drops are not going away... here is an example (screenshot of the drop metrics omitted).
Is there any observable effect on the operation of the cluster?
I just think it doesn't make sense to be spamming the cluster with network connections that are dropped / not required.
The old IPs should be invalidated from the cache/db after a few minutes (not hours or days).