featurebase
Restarting the coordinator node causes the cluster to stay in the "STARTING" state
After the coordinator restarts, the other nodes must be restarted as well to return the cluster to the "NORMAL" state.
If Pilosa nodes advertise themselves with a plain IP address, the log on a non-coordinator node looks as follows:
2018/10/11 14:47:28 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.33.240:11101
2018/10/11 14:47:48 [DEBUG] memberlist: Failed ping: 81abde46-5895-4328-b363-bf0668416f43 (timeout reached)
2018/10/11 14:47:50 [INFO] memberlist: Suspect 81abde46-5895-4328-b363-bf0668416f43 has failed, no acks received
2018/10/11 14:47:52 [DEBUG] memberlist: Failed ping: 81abde46-5895-4328-b363-bf0668416f43 (timeout reached)
2018/10/11 14:47:53 [INFO] memberlist: Suspect 81abde46-5895-4328-b363-bf0668416f43 has failed, no acks received
2018/10/11 14:47:53 [DEBUG] memberlist: Failed ping: 81abde46-5895-4328-b363-bf0668416f43 (timeout reached)
2018/10/11 14:47:54 [INFO] memberlist: Marking 81abde46-5895-4328-b363-bf0668416f43 as failed, suspect timeout reached (0 peer confirmations)
2018/10/11 14:47:54 received node leave on Node: 64f87e54-944c-42cd-bd23-c799af09acd7: Node: 81abde46-5895-4328-b363-bf0668416f43, uri: http://10.239.33.240:10101
2018/10/11 14:47:54 finished node leave on Node: 64f87e54-944c-42cd-bd23-c799af09acd7: Node: 81abde46-5895-4328-b363-bf0668416f43, uri: http://10.239.33.240:10101
2018/10/11 14:47:56 [INFO] memberlist: Suspect 81abde46-5895-4328-b363-bf0668416f43 has failed, no acks received
2018/10/11 14:47:58 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.52:11101
2018/10/11 14:48:06 [DEBUG] memberlist: Stream connection from=10.239.34.57:42478
2018/10/11 14:48:06 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.33.240:11101 Theirs: 10.239.35.176:11101
2018/10/11 14:48:06 merge cluster status: &{e2b8a9c7-46d0-4041-91f1-76fe29c87815 NORMAL [Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 Node: 64f87e54-944c-42cd-bd23-c799af09acd7 Node: 81abde46-5895-4328-b363-bf0668416f43]}
2018/10/11 14:48:06 add node Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 add node Node: 64f87e54-944c-42cd-bd23-c799af09acd7 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 add node Node: 81abde46-5895-4328-b363-bf0668416f43 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 mark node as joined (received coordinator update)
2018/10/11 14:48:06 merge cluster status: &{e2b8a9c7-46d0-4041-91f1-76fe29c87815 STARTING [Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 Node: 64f87e54-944c-42cd-bd23-c799af09acd7 Node: 81abde46-5895-4328-b363-bf0668416f43]}
2018/10/11 14:48:06 add node Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 add node Node: 64f87e54-944c-42cd-bd23-c799af09acd7 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 add node Node: 81abde46-5895-4328-b363-bf0668416f43 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 change cluster state from NORMAL to STARTING on 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 mark node as joined (received coordinator update)
2018/10/11 14:48:06 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.33.240:11101 Theirs: 10.239.35.176:11101
2018/10/11 14:48:06 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.33.240:11101 Theirs: 10.239.35.176:11101
2018/10/11 14:48:06 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.33.240:11101 Theirs: 10.239.35.176:11101
2018/10/11 14:48:06 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.33.240:11101 Theirs: 10.239.35.176:11101
2018/10/11 14:48:06 [DEBUG] memberlist: Stream connection from=10.239.23.52:49532
2018/10/11 14:48:28 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.52:11101
2018/10/11 14:48:36 [DEBUG] memberlist: Stream connection from=10.239.23.52:49812
2018/10/11 14:48:58 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.52:11101
2018/10/11 14:49:03 received NodeJoin event: &{0 Node: 81abde46-5895-4328-b363-bf0668416f43}
2018/10/11 14:49:06 [DEBUG] memberlist: Stream connection from=10.239.23.52:50102
2018/10/11 14:49:28 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.35.176:11101
2018/10/11 14:49:33 [DEBUG] memberlist: Stream connection from=10.239.35.176:56436
2018/10/11 14:49:36 [DEBUG] memberlist: Stream connection from=10.239.23.52:50382
Sometimes a node also ends up in the state described in #1688.
In case of DNS-based advertising:
2018/10/11 16:51:17 [DEBUG] memberlist: Stream connection from=10.239.34.33:32918
2018/10/11 16:51:22 [DEBUG] memberlist: Failed ping: 81abde46-5895-4328-b363-bf0668416f43 (timeout reached)
2018/10/11 16:51:23 [INFO] memberlist: Suspect 81abde46-5895-4328-b363-bf0668416f43 has failed, no acks received
2018/10/11 16:51:24 [DEBUG] memberlist: Failed ping: 81abde46-5895-4328-b363-bf0668416f43 (timeout reached)
2018/10/11 16:51:25 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.58:11101
2018/10/11 16:51:26 [INFO] memberlist: Suspect 81abde46-5895-4328-b363-bf0668416f43 has failed, no acks received
2018/10/11 16:51:27 [INFO] memberlist: Marking 81abde46-5895-4328-b363-bf0668416f43 as failed, suspect timeout reached (0 peer confirmations)
2018/10/11 16:51:27 received node leave on Node: 64f87e54-944c-42cd-bd23-c799af09acd7: Node: 81abde46-5895-4328-b363-bf0668416f43, uri: http://dev-pilosa-0.dev-pilosa-headless.dev-pilosa.svc.cluster.local:10101
2018/10/11 16:51:27 finished node leave on Node: 64f87e54-944c-42cd-bd23-c799af09acd7: Node: 81abde46-5895-4328-b363-bf0668416f43, uri: http://dev-pilosa-0.dev-pilosa-headless.dev-pilosa.svc.cluster.local:10101
2018/10/11 16:51:33 [DEBUG] memberlist: Stream connection from=10.239.23.58:36288
2018/10/11 16:51:45 [DEBUG] memberlist: Stream connection from=10.239.33.171:55024
2018/10/11 16:51:45 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.34.33:11101 Theirs: 10.239.33.171:11101
2018/10/11 16:51:45 merge cluster status: &{e2b8a9c7-46d0-4041-91f1-76fe29c87815 NORMAL [Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 Node: 64f87e54-944c-42cd-bd23-c799af09acd7 Node: 81abde46-5895-4328-b363-bf0668416f43]}
2018/10/11 16:51:45 add node Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 add node Node: 64f87e54-944c-42cd-bd23-c799af09acd7 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 add node Node: 81abde46-5895-4328-b363-bf0668416f43 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 mark node as joined (received coordinator update)
2018/10/11 16:51:45 merge cluster status: &{e2b8a9c7-46d0-4041-91f1-76fe29c87815 STARTING [Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 Node: 64f87e54-944c-42cd-bd23-c799af09acd7 Node: 81abde46-5895-4328-b363-bf0668416f43]}
2018/10/11 16:51:45 add node Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 add node Node: 64f87e54-944c-42cd-bd23-c799af09acd7 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 add node Node: 81abde46-5895-4328-b363-bf0668416f43 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 change cluster state from NORMAL to STARTING on 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 mark node as joined (received coordinator update)
2018/10/11 16:51:45 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.34.33:11101 Theirs: 10.239.33.171:11101
2018/10/11 16:51:45 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.34.33:11101 Theirs: 10.239.33.171:11101
2018/10/11 16:51:46 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.34.33:11101 Theirs: 10.239.33.171:11101
2018/10/11 16:51:46 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.34.33:11101 Theirs: 10.239.33.171:11101
2018/10/11 16:51:55 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.58:11101
2018/10/11 16:52:03 [DEBUG] memberlist: Stream connection from=10.239.23.58:36570
2018/10/11 16:52:25 received NodeJoin event: &{0 Node: 81abde46-5895-4328-b363-bf0668416f43}
2018/10/11 16:52:25 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.58:11101
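Both excerpts show the same pattern: the restarted coordinator keeps its node ID but comes back under a new address, memberlist reports "Conflicting address", and the cluster state changes from NORMAL to STARTING. A minimal sketch for pulling those two signals out of a longer log is shown here. The regular expressions are written against the lines quoted in this issue only, so other Pilosa or memberlist versions may need different patterns, and `summarize` is a hypothetical helper name, not part of Pilosa.

```python
import re
from collections import defaultdict

# Patterns follow the log lines quoted in this issue; this is an assumption
# about the format, not a documented log schema.
CONFLICT_RE = re.compile(
    r"Conflicting address for (?P<node>[0-9a-f-]+)\. "
    r"Mine: (?P<mine>\S+) Theirs: (?P<theirs>\S+)"
)
STATE_RE = re.compile(
    r"change cluster state from (?P<old>\w+) to (?P<new>\w+) on (?P<node>[0-9a-f-]+)"
)


def summarize(lines):
    """Collect address conflicts and cluster state transitions from log lines."""
    conflicts = defaultdict(set)  # node ID -> {(mine, theirs), ...}
    transitions = []              # [(old state, new state, node ID), ...]
    for line in lines:
        m = CONFLICT_RE.search(line)
        if m:
            conflicts[m["node"]].add((m["mine"], m["theirs"]))
            continue
        m = STATE_RE.search(line)
        if m:
            transitions.append((m["old"], m["new"], m["node"]))
    return dict(conflicts), transitions
```

Passing an open log file (for example `summarize(open("pilosa.log"))`, where the file name is a placeholder) returns the repeated conflict lines collapsed to one entry per node and address pair, followed by the state transitions in log order.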
The configuration is the same as in #1687 and #1688.
related: #1585
This was possibly resolved by #1717; needs confirmation.
@jaffee I have some questions about what happens if the coordinator node fails.
- Does the cluster deny all read/write requests?
- Is "Changing the Coordinator" still possible?
@yuzhichang yes, it should be possible to set a new coordinator when the coordinator is down. Honestly though, I'm not sure we have a test for this particular case, so you may want to play with it a bit.
The cluster should still be able to respond to most requests when the coordinator is down (assuming replication is > 1) - the major exceptions would be things like cluster resize requests.
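When experimenting with a coordinator restart or a coordinator change, it can help to watch the state each node reports rather than reading logs by hand. The sketch below polls every node until all of them report NORMAL. It assumes each node exposes an HTTP `/status` endpoint that returns JSON with a top-level `state` field; that endpoint, the host names, and the helper names `fetch_state` and `wait_for_normal` are assumptions for illustration and should be checked against the Pilosa version in use.

```python
import json
import time
from urllib.request import urlopen


def fetch_state(host):
    """Return the cluster state reported by one node (assumed /status endpoint)."""
    with urlopen(f"http://{host}/status", timeout=5) as resp:
        return json.load(resp).get("state")


def wait_for_normal(hosts, fetch=fetch_state, attempts=30, delay=2.0):
    """Poll every host until all report NORMAL; return the last observed states."""
    states = {}
    for _ in range(attempts):
        states = {}
        for host in hosts:
            try:
                states[host] = fetch(host)
            except OSError:
                # Unreachable node (URLError is a subclass of OSError).
                states[host] = None
        if all(state == "NORMAL" for state in states.values()):
            break
        time.sleep(delay)
    return states
```

If the returned mapping still contains STARTING or None after the last attempt, the cluster did not recover on its own within the polling window, which matches the behavior reported in this issue.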