Restart of coordinator node causes cluster to stay in "STARTING" state

Open dene14 opened this issue 6 years ago • 4 comments

After the coordinator restarts, the other nodes must be restarted too before the cluster returns to the "NORMAL" state.

If Pilosa nodes advertise themselves with a plain IP address, the log looks as follows:

2018/10/11 14:47:28 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.33.240:11101
2018/10/11 14:47:48 [DEBUG] memberlist: Failed ping: 81abde46-5895-4328-b363-bf0668416f43 (timeout reached)
2018/10/11 14:47:50 [INFO] memberlist: Suspect 81abde46-5895-4328-b363-bf0668416f43 has failed, no acks received
2018/10/11 14:47:52 [DEBUG] memberlist: Failed ping: 81abde46-5895-4328-b363-bf0668416f43 (timeout reached)
2018/10/11 14:47:53 [INFO] memberlist: Suspect 81abde46-5895-4328-b363-bf0668416f43 has failed, no acks received
2018/10/11 14:47:53 [DEBUG] memberlist: Failed ping: 81abde46-5895-4328-b363-bf0668416f43 (timeout reached)
2018/10/11 14:47:54 [INFO] memberlist: Marking 81abde46-5895-4328-b363-bf0668416f43 as failed, suspect timeout reached (0 peer confirmations)
2018/10/11 14:47:54 received node leave on Node: 64f87e54-944c-42cd-bd23-c799af09acd7: Node: 81abde46-5895-4328-b363-bf0668416f43, uri: http://10.239.33.240:10101
2018/10/11 14:47:54 finished node leave on Node: 64f87e54-944c-42cd-bd23-c799af09acd7: Node: 81abde46-5895-4328-b363-bf0668416f43, uri: http://10.239.33.240:10101
2018/10/11 14:47:56 [INFO] memberlist: Suspect 81abde46-5895-4328-b363-bf0668416f43 has failed, no acks received
2018/10/11 14:47:58 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.52:11101
2018/10/11 14:48:06 [DEBUG] memberlist: Stream connection from=10.239.34.57:42478
2018/10/11 14:48:06 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.33.240:11101 Theirs: 10.239.35.176:11101
2018/10/11 14:48:06 merge cluster status: &{e2b8a9c7-46d0-4041-91f1-76fe29c87815 NORMAL [Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 Node: 64f87e54-944c-42cd-bd23-c799af09acd7 Node: 81abde46-5895-4328-b363-bf0668416f43]}
2018/10/11 14:48:06 add node Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 add node Node: 64f87e54-944c-42cd-bd23-c799af09acd7 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 add node Node: 81abde46-5895-4328-b363-bf0668416f43 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 mark node as joined (received coordinator update)
2018/10/11 14:48:06 merge cluster status: &{e2b8a9c7-46d0-4041-91f1-76fe29c87815 STARTING [Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 Node: 64f87e54-944c-42cd-bd23-c799af09acd7 Node: 81abde46-5895-4328-b363-bf0668416f43]}
2018/10/11 14:48:06 add node Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 add node Node: 64f87e54-944c-42cd-bd23-c799af09acd7 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 add node Node: 81abde46-5895-4328-b363-bf0668416f43 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 change cluster state from NORMAL to STARTING on 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 14:48:06 mark node as joined (received coordinator update)
2018/10/11 14:48:06 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.33.240:11101 Theirs: 10.239.35.176:11101
2018/10/11 14:48:06 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.33.240:11101 Theirs: 10.239.35.176:11101
2018/10/11 14:48:06 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.33.240:11101 Theirs: 10.239.35.176:11101
2018/10/11 14:48:06 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.33.240:11101 Theirs: 10.239.35.176:11101
2018/10/11 14:48:06 [DEBUG] memberlist: Stream connection from=10.239.23.52:49532
2018/10/11 14:48:28 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.52:11101
2018/10/11 14:48:36 [DEBUG] memberlist: Stream connection from=10.239.23.52:49812
2018/10/11 14:48:58 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.52:11101
2018/10/11 14:49:03 received NodeJoin event: &{0 Node: 81abde46-5895-4328-b363-bf0668416f43}
2018/10/11 14:49:06 [DEBUG] memberlist: Stream connection from=10.239.23.52:50102
2018/10/11 14:49:28 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.35.176:11101
2018/10/11 14:49:33 [DEBUG] memberlist: Stream connection from=10.239.35.176:56436
2018/10/11 14:49:36 [DEBUG] memberlist: Stream connection from=10.239.23.52:50382

Sometimes a node also behaves as described in #1688.

In case of DNS-based advertising:

2018/10/11 16:51:17 [DEBUG] memberlist: Stream connection from=10.239.34.33:32918
2018/10/11 16:51:22 [DEBUG] memberlist: Failed ping: 81abde46-5895-4328-b363-bf0668416f43 (timeout reached)
2018/10/11 16:51:23 [INFO] memberlist: Suspect 81abde46-5895-4328-b363-bf0668416f43 has failed, no acks received
2018/10/11 16:51:24 [DEBUG] memberlist: Failed ping: 81abde46-5895-4328-b363-bf0668416f43 (timeout reached)
2018/10/11 16:51:25 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.58:11101
2018/10/11 16:51:26 [INFO] memberlist: Suspect 81abde46-5895-4328-b363-bf0668416f43 has failed, no acks received
2018/10/11 16:51:27 [INFO] memberlist: Marking 81abde46-5895-4328-b363-bf0668416f43 as failed, suspect timeout reached (0 peer confirmations)
2018/10/11 16:51:27 received node leave on Node: 64f87e54-944c-42cd-bd23-c799af09acd7: Node: 81abde46-5895-4328-b363-bf0668416f43, uri: http://dev-pilosa-0.dev-pilosa-headless.dev-pilosa.svc.cluster.local:10101
2018/10/11 16:51:27 finished node leave on Node: 64f87e54-944c-42cd-bd23-c799af09acd7: Node: 81abde46-5895-4328-b363-bf0668416f43, uri: http://dev-pilosa-0.dev-pilosa-headless.dev-pilosa.svc.cluster.local:10101
2018/10/11 16:51:33 [DEBUG] memberlist: Stream connection from=10.239.23.58:36288
2018/10/11 16:51:45 [DEBUG] memberlist: Stream connection from=10.239.33.171:55024
2018/10/11 16:51:45 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.34.33:11101 Theirs: 10.239.33.171:11101
2018/10/11 16:51:45 merge cluster status: &{e2b8a9c7-46d0-4041-91f1-76fe29c87815 NORMAL [Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 Node: 64f87e54-944c-42cd-bd23-c799af09acd7 Node: 81abde46-5895-4328-b363-bf0668416f43]}
2018/10/11 16:51:45 add node Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 add node Node: 64f87e54-944c-42cd-bd23-c799af09acd7 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 add node Node: 81abde46-5895-4328-b363-bf0668416f43 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 mark node as joined (received coordinator update)
2018/10/11 16:51:45 merge cluster status: &{e2b8a9c7-46d0-4041-91f1-76fe29c87815 STARTING [Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 Node: 64f87e54-944c-42cd-bd23-c799af09acd7 Node: 81abde46-5895-4328-b363-bf0668416f43]}
2018/10/11 16:51:45 add node Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 add node Node: 64f87e54-944c-42cd-bd23-c799af09acd7 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 add node Node: 81abde46-5895-4328-b363-bf0668416f43 to cluster on Node: 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 change cluster state from NORMAL to STARTING on 64f87e54-944c-42cd-bd23-c799af09acd7
2018/10/11 16:51:45 mark node as joined (received coordinator update)
2018/10/11 16:51:45 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.34.33:11101 Theirs: 10.239.33.171:11101
2018/10/11 16:51:45 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.34.33:11101 Theirs: 10.239.33.171:11101
2018/10/11 16:51:46 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.34.33:11101 Theirs: 10.239.33.171:11101
2018/10/11 16:51:46 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.34.33:11101 Theirs: 10.239.33.171:11101
2018/10/11 16:51:55 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.58:11101
2018/10/11 16:52:03 [DEBUG] memberlist: Stream connection from=10.239.23.58:36570
2018/10/11 16:52:25 received NodeJoin event: &{0 Node: 81abde46-5895-4328-b363-bf0668416f43}
2018/10/11 16:52:25 [DEBUG] memberlist: Initiating push/pull sync with: 10.239.23.58:11101
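
Both logs show the same pattern: memberlist reports a conflicting address for the restarted node, and the local node then moves from NORMAL to STARTING. A minimal sketch for pulling those two events out of a node log follows. The regular expressions are derived only from the log lines quoted above, and the function name `summarize` is hypothetical; other Pilosa versions may format these messages differently.

```python
import re

# Matches the memberlist error line as it appears in the logs above.
CONFLICT_RE = re.compile(
    r"Conflicting address for (?P<node>[0-9a-f-]+)\. "
    r"Mine: (?P<mine>\S+) Theirs: (?P<theirs>\S+)"
)
# Matches the cluster state transition line as it appears in the logs above.
STATE_RE = re.compile(
    r"change cluster state from (?P<old>\w+) to (?P<new>\w+) on (?P<node>[0-9a-f-]+)"
)


def summarize(log_text):
    """Return (conflicts, transitions) found in a node log.

    conflicts: set of (node_id, known_address, advertised_address)
    transitions: list of (old_state, new_state) in log order
    """
    conflicts = set()
    transitions = []
    for line in log_text.splitlines():
        m = CONFLICT_RE.search(line)
        if m:
            conflicts.add((m["node"], m["mine"], m["theirs"]))
            continue
        m = STATE_RE.search(line)
        if m:
            transitions.append((m["old"], m["new"]))
    return conflicts, transitions


# Two lines copied from the plain-IP log above.
sample = """\
2018/10/11 14:48:06 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.33.240:11101 Theirs: 10.239.35.176:11101
2018/10/11 14:48:06 change cluster state from NORMAL to STARTING on 64f87e54-944c-42cd-bd23-c799af09acd7
"""

conflicts, transitions = summarize(sample)
print(conflicts)
print(transitions)
```

Repeated conflict lines collapse into one entry because the conflicts are kept in a set, which makes it easier to see that a single node came back under a new address.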

The configuration is the same as in #1687 and #1688.

dene14 avatar Oct 11 '18 16:10 dene14

related: #1585

travisturner avatar Oct 11 '18 17:10 travisturner

It's possible this was resolved by #1717; this still needs to be confirmed.

jaffee avatar Nov 26 '18 16:11 jaffee

@jaffee I have some questions about what happens if the coordinator node fails.

yuzhichang avatar Feb 02 '19 09:02 yuzhichang

@yuzhichang yes, it should be possible to set a new coordinator when the coordinator is down. Honestly though, I'm not sure we have a test for this particular case, so you may want to play with it a bit.

The cluster should still be able to respond to most requests when the coordinator is down (assuming replication is > 1) - the major exceptions would be things like cluster resize requests.
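
As a rough sketch of what "setting a new coordinator" and checking the cluster state could look like from a script: the endpoint path `/cluster/resize/set-coordinator`, the `{"id": ...}` request body, and the `state` field of the status response are assumptions here and are not confirmed anywhere in this thread, so check them against the API documentation for your Pilosa version. The helper names are hypothetical. The request is only built, not sent.

```python
import json
import urllib.request


def build_set_coordinator_request(host, node_id):
    """Build (but do not send) a request naming node_id as the new coordinator.

    ASSUMPTION: the path and JSON body below are not confirmed by this thread;
    verify them against the API docs of the Pilosa version in use.
    """
    body = json.dumps({"id": node_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}/cluster/resize/set-coordinator",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def cluster_state(status_json):
    """Extract the cluster state from a status response body.

    ASSUMPTION: the response is a JSON object with a top-level "state" field.
    Returns None if the field is absent.
    """
    return json.loads(status_json).get("state")


# Node ID taken from the logs earlier in this issue; host is a placeholder.
req = build_set_coordinator_request(
    "localhost:10101", "64f87e54-944c-42cd-bd23-c799af09acd7"
)
print(req.get_method(), req.full_url)
print(cluster_state('{"state": "STARTING"}'))
```

To actually send the request, pass `req` to `urllib.request.urlopen` against a surviving node, then poll the status endpoint until `cluster_state` no longer reports STARTING.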

jaffee avatar Feb 04 '19 20:02 jaffee