featurebase
Pilosa clustering isn't resilient to IP address changes
Pilosa doesn't try to re-establish the connection to a replaced node.
The /status listings show a split-brain for this node (it is still listed with its old IP).
I think Pilosa should check what the other nodes report and probe any other IP they list for the same node ID.
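To make the suggestion concrete, here is a rough sketch in Go of what "check other nodes, then probe another IP for the same node ID" could look like. This is not Pilosa code: `PeerView`, `candidateAddrs`, `firstReachable`, and `tcpProbe` are hypothetical names made up for illustration, and the addresses in `main` are the ones from the log in this report.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// PeerView is a hypothetical snapshot of the node ID -> address mapping
// reported by one peer (for example, what its /status shows).
type PeerView map[string]string

// candidateAddrs collects the addresses that peers report for nodeID and
// that differ from the address we currently hold, without duplicates.
func candidateAddrs(nodeID, known string, views []PeerView) []string {
	seen := map[string]bool{known: true}
	var out []string
	for _, v := range views {
		addr, ok := v[nodeID]
		if !ok || seen[addr] {
			continue
		}
		seen[addr] = true
		out = append(out, addr)
	}
	return out
}

// firstReachable returns the first address for which probe succeeds.
func firstReachable(addrs []string, probe func(string) bool) (string, error) {
	for _, a := range addrs {
		if probe(a) {
			return a, nil
		}
	}
	return "", fmt.Errorf("no reachable address among %d candidates", len(addrs))
}

// tcpProbe reports whether a TCP connection to addr opens within one second.
func tcpProbe(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	const nodeID = "81abde46-5895-4328-b363-bf0668416f43"
	known := "10.239.35.176:11101" // stale address this node still holds
	views := []PeerView{{nodeID: "10.239.34.33:11101"}} // what a peer reports

	addr, err := firstReachable(candidateAddrs(nodeID, known, views), tcpProbe)
	if err != nil {
		fmt.Println("still unreachable:", err)
		return
	}
	fmt.Println("node", nodeID, "is now reachable at", addr)
}
```

The probe is passed in as a function so the selection logic can be exercised without a network; a real fix would presumably hook into the gossip layer rather than dial peers directly.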
2018/10/11 14:55:40 mark node as joined (received coordinator update)
2018/10/11 14:55:40 merge cluster status: &{e2b8a9c7-46d0-4041-91f1-76fe29c87815 STARTING [Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 Node: 64f87e54-944c-42cd-bd23-c799af09acd7 Node: 81abde46-5895-4328-b363-bf0668416f43]}
2018/10/11 14:55:40 add node Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18 to cluster on Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18
2018/10/11 14:55:40 add node Node: 64f87e54-944c-42cd-bd23-c799af09acd7 to cluster on Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18
2018/10/11 14:55:40 add node Node: 81abde46-5895-4328-b363-bf0668416f43 to cluster on Node: 64bd7246-75e4-422e-8715-e2c3dc1cca18
2018/10/11 14:55:40 change cluster state from NORMAL to STARTING on 64bd7246-75e4-422e-8715-e2c3dc1cca18
2018/10/11 14:55:40 mark node as joined (received coordinator update)
2018/10/11 14:55:40 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.35.176:11101 Theirs: 10.239.34.33:11101
2018/10/11 14:55:40 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.35.176:11101 Theirs: 10.239.34.33:11101
2018/10/11 14:55:40 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.35.176:11101 Theirs: 10.239.34.33:11101
2018/10/11 14:55:40 [ERR] memberlist: Conflicting address for 81abde46-5895-4328-b363-bf0668416f43. Mine: 10.239.35.176:11101 Theirs: 10.239.34.33:11101
2018/10/11 14:55:43 pilosa: replication error: http: cannot connect to translate store endpoint: Get http://10.239.35.176:10101/internal/translate/data?offset=0: dial tcp 10.239.35.176:10101: connect: no route to host
2018/10/11 14:55:44 [DEBUG] memberlist: Stream connection from=10.239.8.208:38514
2018/10/11 14:55:44 pilosa: reconnecting to primary replica
2018/10/11 14:55:44 pilosa: replicating from offset 0
2018/10/11 14:55:44 pilosa: replication error: http: cannot connect to translate store endpoint: Get http://10.239.35.176:10101/internal/translate/data?offset=0: dial tcp 10.239.35.176:10101: connect: no route to host
2018/10/11 14:55:45 pilosa: reconnecting to primary replica
2018/10/11 14:55:45 pilosa: replicating from offset 0
2018/10/11 14:55:45 pilosa: replication error: http: cannot connect to translate store endpoint: Get http://10.239.35.176:10101/internal/translate/data?offset=0: dial tcp 10.239.35.176:10101: connect: no route to host
2018/10/11 14:55:46 pilosa: reconnecting to primary replica
2018/10/11 14:55:46 pilosa: replicating from offset 0
2018/10/11 14:55:46 pilosa: replication error: http: cannot connect to translate store endpoint: Get http://10.239.35.176:10101/internal/translate/data?offset=0: dial tcp 10.239.35.176:10101: connect: no route to host
2018/10/11 14:55:47 pilosa: reconnecting to primary replica
2018/10/11 14:55:47 pilosa: replicating from offset 0
2018/10/11 14:55:47 pilosa: replication error: http: cannot connect to translate store endpoint: Get http://10.239.35.176:10101/internal/translate/data?offset=0: dial tcp 10.239.35.176:10101: connect: no route to host
2018/10/11 14:55:48 pilosa: reconnecting to primary replica
2018/10/11 14:55:48 pilosa: replicating from offset 0
2018/10/11 14:55:48 pilosa: replication error: http: cannot connect to translate store endpoint: Get http://10.239.35.176:10101/internal/translate/data?offset=0: dial tcp 10.239.35.176:10101: connect: no route to host
2018/10/11 14:55:49 pilosa: reconnecting to primary replica
2018/10/11 14:55:49 pilosa: replicating from offset 0
2018/10/11 14:55:49 pilosa: replication error: http: cannot connect to translate store endpoint: Get http://10.239.35.176:10101/internal/translate/data?offset=0: dial tcp 10.239.35.176:10101: connect: no route to host
2018/10/11 14:55:50 pilosa: reconnecting to primary replica
2018/10/11 14:55:50 pilosa: replicating from offset 0
2018/10/11 14:55:50 pilosa: replication error: http: cannot connect to translate store endpoint: Get http://10.239.35.176:10101/internal/translate/data?offset=0: dial tcp 10.239.35.176:10101: connect: no route to host
Thank you for reporting this issue. Could you post the configuration you are using for each node?
All settings are set via environment variables.
export PILOSA_BIND='10.239.41.59:10101'
export PILOSA_CLUSTER_REPLICAS='3'
export PILOSA_DATA_DIR='/data/pilosa'
export PILOSA_GOSSIP_PORT='11101'
export PILOSA_GOSSIP_SEEDS='dev-pilosa-0.dev-pilosa-headless.dev-pilosa.svc.cluster.local:11101,dev-pilosa-1.dev-pilosa-headless.dev-pilosa.svc.cluster.local:11101,dev-pilosa-2.dev-pilosa-headless.dev-pilosa.svc.cluster.local:11101'
export PILOSA_VERBOSE='true'
Coordinator node additionally has:
export PILOSA_CLUSTER_COORDINATOR='true'
It's possible that this is resolved by #1717; we need to confirm.