different behavior on versions 2.2.4 and 3.2.0
We have 3 nodes A, B, and C, with A as the etcd leader. When we bring down the network connection between A and B (this happens often when A, B, and C are in different data centers), we see different behavior between versions 2.2.4 and 3.2.0.
In 2.2.4, when we bring down the network connection between A and B, C becomes the new leader, so both A and B can connect to the new leader. This is the result we want, because we need A and B to keep communicating through C after the connection between A and B is broken.
In 3.2.0, when we bring down the network connection between A and B, B becomes a candidate and cannot receive any new data; in this case A and B cannot communicate with each other even though C is healthy.
I don't know which is the intended behavior, and I want to know whether there is any solution for the following case: I have 3 data centers, with A, B, and C each in a separate data center. When the network connection between A and B is lost, how can etcd clients in every data center still read/write data from/to the etcd cluster?
Version 2.2.4 can achieve this goal, but I am not sure whether that behavior is now obsolete. And if we want to use the new version, is there any parameter to enable this behavior?
we see different behavior between versions 2.2.4 and 3.2.0.
How did you configure them (e.g. heartbeats, election timeout)? If leader election happens from network disconnection, they would behave the same.
When a v3.2 cluster's leader goes down from SIGINT, the leader automatically transfers its leadership to one of the longest-connected peers. For 3.3, we are making leadership transfer configurable (https://github.com/coreos/etcd/issues/7584). But leadership transfer does not happen in network disconnection cases.
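Leadership can also be moved by hand with etcdctl's move-leader command (added with the 3.3 tooling, if I remember right), but it requires the current leader to be reachable, so it does not help during a partition either. A quick sketch with a made-up member ID:

ETCDCTL_API=3 etcdctl --endpoints=http://192.168.1.1:2379 move-leader 8211f1d0f64f3269
# 8211f1d0f64f3269 is a placeholder; real member IDs come from
# `etcdctl endpoint status -w table` or `etcdctl member list`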
etcd doesn't reason about the network topology enough for that case. Even if B triggered a new election, there's no guarantee A and C won't elect A again; it's surprising that kind of failover appears to work reliably on 2.2.4.
Interesting use case that might be worth deliberately pursuing, though.
The heartbeat/election timeouts are set to the default values:
heartbeat = 100ms, election = 1000ms
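For reference, these correspond to the etcd server flags --heartbeat-interval and --election-timeout, both in milliseconds; set explicitly to the defaults they would look like this (the addresses are placeholders and the remaining cluster flags are omitted):

etcd --name e1 \
    --heartbeat-interval=100 \
    --election-timeout=1000 \
    --listen-peer-urls http://192.168.1.1:2380 \
    --listen-client-urls http://192.168.1.1:2379
# plus the usual --initial-cluster / advertise-URL flags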
Just like heyitsanthony said, when B triggers a new election, in v3.2.0 A and C won't hold another election, while in v2.2.4 C is elected as the new leader each time.
Is there any configuration or parameter to set in v3.2.0 to make node C the leader when the network is broken between node A and node B? That is what we want in our practice.
@rinsozheng no, not yet; etcd doesn't reason about the network topology enough to do that.
Hello everybody! I investigated this problem and found the following.
After commit 337ef64ed, a follower that sees all other nodes can no longer be elected as the leader.
For a while the old behavior could be brought back by disabling CheckQuorum, but this parameter is not configurable and is set to true in the code.
In newer releases the PreVote mechanism has also appeared, which likewise prevents a follower (even one that sees all the nodes) from being successfully elected as leader.
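If I'm not mistaken about flag availability, PreVote is at least exposed as a server flag in 3.4 and later, while CheckQuorum has no flag at all. A sketch (addresses are placeholders, remaining cluster flags omitted):

# PreVote can be toggled on the server; the --pre-vote flag exists in 3.4+
# (default false in 3.4, true in 3.5):
etcd --name e1 --pre-vote=false \
    --listen-peer-urls http://192.168.1.1:2380 \
    --listen-client-urls http://192.168.1.1:2379
# There is no flag for CheckQuorum; it is hard-coded to true in the server's
# raft configuration, so the pre-337ef64ed election behavior cannot be
# restored by configuration alone.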
To find the commit in which the behavior changed, I used the following script.
#!/bin/bash
set -xue
# git history
# v3.5.0
# ...
# v3.0.0
# ...
NEW_BEHAVIOR=337ef64ed # The network split breaks the follower that does not see the leader, and a new leader cannot be elected
OLD_BEHAVIOR=fb64c8ccf # The leader is successfully re-elected after the network split
# ...
# v3.0.0-beta.0
# ...
# v2.2.4
# Let's say there are three nodes X, Y, and Z, and node Z is the current cluster leader.
# If at this time there is a network split between nodes Y and Z or between X and Z,
# then after commit #337ef64ed the follower that sees all other nodes cannot be elected as the leader.
COMMIT=${1:-$OLD_BEHAVIOR}
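# When this script is run on the host (outside docker), build the helper image,
# build the requested etcd commit, and then re-run this same script inside a
# privileged container.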
if ! grep docker /proc/1/cgroup; then
    docker build -t issue-8129 - <<-EOF
FROM ubuntu:latest
RUN apt update
RUN apt -y --force-yes install iptables iproute2 less iputils-ping netcat-openbsd
EOF
    if ! [[ "${COMMIT^^*}" =~ HEAD ]]; then
        # Old releases that cannot be compiled by new versions of go due to problems with modules
        git checkout $COMMIT
        docker run -it --rm -v $PWD:/cwd golang:1.12.17 bash -xc '
            mkdir -vp /go/src/github.com/coreos;
            ln -s /cwd /go/src/github.com/coreos/etcd;
            cd $_;
            bash -x ./build;
            rm -r gopath;
            chmod -vR 777 bin
        '
    else
        ./build.sh
    fi
    docker run -it --name issue-8129 --privileged --rm -v $PWD:/cwd issue-8129 /cwd/issue-8129.sh
    exit 0
fi
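# Everything below runs inside the privileged container.
# Enable IPv4 forwarding so the root namespace can route traffic between the
# per-node namespaces created below.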
echo 1 |tee /proc/sys/net/ipv4/ip_forward
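# add_iface N: create network namespace eN.ns and a veth pair; the node side
# gets 192.168.N.1/24 inside the namespace, the host side gets 192.168.N.2/24
# and serves as the namespace's default gateway.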
add_iface() {
    ip netns add e$1.ns
    ip netns exec e$1.ns ip link set up lo
    ip link add e$1.l type veth peer name e$1.r
    ip link set e$1.r netns e$1.ns
    ip netns exec e$1.ns ip link set up e$1.r
    ip netns exec e$1.ns ip addr add dev e$1.r 192.168.$1.1/24
    ip netns exec e$1.ns ip route add default via 192.168.$1.2
    ip link set up dev e$1.l
    ip addr add dev e$1.l 192.168.$1.2/24
}
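# instance N: set up the interface, wipe any old data directory, and keep
# (re)starting etcd member eN inside its namespace; output goes to eN/eN.log.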
instance() {
    add_iface $1
    rm -vrf "e$1"
    name="e$1"
    mkdir -vp $name
    until ip netns exec e$1.ns /cwd/bin/etcd --name $name \
        --data-dir $name \
        --listen-client-urls http://192.168.$1.1:2379 \
        --advertise-client-urls http://192.168.$1.1:2379 \
        --listen-peer-urls http://192.168.$1.1:2380 \
        --initial-advertise-peer-urls http://192.168.$1.1:2380 \
        --initial-cluster e1=http://192.168.1.1:2380,e2=http://192.168.2.1:2380,e3=http://192.168.3.1:2380 \
        --initial-cluster-token tkn \
        --initial-cluster-state new &>>$name/$name.log; do
        sleep 1
    done
}
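# check: print cluster health/status, using the v2 etcdctl syntax
# (cluster-health) or the v3 API (endpoint status) depending on the binary.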
check() {
    local ver
    ver=$(/cwd/bin/etcd --version|grep -Po 'etcd Version:\s\K.*')
    if [[ "$ver" =~ ^2[.] ]]; then
        /cwd/bin/etcdctl \
            --endpoint http://192.168.1.1:2379,http://192.168.2.1:2379,http://192.168.3.1:2379 \
            -o extended cluster-health
    else
        cluster=""
        if [[ "$ver" =~ ^3[.]3[.] ]]; then
            cluster="--cluster=true "
        fi
        ETCDCTL_API=3 /cwd/bin/etcdctl \
            --endpoints 192.168.1.1:2379,192.168.2.1:2379,192.168.3.1:2379 \
            $cluster endpoint status -w table
    fi
}
DIR="/cwd/tmp-issue-8129"
mkdir -vp $DIR
pushd $DIR
/cwd/bin/etcd --version
instance 1 &
instance 2 &
instance 3 &
until check; do
    sleep 0.5 # wait for cluster
done
ps axf|grep '\<[e]tcd '
read -p "add network split? (Y/y):"
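# Simulate the split between e1 and e3: all inter-node traffic is forwarded by
# the root namespace, so rejecting it in both directions cuts only that one
# link (e2 still reaches both peers).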
iptables -I FORWARD -p tcp -s 192.168.1.1 -d 192.168.3.1 -j REJECT
iptables -I FORWARD -p tcp -s 192.168.3.1 -d 192.168.1.1 -j REJECT
trap 'set -x; pkill -9 -x etcd' TERM EXIT
set +xe
while true; do
    echo ">>>>>>>>>>>>>>>> iptables <<<<<<<<<<<<<<<<<<"
    iptables-save -c|grep FORWARD
    echo "................. check ...................."
    check
    echo "+++++++++++++++ procs ++++++++++++++++++++++"
    ps axf|grep '\<[e]tcd '|cut -c-80
    echo "################# LOGS ###################"
    tail -n5 e*/e*.log
    echo
    echo '_________________ [END] _________________'
    sleep 5
done
@rinsozheng I can suggest what seems to me the right way to maintain constant connectivity between all nodes. Let's say node A sees everyone, but nodes B and C do not see each other because of a network failure. You need to add a low-priority network route on node B (and, correspondingly, in the other direction on node C) that goes through node A's data center; this route will then be used when the direct connectivity between nodes B and C is broken. A sketch is below.
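To illustrate the idea (all addresses and gateways below are made up): on node B, keep the direct route to C's subnet preferred and add a second, higher-metric route that hairpins through data center A. Note that the kernel only falls back to the second route once the first one disappears (for example when the link goes down or a routing daemon withdraws it), so some form of route monitoring is still needed.

# On node B; hypothetical addresses: 10.0.3.0/24 is C's etcd subnet,
# 10.0.23.1 is the direct gateway towards C, 10.0.12.1 points towards A.
ip route add 10.0.3.0/24 via 10.0.23.1 metric 100   # preferred: direct B -> C
ip route add 10.0.3.0/24 via 10.0.12.1 metric 200   # fallback: B -> A -> C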
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.