different behavior between versions 2.2.4 and 3.2.0

Open rinsozheng opened this issue 8 years ago • 8 comments

We have 3 nodes A, B and C; A is the etcd leader. When we bring down the network connection between A and B (this happens often when A, B and C are in different data centers), we see different behavior between version 2.2.4 and 3.2.0.

In 2.2.4, when we bring down the network connection between A and B, C becomes the new leader, so both A and B can connect to the new leader. This result is what we want, because we need A and B to be able to communicate through C after the connection between A and B is broken.

In 3.2.0, when we bring down the network connection between A and B, B becomes a candidate and cannot receive any new data; in this case, A and B cannot communicate with each other even though C is healthy.

I don't know which behavior is intended, and I want to know whether there is a solution for the following case: I have 3 data centers, with A, B and C each in a separate one. When the network connection between A and B is lost, how can etcd clients in every data center still read data from and write data to the etcd cluster?

Version 2.2.4 can achieve this goal, but I am not sure whether that behavior is obsolete. If we want to use a newer version, is there a parameter to enable this behavior?

rinsozheng avatar Jun 19 '17 03:06 rinsozheng

we see different behavior between version 2.2.4 and 3.2.0.

How did you configure them (e.g. heartbeat interval, election timeout)? If the leader election is triggered by a network disconnection, both versions should behave the same.

When a v3.2 cluster's leader goes down from SIGINT, it automatically transfers its leadership to one of the longest-connected peers. For 3.3, we are making leadership transfer configurable (https://github.com/coreos/etcd/issues/7584). But leadership transfer does not happen in network-disconnection cases.
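
Once that lands, a manual transfer would look roughly like the sketch below; the move-leader subcommand, endpoint, and member ID here are placeholders rather than the final interface:

# ask the current leader to hand leadership to the member with this (hypothetical) ID
ETCDCTL_API=3 etcdctl --endpoints http://127.0.0.1:2379 move-leader 8e9e05c52164694d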

gyuho avatar Jun 19 '17 16:06 gyuho

etcd doesn't reason about the network topology enough for that case. Even if B triggered a new election there's no guarantee A and C won't elect A again; it's surprising that kind of failover appears to work reliably on 2.2.4.

Interesting use case that might be worth deliberately pursuing, though.

heyitsanthony avatar Jun 19 '17 21:06 heyitsanthony

The heartbeat/election timeouts are set to the default values:

heartbeat = 100ms
election = 1000ms
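
For reference, these are just the defaults passed on the command line (both flags take milliseconds); a sketch, not our exact startup command:

# default timing flags; all other flags omitted
etcd --heartbeat-interval 100 --election-timeout 1000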

As heyitsanthony said, when B triggers a new election, in v3.2.0 A and C do not hold a new election, while in v2.2.4 C is elected as leader each time.

rinsozheng avatar Jun 20 '17 02:06 rinsozheng

Are there any configuration parameters in v3.2.0 that would make node C the leader when the network is broken between node A and node B? This is what we want in practice.

rinsozheng avatar Jun 27 '17 11:06 rinsozheng

@rinsozheng no, not yet; etcd doesn't reason about the network topology enough to do that.

heyitsanthony avatar Jul 05 '17 19:07 heyitsanthony

Hello everybody! I investigated this problem and found the following.

After commit 337ef64ed, a follower that can still see all the other nodes can no longer be elected as the leader. For a while, the old behavior could be restored by disabling CheckQuorum, but this parameter is not configurable and is hard-coded to true. In newer releases, the PreVote mechanism also prevents such a follower (one that sees all the nodes) from being successfully elected as a leader.
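
A quick way to confirm this in the source tree is to grep for the two knobs (a sketch; the exact files differ between releases):

# show where CheckQuorum and PreVote are referenced in the Go sources
git grep -n 'CheckQuorum' -- '*.go' | head
git grep -n 'PreVote' -- '*.go' | head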

To find the commit in which the behavior changed, I used the following script.

#!/bin/bash
set -xue

# git history
# v3.5.0
# ...
# v3.0.0
# ...
NEW_BEHAVIOR=337ef64ed # After the network split, the follower that cannot see the leader breaks, and the leader cannot be re-elected
OLD_BEHAVIOR=fb64c8ccf # The leader is successfully re-elected after the network split
# ...
# v3.0.0-beta.0
# ...
# v2.2.4

# Let's say there are three nodes X Y Z, node Z is the current cluster leader.
# If at this time there is a network split between nodes Y and Z or X and Z,
# after the commit #337ef64ed, the follower that still sees all the other nodes cannot be elected as the leader.

COMMIT=${1:-$OLD_BEHAVIOR}

if ! grep docker /proc/1/cgroup; then
    docker build -t issue-8129 - <<-EOF
    FROM ubuntu:latest
    RUN apt update
    RUN apt -y --force-yes install iptables iproute2 less iputils-ping netcat-openbsd
EOF
    if ! [[ "${COMMIT^^*}" =~ HEAD ]]; then
        # Old releases that cannot be compiled by new versions of go due to problems with modules
        git checkout $COMMIT
        docker run -it --rm -v $PWD:/cwd  golang:1.12.17 bash -xc '
            mkdir -vp /go/src/github.com/coreos;
            ln -s /cwd /go/src/github.com/coreos/etcd;
            cd $_;
            bash -x ./build;
            rm -r gopath;
            chmod -vR 777 bin
        '
    else
        ./build.sh
    fi
    docker run -it --name issue-8129 --privileged --rm -v $PWD:/cwd issue-8129 /cwd/issue-8129.sh
    exit 0
fi

echo 1 |tee /proc/sys/net/ipv4/ip_forward
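
# add_iface: give etcd instance $1 its own network namespace and a veth pair,
# with 192.168.$1.1 inside the namespace and 192.168.$1.2 on the host side.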
add_iface() {
    ip netns add e$1.ns
    ip netns exec e$1.ns ip link set up lo

    ip link add e$1.l type veth peer name e$1.r
    ip link set e$1.r netns e$1.ns

    ip netns exec e$1.ns ip link set up e$1.r
    ip netns exec e$1.ns ip addr add dev e$1.r 192.168.$1.1/24
    ip netns exec e$1.ns ip route add default via 192.168.$1.2

    ip link set up dev e$1.l
    ip addr add dev e$1.l 192.168.$1.2/24
}
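
# instance: start etcd member e$1 inside its namespace with a fresh data dir;
# the until loop restarts it if it exits with an error, appending logs to e$1/e$1.log.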

instance() {
    add_iface $1

    rm -vrf "e$1"
    name="e$1"
    mkdir -vp $name

    until ip netns exec e$1.ns /cwd/bin/etcd --name $name \
          --data-dir $name \
          --listen-client-urls http://192.168.$1.1:2379 \
          --advertise-client-urls http://192.168.$1.1:2379 \
          --listen-peer-urls http://192.168.$1.1:2380 \
          --initial-advertise-peer-urls http://192.168.$1.1:2380 \
          --initial-cluster e1=http://192.168.1.1:2380,e2=http://192.168.2.1:2380,e3=http://192.168.3.1:2380 \
          --initial-cluster-token tkn \
          --initial-cluster-state new &>>$name/$name.log; do
        sleep 1
    done
}
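
# check: print cluster health/status, choosing the etcdctl invocation that matches
# the version of the freshly built binary (v2 cluster-health vs v3 endpoint status).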

check() {
    local ver
    ver=$(/cwd/bin/etcd --version|grep -Po 'etcd Version:\s\K.*')
    if [[ "$ver" =~ ^2[.] ]];then
        /cwd/bin/etcdctl \
          --endpoint http://192.168.1.1:2379,http://192.168.2.1:2379,http://192.168.3.1:2379 \
          -o extended cluster-health
    else
        cluster=""
        if [[ "$ver" =~ ^3[.]3[.] ]]; then 
            cluster="--cluster=true "
        fi
        ETCDCTL_API=3 /cwd/bin/etcdctl \
          --endpoints 192.168.1.1:2379,192.168.2.1:2379,192.168.3.1:2379 \
          $cluster endpoint status -w table
    fi
}

DIR="/cwd/tmp-issue-8129"
mkdir -vp $DIR
pushd $DIR

/cwd/bin/etcd --version

instance 1 &
instance 2 &
instance 3 &

until check; do
    sleep 0.5 # wait for cluster
done
ps axf|grep '\<[e]tcd '

read -p "add network split? (Y/y):"
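# Simulate the split: reject forwarded TCP traffic between e1 and e3 in both directions.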
iptables -I FORWARD -p tcp -s 192.168.1.1 -d 192.168.3.1 -j REJECT
iptables -I FORWARD -p tcp -s 192.168.3.1 -d 192.168.1.1 -j REJECT

trap 'set -x; pkill -9 -x etcd' TERM EXIT
set +xe
while true; do
    echo ">>>>>>>>>>>>>>>> iptables <<<<<<<<<<<<<<<<<<"
    iptables-save -c|grep FORWARD
    echo "................. check ...................."
    check
    echo "+++++++++++++++ procs ++++++++++++++++++++++"
    ps axf|grep '\<[e]tcd '|cut -c-80
    echo "################# LOGS ###################"
    tail -n5 e*/e*.log
    echo
    echo '_________________ [END] _________________'
    sleep 5
done

sakateka avatar Jul 23 '21 17:07 sakateka

@rinsozheng I can suggest what seems to me the right solution for maintaining constant connectivity between all nodes. Say node A can see everyone, but nodes B and C cannot see each other because of a network failure. Add a low-priority network route on node B that goes through node A's data center (and, correspondingly, a route in the other direction on node C); that route will then be used when the direct connectivity between nodes B and C is broken.
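
A rough sketch of what that could look like on node B, assuming the subnets, gateway addresses and metrics below (all of them invented for illustration):

# preferred direct route to C's subnet while the B<->C link is healthy
ip route add 10.0.3.0/24 via 10.0.23.1 metric 100
# fallback route to C's subnet through a gateway in A's data center
ip route add 10.0.3.0/24 via 10.0.12.1 metric 200

Note that the kernel only falls back to the higher-metric route once the preferred route is withdrawn (e.g. by a routing daemon or link monitoring), not merely because packets on the direct path are being dropped.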

sakateka avatar Jul 23 '21 17:07 sakateka

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 21 '22 02:09 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 31 '22 23:12 stale[bot]