Lock API stuck if multiple calls abort concurrently

cfz opened this issue 4 years ago • 7 comments

we found a deadlock in LockServer when a large number of calls abort while waiting for the same lock. No new lock can be acquired once the deadlock has happened.

the issue can be reproduced with the following code:

cfz@cfz-desktop:~/ws/etcd-issue-13336$ cat main.go 
package main

import (
        "context"
        "fmt"
        "sync"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
        "go.etcd.io/etcd/client/v3/concurrency"
        "go.etcd.io/etcd/server/v3/etcdserver/api/v3lock/v3lockpb"
)

func main() {
        cli, _ := clientv3.New(clientv3.Config{
                Endpoints:   []string{"localhost:2379"},
                DialTimeout: 3 * time.Second,
        })

        defer cli.Close()

        lc := v3lockpb.NewLockClient(cli.ActiveConnection())

        var wg sync.WaitGroup
        wg.Add(2) // expect 2 clients to acquire the lock before exit

        lockFunc := func() {
                s, _ := concurrency.NewSession(cli, concurrency.WithTTL(5))
                res, _ := lc.Lock(context.TODO(), &v3lockpb.LockRequest{Name: []byte("/lock"), Lease: int64(s.Lease())})
                fmt.Printf("locked %x\n", s.Lease())
                time.Sleep(100 * time.Millisecond)
                lc.Unlock(context.TODO(), &v3lockpb.UnlockRequest{Key: res.Key})
                wg.Done()
        }

        for i := 0; i < 50; i++ {
                go lockFunc()
        }

        wg.Wait()
}
cfz@cfz-desktop:~/ws/etcd-issue-13336$ 
cfz@cfz-desktop:~/ws/etcd-issue-13336$ go run main.go 
locked 694d7bca99b19c04
locked 694d7bca99b19c07
cfz@cfz-desktop:~/ws/etcd-issue-13336$ 
cfz@cfz-desktop:~/ws/etcd-issue-13336$ go run main.go 
locked 694d7bca99b19cef
^Csignal: interrupt <-- stuck here
cfz@cfz-desktop:~/ws/etcd-issue-13336$ 

the expected behavior of the program is that 2 locks can be acquired before the program exits.

as shown in the console, the 2nd run of the program got stuck.

after some digging, we found the issue is caused by a deadlock between LockServer's watch client and WatchServer: the former sends "cancel" messages to the watch server synchronously when a lock client aborts, and the latter responds with "cancelled" messages synchronously (through a channel with a 16-message buffer).

if a large number of clients abort simultaneously, the event loop of LockServer's watch client may be saturated by closing events, which slows down the message-receiving path and eventually causes the watch server's ctl stream buffer to pile up.
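
to make the pattern concrete, here is a minimal sketch (illustration only, not the actual etcd code) of two loops that send to each other over bounded channels: the "server" pushes a response for every cancel it receives, the "client" fires a burst of cancels without draining responses, and once the small response buffer fills up both sides block on their sends.

package main

import "fmt"

// minimal sketch of the deadlock pattern described above (NOT etcd code):
// neither loop reads its inbox while blocked on a send, so both stall once
// the small response buffer (standing in for the 16-message ctl stream
// buffer) fills up.
func main() {
	cancelReqs := make(chan int)     // client -> server, unbuffered
	cancelResps := make(chan int, 2) // server -> client, tiny buffer for the demo

	// "server": for every cancel request, synchronously push a response.
	go func() {
		for req := range cancelReqs {
			cancelResps <- req // blocks once the response buffer is full
		}
	}()

	// "client": fires a burst of cancels without draining responses,
	// the way a burst of aborted lock waiters would.
	for i := 0; i < 10; i++ {
		cancelReqs <- i // blocks forever once the server is stuck sending
	}

	// never reached: the client only starts draining responses here.
	for resp := range cancelResps {
		fmt.Println("cancelled", resp)
	}
}

running this, the Go runtime reports "all goroutines are asleep - deadlock!"; inside a real server with plenty of other live goroutines, the two loops would simply hang silently.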

cfz avatar Sep 09 '21 13:09 cfz

even a normal watch API call may get stuck when the number of watchers is large:

package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, _ := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 3 * time.Second,
	})

	defer cli.Close()

	var wg sync.WaitGroup
	ctx, cancel := context.WithCancel(context.TODO())

	watchFunc := func() {
		cli.Watch(ctx, "")
		wg.Done()
	}

	for i := 0; i < 20000; i++ {
		wg.Add(1)
		go watchFunc()
	}

	wg.Wait()

	fmt.Println("closing")
	cancel()

	time.Sleep(2 * time.Second)

	fmt.Println("watching again")
	cli.Watch(context.TODO(), "") // <--- would get stuck here
	fmt.Println("watched")
}

i guess the reason why many more watchers are needed to jam the watch channel is that both gRPC and the underlying TCP transport have buffers, while the lock API uses channels with buffer sizes of 1 and 0.
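
as a rough illustration of that point (assuming nothing about the real buffer sizes beyond what is stated above), counting how many messages of a burst fit into a channel before anyone reads it shows why a tiny buffer jams almost immediately while a larger one absorbs the whole burst:

package main

import "fmt"

// rough illustration of the buffering argument above (not etcd code):
// count how many messages of a burst fit into a channel with no reader.
func absorbed(bufSize, burst int) int {
	ch := make(chan struct{}, bufSize)
	for i := 0; i < burst; i++ {
		select {
		case ch <- struct{}{}: // still fits in the buffer
		default: // would block: from here on the receiver has to keep up
			return i
		}
	}
	return burst
}

func main() {
	for _, size := range []int{0, 1, 16, 4096} {
		fmt.Printf("buffer %4d absorbs %4d of a 100-message burst\n",
			size, absorbed(size, 100))
	}
}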

cfz avatar Sep 13 '21 05:09 cfz

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 13 '21 21:12 stale[bot]

Thanks for pinpointing the problem. I also hit the same bug when pressure-testing etcd, but I couldn't find out why. So I rebuilt with your code, but it works imperfectly: it seems to work well when I stop my program completely and restart it after an interval, but not when I restart it immediately. Do you have any further bug fix?

lpcy avatar Dec 17 '21 11:12 lpcy

hi @lpcy , i think this is the expected behavior of my fix. as you can see, i added a 250ms timeout on the sending loop, which means that if a large number of "cancelled" events pile up (in case lots of watchers disconnect abnormally), it takes time to drop all of them.

i cannot find an easy way to fix the problem perfectly; that would probably require refactoring the current watch client/server logic.
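
for reference, this is the idea in sketch form (illustrative only, not the actual patch; the 250ms constant and the names below are made up for the example): a blocked send is abandoned after a timeout so the loop keeps making progress, at the cost of dropping that event.

package main

import (
	"fmt"
	"time"
)

// sketch of the "timeout on the sending loop" idea described above
// (illustrative only, not the actual etcd patch).
const sendTimeout = 250 * time.Millisecond

// sendWithTimeout gives up on a blocked send after sendTimeout and
// reports whether the event was delivered.
func sendWithTimeout(out chan<- string, ev string) bool {
	t := time.NewTimer(sendTimeout)
	defer t.Stop()
	select {
	case out <- ev:
		return true
	case <-t.C:
		return false // receiver is stuck; drop the event instead of deadlocking
	}
}

func main() {
	out := make(chan string, 1) // tiny buffer: the second send would block
	for _, ev := range []string{"cancelled-1", "cancelled-2", "cancelled-3"} {
		if !sendWithTimeout(out, ev) {
			fmt.Println("dropped", ev)
		}
	}
}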

cfz avatar Dec 25 '21 13:12 cfz

Thanks for taking the time to reply. For now I have no idea how to continue other than waiting for maintainers who can fix it.

lpcy avatar Dec 28 '21 06:12 lpcy

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 30 '22 05:03 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 10 '22 13:07 stale[bot]