etcd
Lock API stuck if multiple calls abort concurrently
We found a deadlock in LockServer when a large number of calls abort while waiting for the same lock. Once the deadlock happens, no new lock can be acquired.
The issue can be reproduced with the following code:
cfz@cfz-desktop:~/ws/etcd-issue-13336$ cat main.go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
	"go.etcd.io/etcd/server/v3/etcdserver/api/v3lock/v3lockpb"
)

func main() {
	cli, _ := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 3 * time.Second,
	})
	defer cli.Close()
	lc := v3lockpb.NewLockClient(cli.ActiveConnection())

	var wg sync.WaitGroup
	wg.Add(2) // expect 2 clients to acquire the lock before exit
	lockFunc := func() {
		s, _ := concurrency.NewSession(cli, concurrency.WithTTL(5))
		res, _ := lc.Lock(context.TODO(), &v3lockpb.LockRequest{Name: []byte("/lock"), Lease: int64(s.Lease())})
		fmt.Printf("locked %x\n", s.Lease())
		time.Sleep(100 * time.Millisecond)
		lc.Unlock(context.TODO(), &v3lockpb.UnlockRequest{Key: res.Key})
		wg.Done()
	}
	// 50 goroutines race for the same lock; the waiters still pending when main exits abort.
	for i := 0; i < 50; i++ {
		go lockFunc()
	}
	wg.Wait()
}
cfz@cfz-desktop:~/ws/etcd-issue-13336$
cfz@cfz-desktop:~/ws/etcd-issue-13336$ go run main.go
locked 694d7bca99b19c04
locked 694d7bca99b19c07
cfz@cfz-desktop:~/ws/etcd-issue-13336$
cfz@cfz-desktop:~/ws/etcd-issue-13336$ go run main.go
locked 694d7bca99b19cef
^Csignal: interrupt <-- stuck here
cfz@cfz-desktop:~/ws/etcd-issue-13336$
The expected behavior of the program is that 2 locks can be acquired before it exits.
As shown in the console, the second run of the program gets stuck.
After some digging, we found the issue is caused by a deadlock between LockServer's watch client and the watch server: the former synchronously sends "cancel" messages to the watch server when a lock client aborts, and the latter synchronously sends back "cancelled" messages (through a channel with a 16-message buffer).
If a large number of clients abort simultaneously, the event loop of LockServer's watch client may be saturated by closing events, which slows down its message-receiving path and eventually causes the watch server's ctl stream buffer to pile up.
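To make the shape of the deadlock concrete, here is a minimal, self-contained Go sketch (not etcd code; the channel names and sizes are illustrative assumptions) of two sides that send to each other synchronously over bounded channels: once the 16-slot reply buffer fills, the replying side blocks, and then the sending side blocks too.

package main

import (
	"fmt"
	"time"
)

func main() {
	requests := make(chan string)    // "watch client" -> "watch server", synchronous send
	replies := make(chan string, 16) // "watch server" -> "watch client", small buffer like the ctl stream

	// "Watch server" side: for every cancel request, synchronously emit a cancelled reply.
	go func() {
		for req := range requests {
			replies <- "cancelled:" + req // blocks once the 16-slot buffer is full
		}
	}()

	// "Watch client" side: its event loop is busy closing watchers, so it keeps
	// sending cancels without ever draining the replies.
	go func() {
		for i := 0; i < 64; i++ {
			requests <- fmt.Sprintf("watcher-%d", i) // blocks once the other side is blocked
		}
	}()

	time.Sleep(time.Second)
	fmt.Println("replies buffered:", len(replies), "- both goroutines are now stuck")
}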
Even a normal Watch API call may get stuck when the number of watchers is large:
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, _ := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 3 * time.Second,
	})
	defer cli.Close()

	var wg sync.WaitGroup
	ctx, cancel := context.WithCancel(context.TODO())
	watchFunc := func() {
		cli.Watch(ctx, "")
		wg.Done()
	}
	// open a large number of watchers, then cancel them all at once
	for i := 0; i < 20000; i++ {
		wg.Add(1)
		go watchFunc()
	}
	wg.Wait()
	fmt.Println("closing")
	cancel()
	time.Sleep(2 * time.Second)
	fmt.Println("watching again")
	cli.Watch(context.TODO(), "") // <--- would get stuck here
	fmt.Println("watched")
}
I guess the reason it takes many more watchers to jam the watch channel is that both gRPC and the underlying TCP transport have their own buffers, while the Lock API uses channels with buffer sizes of 1 and 0.
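As a rough illustration of that point (plain Go channels only, not the actual etcd code paths): a send into a channel with buffer N only blocks once N messages sit unconsumed, so larger transport buffers merely delay the stall rather than prevent it.

package main

import "fmt"

// trySend attempts a non-blocking send; returning false marks the point
// where the real code would block instead.
func trySend(c chan int, v int) bool {
	select {
	case c <- v:
		return true
	default:
		return false
	}
}

func main() {
	for _, size := range []int{0, 1, 16} {
		c := make(chan int, size)
		sent := 0
		for trySend(c, sent) {
			sent++
		}
		fmt.Printf("buffer %2d: sender stalls after %d unconsumed messages\n", size, sent)
	}
}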
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Thanks for pinpointing the problem. I also hit the same bug when stress-testing etcd, but I couldn't figure out why. So I rebuilt etcd with your code, but it works imperfectly: it seems to work well when I stop my program completely and restart it after an interval, but not when I restart it immediately. Do you have any further bug fix?
Hi @lpcy, I think this is the expected behavior of my fix. As you can see, I added a 250ms timeout on the sending loop, which means that if a large number of "canceled" events pile up (in case lots of watchers disconnect abnormally), it takes time to drop all of them.
I cannot find an easy way to fix the problem perfectly; doing so may require refactoring the current watch client/server logic.
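For reference, this is roughly the shape of the approach (a hedged sketch only, not the actual patch; sendTimeout and ctrlStream are names made up for illustration): bound the channel send with a timer, so that a saturated receiver causes messages to be dropped instead of blocking the sender forever.

package main

import (
	"fmt"
	"time"
)

const sendTimeout = 250 * time.Millisecond

// sendWithTimeout tries to deliver msg on ch, giving up after sendTimeout.
func sendWithTimeout(ch chan<- string, msg string) bool {
	t := time.NewTimer(sendTimeout)
	defer t.Stop()
	select {
	case ch <- msg:
		return true
	case <-t.C:
		return false // receiver saturated: drop instead of deadlocking
	}
}

func main() {
	ctrlStream := make(chan string, 16) // nobody drains it in this demo
	dropped := 0
	for i := 0; i < 20; i++ {
		if !sendWithTimeout(ctrlStream, fmt.Sprintf("cancelled-%d", i)) {
			dropped++
		}
	}
	fmt.Println("dropped", dropped, "cancelled messages after the buffer filled")
}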
Thanks for taking the time to reply. I have no idea how to continue other than to wait for the maintainers to fix it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.