UnsafeRange panicking during shutdown
Bug report criteria
- [ ] This bug report is not security related, security issues should be disclosed privately via [email protected].
- [ ] This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
- [ ] You have read the etcd bug reporting guidelines.
- [ ] Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
What happened?
Test case TestMaintenanceSnapshotCancel failed with a panic.
Refer to https://github.com/etcd-io/etcd/actions/runs/7463174417/job/20307221683?pr=17220
Based on the log, the cause appears to be that the backend had already been closed (the member was being stopped) before the snapshot operation: https://github.com/etcd-io/etcd/blob/a2eb17c8091893796e835cd564c78a7b4c917c21/server/storage/backend/backend.go#L331
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa6c27c]
goroutine 19029 [running]:
go.etcd.io/bbolt.(*Cursor).seek(0xc00044a578, {0x1aa05d4, 0x4, 0x4})
/home/runner/go/pkg/mod/go.etcd.io/[email protected]/cursor.go:159 +0x7c
go.etcd.io/bbolt.(*Bucket).Bucket(0xc0009da638, {0x1aa05d4, 0x4, 0x4})
/home/runner/go/pkg/mod/go.etcd.io/[email protected]/bucket.go:105 +0x10c
go.etcd.io/bbolt.(*Tx).Bucket(...)
/home/runner/go/pkg/mod/go.etcd.io/[email protected]/tx.go:102
go.etcd.io/etcd/server/v3/storage/backend.(*batchTx).UnsafeRange(0xc0017a7da0, {0x12e17b0, 0x1b0dc40}, {0x1aa1670, 0x10, 0x10}, {0x0, 0x0, 0x0}, 0xc4798?)
/home/runner/actions-runner/_work/etcd/etcd/server/storage/backend/batch_tx.go:174 +0xa0
go.etcd.io/etcd/server/v3/storage/schema.UnsafeReadConsistentIndex({0xffff39290758, 0xc0017a7da0})
/home/runner/actions-runner/_work/etcd/etcd/server/storage/schema/cindex.go:41 +0xa0
go.etcd.io/etcd/server/v3/storage/schema.unsafeUpdateConsistentIndex.func1()
/home/runner/actions-runner/_work/etcd/etcd/server/storage/schema/cindex.go:80 +0x54
go.etcd.io/etcd/client/pkg/v3/verify.Verify(0xc00044aa28)
/home/runner/actions-runner/_work/etcd/etcd/client/pkg/verify/verify.go:71 +0x44
go.etcd.io/etcd/server/v3/storage/schema.unsafeUpdateConsistentIndex({0x12e5920, 0xc0017a7da0}, 0x7, 0x2, 0x0)
/home/runner/actions-runner/_work/etcd/etcd/server/storage/schema/cindex.go:79 +0x1d4
go.etcd.io/etcd/server/v3/storage/schema.UnsafeUpdateConsistentIndex(...)
/home/runner/actions-runner/_work/etcd/etcd/server/storage/schema/cindex.go:67
go.etcd.io/etcd/server/v3/etcdserver/cindex.(*consistentIndex).UnsafeSave(0xc001727800, {0x12e5920, 0xc0017a7da0})
/home/runner/actions-runner/_work/etcd/etcd/server/etcdserver/cindex/cindex.go:121 +0x68
go.etcd.io/etcd/server/v3/storage.(*BackendHooks).OnPreCommitUnsafe(0xc001a0d9e0, {0x12e5920?, 0xc0017a7da0})
/home/runner/actions-runner/_work/etcd/etcd/server/storage/hooks.go:45 +0x64
go.etcd.io/etcd/server/v3/storage/backend.(*batchTxBuffered).unsafeCommit(0xc0017a7da0, 0x0)
/home/runner/actions-runner/_work/etcd/etcd/server/storage/backend/batch_tx.go:342 +0xfc
go.etcd.io/etcd/server/v3/storage/backend.(*batchTxBuffered).commit(0xc0017a7da0, 0xdc?)
/home/runner/actions-runner/_work/etcd/etcd/server/storage/backend/batch_tx.go:335 +0x70
go.etcd.io/etcd/server/v3/storage/backend.(*batchTxBuffered).Commit(0xc0017a7da0)
/home/runner/actions-runner/_work/etcd/etcd/server/storage/backend/batch_tx.go:322 +0x3c
go.etcd.io/etcd/server/v3/storage/backend.(*backend).Snapshot(0xc000329140)
/home/runner/actions-runner/_work/etcd/etcd/server/storage/backend/backend.go:331 +0x54
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*maintenanceServer).Snapshot(0xc0029ace40, 0x12e1120?, {0x12e6550, 0xc0006302b0})
/home/runner/actions-runner/_work/etcd/etcd/server/etcdserver/api/v3rpc/maintenance.go:111 +0x110
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*authMaintenanceServer).Snapshot(0xc0019d7b60, 0x103b4c0?, {0x12e6550, 0xc0006302b0})
/home/runner/actions-runner/_work/etcd/etcd/server/etcdserver/api/v3rpc/maintenance.go:296 +0xb8
go.etcd.io/etcd/api/v3/etcdserverpb._Maintenance_Snapshot_Handler({0x1004a40?, 0xc0019d7b60}, {0x12e3fc8, 0xc0028e4c00})
/home/runner/actions-runner/_work/etcd/etcd/api/etcdserverpb/rpc.pb.go:7620 +0xe8
github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).StreamServerInterceptor.func4({0x1004a40, 0xc0019d7b60}, {0x12e3d80?, 0xc0005681e0}, 0xc0028e4bd0, 0x111e8a8)
/home/runner/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/server_metrics.go:121 +0x128
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.ChainStreamServer.func7.1.1({0x1004a40, 0xc0019d7b60}, {0x12e3d80, 0xc0005681e0})
/home/runner/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:49 +0x70
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newStreamInterceptor.func1({0x1004a40, 0xc0019d7b60}, {0x12e3d80, 0xc0005681e0}, 0xc0028e4bd0, 0xc001992240)
/home/runner/actions-runner/_work/etcd/etcd/server/etcdserver/api/v3rpc/interceptor.go:258 +0x560
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.ChainStreamServer.func7.1.1({0x1004a40, 0xc0019d7b60}, {0x12e3d80, 0xc0005681e0})
/home/runner/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:49 +0x70
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.ChainStreamServer.func7({0x1004a40, 0xc0019d7b60}, {0x12e3d80, 0xc0005681e0}, 0xc0028e4bd0, 0x111e8a8)
/home/runner/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:58 +0x100
google.golang.org/grpc.(*Server).processStreamingRPC(0xc00162c400, {0x12e1120, 0xc00185e870}, {0x12e8340, 0xc000753380}, 0xc002846b40, 0xc0028ac2d0, 0x1b0c680, 0x0)
/home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:1673 +0x16b8
google.golang.org/grpc.(*Server).handleStream(0xc00162c400, {0x12e8340, 0xc000753380}, 0xc002846b40)
/home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:1787 +0x12f0
google.golang.org/grpc.(*Server).serveStreams.func2.1()
/home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:1016 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 19027
/home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:1027 +0x1d8
What did you expect to happen?
No panic when processing any client request.
How can we reproduce it (as minimally and precisely as possible)?
Write an integration test that stops a member before calling the snapshot API.
Anything else we need to know?
No response
Etcd version (please run commands below)
$ etcd --version
# paste output here
$ etcdctl version
# paste output here
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
Relevant log output
No response
UPDATED:
When the server closes the backend, the backend stops the background commit goroutine and resets the read transaction.
https://github.com/etcd-io/etcd/blob/c8b4b168aff7f84d3b10dc9ddc13a381f15a710c/server/storage/backend/read_tx.go#L133-L138
In baseReadTx, UnsafeRange neither checks whether the tx is still valid nor updates txWg.
https://github.com/etcd-io/etcd/blob/c8b4b168aff7f84d3b10dc9ddc13a381f15a710c/server/storage/backend/batch_tx.go#L367-L376
So the tx can be set to nil while UnsafeRange is still in progress.
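For illustration only, here is a rough sketch (with hypothetical names, not the actual etcd baseReadTx code) of the kind of guard that is missing: check that the shared tx is still valid and register with a wait group before touching it, so that Close cannot nil the tx out from under a reader.

package backendsketch

import (
	"errors"
	"sync"

	bolt "go.etcd.io/bbolt"
)

var errReadTxClosed = errors.New("backend: read tx already closed")

// readTx is a simplified stand-in for baseReadTx; all names are hypothetical.
type readTx struct {
	mu sync.RWMutex
	wg sync.WaitGroup // tracks in-flight readers, similar in spirit to txWg
	tx *bolt.Tx
}

// unsafeRangeGuarded shows the missing checks: fail fast if the backend has
// already been closed, and keep the tx alive until this reader is done.
func (rt *readTx) unsafeRangeGuarded(bucket []byte, visit func(*bolt.Bucket)) error {
	rt.mu.RLock()
	if rt.tx == nil { // Close has already reset the tx
		rt.mu.RUnlock()
		return errReadTxClosed
	}
	rt.wg.Add(1)
	tx := rt.tx
	rt.mu.RUnlock()
	defer rt.wg.Done()

	if b := tx.Bucket(bucket); b != nil {
		visit(b)
	}
	return nil
}

// close mirrors what the backend does on shutdown: wait for in-flight
// readers, then roll back and drop the shared tx.
func (rt *readTx) close() error {
	rt.mu.Lock()
	defer rt.mu.Unlock()
	rt.wg.Wait()
	if rt.tx == nil {
		return nil
	}
	err := rt.tx.Rollback()
	rt.tx = nil
	return err
}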
Another related case: https://github.com/etcd-io/bbolt/issues/715
@fuweid would you be able to propose a fix? The issue showed up in robustness tests, which I would prefer to keep flake free.
Hi @serathius, sure. I will file a pull request later.
Hi @ahrtr @serathius
Sorry for taking so long on this issue. It has been fixed at the gRPC layer by https://github.com/grpc/grpc-go/commit/61eab37838ce213237ecb31aa7cdf95241851431 (released in v1.61.0 to fix a regression): all requests are tracked by the handlerWG wait group. When we call GracefulStop, it blocks until all in-flight requests are finished, including streaming RPCs such as Watch/Snapshot/LeaseRenew.
We call GracefulStop when we receive a SIGTERM signal, except in cmux mode.
https://github.com/etcd-io/etcd/blob/a7f5d4b4e4569bd316277ebf1347785e0467c64d/server/embed/etcd.go#L474-L493
We don't need to set up a timeout for draining, because GracefulStop always blocks until all RPCs have finished.
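As a generic illustration of that pattern (plain grpc-go, not etcd's actual embed code): install a SIGTERM handler and let GracefulStop drain everything.

package main

import (
	"net"
	"os"
	"os/signal"
	"syscall"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	// ... register services here ...

	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, syscall.SIGTERM)
	go func() {
		<-sigc
		// GracefulStop blocks until every in-flight RPC, including
		// streaming handlers such as Snapshot or Watch, has returned,
		// so no separate drain timeout is needed on this path.
		srv.GracefulStop()
	}()

	_ = srv.Serve(lis)
}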
Since GracefulStop isn't applied in cmux mode, I filed a pull request https://github.com/etcd-io/etcd/pull/17790 to support graceful shutdown in cmux mode as well. Even if we run etcd without #17790, only in-flight Snapshot/Watch RPCs might panic, because we stop the applier channel and the scheduler before stopping the backend, so unary RPCs won't hit a closed backend.
https://github.com/etcd-io/etcd/blob/a7f5d4b4e4569bd316277ebf1347785e0467c64d/server/etcdserver/server.go#L816-L827
And https://github.com/etcd-io/etcd/pull/17757 is also an enhancement for the failpoint test. PTAL. Thanks.
Side note: I was using an older gRPC version (1.60.1), so my previous approach was to introduce a txRef object that maintains a reference count of all open transactions. The caller must call txPut to release the reference explicitly, and once the backend has been closed, ReadTx/ConcurrentReadTx/BatchTx should return a "closed" error. However, in our codebase both the mvcc and auth layers assume the backend is always valid; ReadTx/ConcurrentReadTx/BatchTx are essentially direct pointers, and UnsafeRange isn't designed to return an error. I tried updating all the interfaces to force them to return an error, and it passes all the e2e and unit tests, but the change is too large.
It would be better if the server layer could track active RPCs. So I revisited the gRPC code and found that WaitForHandlers can help us.
// Hypothetical txRef: a reference count of all open transactions, so the
// backend can wait for them before closing.
type txRef struct {
	sync.RWMutex
	wg sync.WaitGroup
}

// Each accessor would also return a TxRefReleaseFunc that the caller must
// invoke to release the reference, and an error once the backend is closed.
type Backend interface {
	ReadTx() (ReadTx, TxRefReleaseFunc, error)
	ConcurrentReadTx() (ReadTx, TxRefReleaseFunc, error)
	BatchTx() (BatchTx, TxRefReleaseFunc, error)
}

// Caller side:
tx, txPut, err := ReadTx() // or ConcurrentReadTx() / BatchTx()
if err != nil {
	return err
}
defer txPut()
..
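For completeness, a minimal sketch of the WaitForHandlers server option mentioned above (an experimental grpc-go option available since v1.61.0): it makes Stop wait for outstanding method handlers to return, instead of returning while handlers may still be running.

package main

import (
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer(
		// Experimental option: Stop will block until all in-flight
		// method handlers have exited, instead of only waiting for
		// connections to close.
		grpc.WaitForHandlers(true),
	)
	// ... register services here ...

	go func() { _ = srv.Serve(lis) }()

	// ... on shutdown:
	srv.Stop()
}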