[3.4] panic: runtime error: invalid memory address or nil pointer dereference
I have seen this error multiple times in the 3.4 pipeline:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xab2922]
goroutine 3500 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
/home/runner/go/pkg/mod/go.etcd.io/[email protected]/tx.go:101
go.etcd.io/etcd/mvcc/backend.(*concurrentReadTx).UnsafeRange(0xc000598f00, 0x15792dc, 0x3, 0x3, 0xc0000e34e8, 0x11, 0x12, 0x0, 0x0, 0x0, ...)
/home/runner/work/etcd/etcd/mvcc/backend/read_tx.go:195 +0x642
go.etcd.io/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc000598f30, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x67, 0x0, 0x66, ...)
/home/runner/work/etcd/etcd/mvcc/kvstore_txn.go:147 +0x293
go.etcd.io/etcd/mvcc.(*storeTxnRead).Range(0xc000598f30, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0xc00005e800, ...)
/home/runner/work/etcd/etcd/mvcc/kvstore_txn.go:51 +0xaf
go.etcd.io/etcd/mvcc.(*metricsTxnWrite).Range(0xc000598f60, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0x40fa00, ...)
/home/runner/work/etcd/etcd/mvcc/metrics_txn.go:37 +0xaf
go.etcd.io/etcd/mvcc.(*readView).Range(0xc0005d95c0, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0x0, ...)
/home/runner/work/etcd/etcd/mvcc/kv_view.go:39 +0x125
go.etcd.io/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc001bb20c0)
/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:405 +0x1ef
go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc001bb20c0)
/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:180 +0x2b
created by go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:179 +0x285
FAIL go.etcd.io/etcd/etcdserver/api/v2store 3.484s
FAIL
Refer to https://github.com/etcd-io/etcd/runs/7460637358?check_suite_focus=true
Please see my comment: https://github.com/etcd-io/etcd/issues/14256#issuecomment-1202083883
@ramses, any update on this issue?
I suggest diffing the code between release-3.4 and release-3.5 (or main). cc @SimFG to take a look as well.
If there is no progress by early next week, I may take a deep dive myself.
@ramses, any update on this issue?
@ahrtr, I was not investigating this issue; I was investigating the "split-brain" issue: https://github.com/etcd-io/etcd/issues/14143
Of these two, which one is currently more concerning to you?
@ahrtr OK, I will look into it.
I suggest resolving this issue first. @SimFG @ramses
Regarding https://github.com/etcd-io/etcd/issues/14143:
- I am NOT convinced that it's a real issue in etcd. Apart from the lazyfs issue, I even suspect it's an issue on the client side, including the test case or jetcd. Of course, it needs more investigation;
- We can start by qualifying our corruption detection mechanism (such as https://github.com/etcd-io/etcd/pull/14282) and our test process (such as Jepsen or our functional tests) first.
@ramses You can continue looking into your current issue so that we avoid doing duplicate work. cc @ahrtr; if there is any difficulty, I will keep you informed.
- I am NOT convinced that it's a real issue in etcd. [...] Of course, it needs more investigation;
I agree on both counts, @ahrtr. I'll keep digging to be sure which one it is.
@ramses You can continue looking into your current issue so that we avoid doing duplicate work. cc @ahrtr; if there is any difficulty, I will keep you informed.
Noted, @SimFG. By the way, nice to meet you (in GitHub-verse) 🙂
@ahrtr I didn't find a solution to this problem. At first I thought the cause was that the begin method was never called on tx, leaving tx nil, but after trying that out it turned out not to be the case. Then I guessed that the server had failed to start, so the tx initialization never ran, but I could not verify that either.
Thanks @SimFG for the feedback. Let me take care of this issue.
This issue can happen on both main and release-3.4, and I believe release-3.5 has this issue as well.
The root cause is that etcd stops immediately before (*readView) Range is called, so the tx is nil; accordingly, the subsequent range operation panics.
I managed to reproduce this issue intentionally by simulating the situation mentioned above. The code change (on release-3.4) is below:
wachao-a01:etcd wachao$ git diff
diff --git a/etcdmain/etcd.go b/etcdmain/etcd.go
index 73328a73d..ca76e3910 100644
--- a/etcdmain/etcd.go
+++ b/etcdmain/etcd.go
@@ -294,6 +294,7 @@ func startEtcdOrProxyV2() {
case <-stopped:
}
+ time.Sleep(10 * time.Second)
osutil.Exit(0)
}
diff --git a/etcdserver/api/v3rpc/watch.go b/etcdserver/api/v3rpc/watch.go
index 6a9204024..e7c27713a 100644
--- a/etcdserver/api/v3rpc/watch.go
+++ b/etcdserver/api/v3rpc/watch.go
@@ -402,6 +402,11 @@ func (sws *serverWatchStream) sendLoop() {
events[i] = &evs[i]
if needPrevKV && !isCreateEvent(evs[i]) {
opt := mvcc.RangeOptions{Rev: evs[i].Kv.ModRevision - 1}
+ var s *etcdserver.EtcdServer = sws.sg.(*etcdserver.EtcdServer)
+ go func() {
+ s.Stop()
+ }()
+ time.Sleep(1*time.Second)
r, err := sws.watchable.Range(evs[i].Kv.Key, nil, opt)
if err == nil && len(r.KVs) != 0 {
events[i].PrevKv = &(r.KVs[0])
diff --git a/mvcc/backend/read_tx.go b/mvcc/backend/read_tx.go
index 91fe72ec5..f088e5fa9 100644
--- a/mvcc/backend/read_tx.go
+++ b/mvcc/backend/read_tx.go
@@ -16,6 +16,7 @@ package backend
import (
"bytes"
+ "fmt"
"math"
"sync"
@@ -192,6 +193,9 @@ func (rt *concurrentReadTx) UnsafeRange(bucketName, key, endKey []byte, limit in
rt.txMu.RUnlock()
if !ok {
rt.txMu.Lock()
+ if rt.tx == nil {
+ fmt.Println("###### IN (*concurrentReadTx) UnsafeRange, tr.tx is nil!!!")
+ }
bucket = rt.tx.Bucket(bucketName)
rt.buckets[bn] = bucket
rt.txMu.Unlock()
Then the issue can always be reproduced by running the test TestV3WatchWithPrevKV:
wachao-a01:integration wachao$ go test -v -cpu 2 -run TestV3WatchWithPrevKV
=== RUN TestV3WatchWithPrevKV
###### IN (*concurrentReadTx) UnsafeRange, tr.tx is nil!!!
v3_watch_test.go:1212: timeout waiting for watch response
{"level":"warn","ts":"2022-08-02T10:38:23.889+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e005a80b-1a7c-4a2c-83ed-a52cd49a61c4/localhost:83867718980294516900","attempt":0,"error":"rpc error: code = Unknown desc = raft: stopped"}
v3_watch_test.go:1168: rpc error: code = Unknown desc = raft: stopped
v3_watch_test.go:1198: rpc error: code = Canceled desc = context canceled
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x174694d]
goroutine 216 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
/Users/wachao/go/gopath/pkg/mod/go.etcd.io/[email protected]/tx.go:101
go.etcd.io/etcd/mvcc/backend.(*concurrentReadTx).UnsafeRange(0xc000620300, 0x2316484, 0x3, 0x3, 0xc00011e5b8, 0x11, 0x12, 0x0, 0x0, 0x0, ...)
/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/backend/read_tx.go:199 +0x64d
go.etcd.io/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc000620330, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x3, 0x0, 0x2, ...)
/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kvstore_txn.go:147 +0x293
go.etcd.io/etcd/mvcc.(*storeTxnRead).Range(0xc000620330, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0xc000100000, ...)
/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kvstore_txn.go:51 +0xaf
go.etcd.io/etcd/mvcc.(*metricsTxnWrite).Range(0xc000620360, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, ...)
/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/metrics_txn.go:37 +0xaf
go.etcd.io/etcd/mvcc.(*readView).Range(0xc0002101f0, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0xc000351300, ...)
/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kv_view.go:39 +0x125
go.etcd.io/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc000176b40)
/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:410 +0x276
go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc000176b40)
/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:180 +0x2b
created by go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:179 +0x285
exit status 2
FAIL go.etcd.io/etcd/integration 31.025s
The likelihood of running into this issue in a production environment is really low, and it's even harder to reproduce in the pipeline after merging https://github.com/etcd-io/etcd/pull/14290. Since the etcdserver should have already been stopped when this issue occurs, the last transaction should have already been committed, so it should be safe. So I don't think it's a blocker for 3.4.20 any more.
I will think about a formal fix or refactoring in the future.
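For illustration only, here is a minimal, self-contained sketch of the defensive pattern such a fix might use: check the read transaction before dereferencing it, since server shutdown can leave it nil. This is not etcd code and not the actual patch; the type and field names are simplified stand-ins for concurrentReadTx and its tx field.

// Hypothetical, simplified sketch (not etcd code): guard a read transaction
// that may already have been torn down by server shutdown before using it,
// instead of panicking on a nil pointer dereference.
package main

import (
	"fmt"
	"sync"
)

type boltTx struct{ data map[string]string } // stand-in for the bbolt transaction

type concurrentReadTx struct {
	txMu sync.RWMutex
	tx   *boltTx // becomes nil once the backend has been closed
}

// unsafeRange looks up a key; ok is false if the backing tx is already gone.
func (rt *concurrentReadTx) unsafeRange(key string) (value string, ok bool) {
	rt.txMu.RLock()
	defer rt.txMu.RUnlock()
	if rt.tx == nil {
		// Server already stopped: report "no result" rather than panicking.
		return "", false
	}
	value, ok = rt.tx.data[key]
	return value, ok
}

func main() {
	rt := &concurrentReadTx{} // tx is nil, as it is after shutdown
	if _, ok := rt.unsafeRange("foo"); !ok {
		fmt.Println("backend already closed; returned empty result instead of panicking")
	}
}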
I think I've hit the same issue, starting etcd v3.4.20 via a static pod in a k8s cluster:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x75745b]
goroutine 265 [running]:
go.uber.org/zap.(*Logger).check(0x0, 0x1, 0x10a3ea5, 0x36, 0xc0075d9300)
/root/go/pkg/mod/go.uber.org/[email protected]/logger.go:264 +0x9b
go.uber.org/zap.(*Logger).Warn(0x0, 0x10a3ea5, 0x36, 0xc0075d9300, 0x2, 0x2)
/root/go/pkg/mod/go.uber.org/[email protected]/logger.go:194 +0x45
go.etcd.io/etcd/etcdserver.(*EtcdServer).requestCurrentIndex(0xc0000b0600, 0xc000dabc20, 0x5f382e3fe91c3e3, 0x0, 0x0, 0x0)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:805 +0x873
go.etcd.io/etcd/etcdserver.(*EtcdServer).linearizableReadLoop(0xc0000b0600)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:721 +0x2d6
go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1(0xc0000b0600, 0xc0007f5d60)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2698 +0x57
created by go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2696 +0x1b1
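Note that in the zap frames above the receiver is 0x0, i.e. a nil *zap.Logger, so this panic comes from dereferencing a nil logger rather than a nil bbolt transaction. A minimal standalone illustration of that failure mode (hypothetical example, not etcd code):

// Minimal illustration (not etcd code): calling a log method on a nil
// *zap.Logger panics when zap dereferences the logger's internal fields,
// which matches the nil-receiver (*Logger).check frame in the trace above.
package main

import "go.uber.org/zap"

func main() {
	var lg *zap.Logger // nil logger, as seen in the stack trace
	lg.Warn("this dereferences a nil logger and panics with SIGSEGV")
}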
One quick question: did you see this issue just once, or can you reproduce it easily? @JohnJAS
Hi @ahrtr, it is hard for me to answer that question right now because I have only checked one environment, where I found this issue inside the dead etcd container. But it is true that we kept hitting this etcd issue after we upgraded to v3.4.20. Our daily pipeline was blocked for several days until we downgraded the etcd version. I will try to manually upgrade etcd to see if I can reproduce this issue easily, and I will update my comments later.
@ahrtr I've reproduced this on another cluster and just hit the same issue.
I manually upgraded the etcd pods on each node, one by one, from v3.4.16 to v3.4.20. The etcd pods then crashed and exited as follows.
Error logs:
2022-08-29 06:29:02.702707 E | rafthttp: failed to read 88d5d8ebd854df5a on stream MsgApp v2 (unexpected EOF)
2022-08-29 06:29:02.702723 I | rafthttp: peer 88d5d8ebd854df5a became inactive (message send to peer failed)
2022-08-29 06:29:02.703504 W | rafthttp: lost the TCP streaming connection with peer 88d5d8ebd854df5a (stream Message reader)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x75745b]
goroutine 287 [running]:
go.uber.org/zap.(*Logger).check(0x0, 0x1, 0x10a3ea5, 0x36, 0xc003fb7480)
/root/go/pkg/mod/go.uber.org/[email protected]/logger.go:264 +0x9b
go.uber.org/zap.(*Logger).Warn(0x0, 0x10a3ea5, 0x36, 0xc003fb7480, 0x2, 0x2)
/root/go/pkg/mod/go.uber.org/[email protected]/logger.go:194 +0x45
go.etcd.io/etcd/etcdserver.(*EtcdServer).requestCurrentIndex(0xc000398600, 0xc005210f00, 0xd15682e845891c12, 0x0, 0x0, 0x0)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:805 +0x873
go.etcd.io/etcd/etcdserver.(*EtcdServer).linearizableReadLoop(0xc000398600)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:721 +0x2d6
go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1(0xc000398600, 0xc00011ec20)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2698 +0x57
created by go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2696 +0x1b1
To add something I have observed: when I run the helm upgrade command, which calls the kube-apiserver API internally (and kube-apiserver depends on etcd in the backend), it always triggers the etcd crash.
So I think this issue is not related to the environment, and it is easy to reproduce.
However, etcd v3.4.16 works well, but we still have some CVEs on that version, and that's why we want to upgrade etcd.
I will test etcd v3.4.19, which includes the fixes for those CVEs, tomorrow.
I will update later.
We hit the same problem with etcd version 3.4.20 and it is very easy to reproduce in our environment.
I can confirm that etcd v3.4.19 doesn't have this issue, and we are happy to see that the two CVEs are fixed in that version. It looks like some fix in v3.4.20 introduced the current issue.
However, the error message in my case is not 100% the same as yours, though they are both nil pointer issues. I won't create another issue in case they have the same root cause.
I hope my research can help narrow down the root cause.
Thanks @JohnJAS and @rtheis for the feedback. It seems to be a different issue from this one; I just raised https://github.com/etcd-io/etcd/issues/14402 to track it. Please feel free to deliver a PR for it if you are interested.