[3.4] panic: runtime error: invalid memory address or nil pointer dereference
I have seen this error multiple times in the 3.4 pipeline:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xab2922]
goroutine 3500 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
/home/runner/go/pkg/mod/go.etcd.io/[email protected]/tx.go:101
go.etcd.io/etcd/mvcc/backend.(*concurrentReadTx).UnsafeRange(0xc000598f00, 0x15792dc, 0x3, 0x3, 0xc0000e34e8, 0x11, 0x12, 0x0, 0x0, 0x0, ...)
/home/runner/work/etcd/etcd/mvcc/backend/read_tx.go:195 +0x642
go.etcd.io/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc000598f30, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x67, 0x0, 0x66, ...)
/home/runner/work/etcd/etcd/mvcc/kvstore_txn.go:147 +0x293
go.etcd.io/etcd/mvcc.(*storeTxnRead).Range(0xc000598f30, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0xc00005e800, ...)
/home/runner/work/etcd/etcd/mvcc/kvstore_txn.go:51 +0xaf
go.etcd.io/etcd/mvcc.(*metricsTxnWrite).Range(0xc000598f60, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0x40fa00, ...)
/home/runner/work/etcd/etcd/mvcc/metrics_txn.go:37 +0xaf
go.etcd.io/etcd/mvcc.(*readView).Range(0xc0005d95c0, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0x0, ...)
/home/runner/work/etcd/etcd/mvcc/kv_view.go:39 +0x125
go.etcd.io/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc001bb20c0)
/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:405 +0x1ef
go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc001bb20c0)
/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:180 +0x2b
created by go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:179 +0x285
FAIL go.etcd.io/etcd/etcdserver/api/v2store 3.484s
FAIL
Refer to https://github.com/etcd-io/etcd/runs/7460637358?check_suite_focus=true
Please see my comment: https://github.com/etcd-io/etcd/issues/14256#issuecomment-1202083883
@ramses, any update on this issue?
I suggest diffing the code between release-3.4 and release-3.5 (or main). cc @SimFG to take a look as well.
If there is no progress by early next week, I may take a deep dive myself.
@ramses, any update on this issue?
@ahrtr, I was not investigating this issue; I was investigating the "split-brain" issue: https://github.com/etcd-io/etcd/issues/14143
Of these two, which one is currently more concerning to you?
@ahrtr OK, I will look into it.
I suggest resolving this issue first. @SimFG @ramses
Regarding https://github.com/etcd-io/etcd/issues/14143:
- I am NOT convinced that it's a real issue in etcd. Apart from the lazyfs issue, I even suspect it's an issue on the client side, including the test case or jetcd. Of course, it needs more investigation;
- We can start by qualifying our corruption detection mechanism (such as https://github.com/etcd-io/etcd/pull/14282) and our test process (such as Jepsen or our functional tests) first.
@ramses You can continue looking into your current issue so that we avoid doing duplicate work. cc @ahrtr; if there is any difficulty, I will keep you informed.
- I am NOT convinced that it's a real issue in etcd. [...] Of course, it needs more investigation;
I agree on both counts, @ahrtr. I'll keep digging to be sure which one it is.
@ramses You can continue looking into your current issue so that we avoid doing duplicate work. cc @ahrtr; if there is any difficulty, I will keep you informed.
Noted, @SimFG. By the way, nice to meet you (in GitHub-verse) 🙂
@ahrtr I didn't find a solution to this problem. At first I thought the cause was that the begin method was never called on tx, leaving tx nil, but after trying that out it turned out not to be the case. Then I guessed that the server had failed to start, so the tx initialization never ran, but I could not verify that either.
Thanks @SimFG for the feedback. Let me take care of this issue.
This issue can happen on both main and release-3.4, and I believe release-3.5 has this issue as well.
The root cause is that etcd stops immediately before (*readView) Range is called, so the tx is nil; accordingly, the subsequent range operation panics.
I managed to reproduce this issue intentionally by simulating the situation mentioned above. The code change (on release-3.4) is below:
wachao-a01:etcd wachao$ git diff
diff --git a/etcdmain/etcd.go b/etcdmain/etcd.go
index 73328a73d..ca76e3910 100644
--- a/etcdmain/etcd.go
+++ b/etcdmain/etcd.go
@@ -294,6 +294,7 @@ func startEtcdOrProxyV2() {
case <-stopped:
}
+ time.Sleep(10 * time.Second)
osutil.Exit(0)
}
diff --git a/etcdserver/api/v3rpc/watch.go b/etcdserver/api/v3rpc/watch.go
index 6a9204024..e7c27713a 100644
--- a/etcdserver/api/v3rpc/watch.go
+++ b/etcdserver/api/v3rpc/watch.go
@@ -402,6 +402,11 @@ func (sws *serverWatchStream) sendLoop() {
events[i] = &evs[i]
if needPrevKV && !isCreateEvent(evs[i]) {
opt := mvcc.RangeOptions{Rev: evs[i].Kv.ModRevision - 1}
+ var s *etcdserver.EtcdServer = sws.sg.(*etcdserver.EtcdServer)
+ go func() {
+ s.Stop()
+ }()
+ time.Sleep(1*time.Second)
r, err := sws.watchable.Range(evs[i].Kv.Key, nil, opt)
if err == nil && len(r.KVs) != 0 {
events[i].PrevKv = &(r.KVs[0])
diff --git a/mvcc/backend/read_tx.go b/mvcc/backend/read_tx.go
index 91fe72ec5..f088e5fa9 100644
--- a/mvcc/backend/read_tx.go
+++ b/mvcc/backend/read_tx.go
@@ -16,6 +16,7 @@ package backend
import (
"bytes"
+ "fmt"
"math"
"sync"
@@ -192,6 +193,9 @@ func (rt *concurrentReadTx) UnsafeRange(bucketName, key, endKey []byte, limit in
rt.txMu.RUnlock()
if !ok {
rt.txMu.Lock()
+ if rt.tx == nil {
+ fmt.Println("###### IN (*concurrentReadTx) UnsafeRange, tr.tx is nil!!!")
+ }
bucket = rt.tx.Bucket(bucketName)
rt.buckets[bn] = bucket
rt.txMu.Unlock()
Then the issue can always be reproduced by running the test TestV3WatchWithPrevKV:
wachao-a01:integration wachao$ go test -v -cpu 2 -run TestV3WatchWithPrevKV
=== RUN TestV3WatchWithPrevKV
###### IN (*concurrentReadTx) UnsafeRange, tr.tx is nil!!!
v3_watch_test.go:1212: timeout waiting for watch response
{"level":"warn","ts":"2022-08-02T10:38:23.889+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e005a80b-1a7c-4a2c-83ed-a52cd49a61c4/localhost:83867718980294516900","attempt":0,"error":"rpc error: code = Unknown desc = raft: stopped"}
v3_watch_test.go:1168: rpc error: code = Unknown desc = raft: stopped
v3_watch_test.go:1198: rpc error: code = Canceled desc = context canceled
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x174694d]
goroutine 216 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
/Users/wachao/go/gopath/pkg/mod/go.etcd.io/[email protected]/tx.go:101
go.etcd.io/etcd/mvcc/backend.(*concurrentReadTx).UnsafeRange(0xc000620300, 0x2316484, 0x3, 0x3, 0xc00011e5b8, 0x11, 0x12, 0x0, 0x0, 0x0, ...)
/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/backend/read_tx.go:199 +0x64d
go.etcd.io/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc000620330, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x3, 0x0, 0x2, ...)
/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kvstore_txn.go:147 +0x293
go.etcd.io/etcd/mvcc.(*storeTxnRead).Range(0xc000620330, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0xc000100000, ...)
/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kvstore_txn.go:51 +0xaf
go.etcd.io/etcd/mvcc.(*metricsTxnWrite).Range(0xc000620360, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, ...)
/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/metrics_txn.go:37 +0xaf
go.etcd.io/etcd/mvcc.(*readView).Range(0xc0002101f0, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0xc000351300, ...)
/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kv_view.go:39 +0x125
go.etcd.io/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc000176b40)
/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:410 +0x276
go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc000176b40)
/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:180 +0x2b
created by go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:179 +0x285
exit status 2
FAIL go.etcd.io/etcd/integration 31.025s
The likelihood of running into this issue in a production environment is really low, and it's even harder to reproduce in the pipeline after merging https://github.com/etcd-io/etcd/pull/14290. Since the etcdserver should have already been stopped when this issue occurs, the last transaction should have already been committed, so it should be safe. So I don't think it's a blocker for 3.4.20 any more.
I will think about a formal fix or refactoring in the future.
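For illustration only, here is a minimal, self-contained sketch of the defensive pattern such a fix might use: check the read transaction before dereferencing it, since server shutdown can leave it nil. This is not etcd code and not the actual patch; the type and field names are simplified stand-ins for concurrentReadTx and its tx field.

// Hypothetical, simplified sketch (not etcd code): guard a read transaction
// that may already have been torn down by server shutdown before using it,
// instead of panicking on a nil pointer dereference.
package main

import (
	"fmt"
	"sync"
)

type boltTx struct{ data map[string]string } // stand-in for the bbolt transaction

type concurrentReadTx struct {
	txMu sync.RWMutex
	tx   *boltTx // becomes nil once the backend has been closed
}

// unsafeRange looks up a key; ok is false if the backing tx is already gone.
func (rt *concurrentReadTx) unsafeRange(key string) (value string, ok bool) {
	rt.txMu.RLock()
	defer rt.txMu.RUnlock()
	if rt.tx == nil {
		// Server already stopped: report "no result" rather than panicking.
		return "", false
	}
	value, ok = rt.tx.data[key]
	return value, ok
}

func main() {
	rt := &concurrentReadTx{} // tx is nil, as it is after shutdown
	if _, ok := rt.unsafeRange("foo"); !ok {
		fmt.Println("backend already closed; returned empty result instead of panicking")
	}
}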
I think I've hit the same issue, starting etcd v3.4.20 via a static pod in a k8s cluster:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x75745b]
goroutine 265 [running]:
go.uber.org/zap.(*Logger).check(0x0, 0x1, 0x10a3ea5, 0x36, 0xc0075d9300)
/root/go/pkg/mod/go.uber.org/[email protected]/logger.go:264 +0x9b
go.uber.org/zap.(*Logger).Warn(0x0, 0x10a3ea5, 0x36, 0xc0075d9300, 0x2, 0x2)
/root/go/pkg/mod/go.uber.org/[email protected]/logger.go:194 +0x45
go.etcd.io/etcd/etcdserver.(*EtcdServer).requestCurrentIndex(0xc0000b0600, 0xc000dabc20, 0x5f382e3fe91c3e3, 0x0, 0x0, 0x0)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:805 +0x873
go.etcd.io/etcd/etcdserver.(*EtcdServer).linearizableReadLoop(0xc0000b0600)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:721 +0x2d6
go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1(0xc0000b0600, 0xc0007f5d60)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2698 +0x57
created by go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2696 +0x1b1
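Note that in the zap frames above the receiver is 0x0, i.e. a nil *zap.Logger, so this panic comes from dereferencing a nil logger rather than a nil bbolt transaction. A minimal standalone illustration of that failure mode (hypothetical example, not etcd code):

// Minimal illustration (not etcd code): calling a log method on a nil
// *zap.Logger panics when zap dereferences the logger's internal fields,
// which matches the nil-receiver (*Logger).check frame in the trace above.
package main

import "go.uber.org/zap"

func main() {
	var lg *zap.Logger // nil logger, as seen in the stack trace
	lg.Warn("this dereferences a nil logger and panics with SIGSEGV")
}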
One quick question: did you see this issue just once, or can you reproduce it easily? @JohnJAS
Hi @ahrtr, it is hard for me to answer that question right now because I have only checked one environment, where I found this issue inside the dead etcd container. But it is true that we kept hitting this etcd issue after we upgraded to v3.4.20. Our daily pipeline was blocked for several days until we downgraded the etcd version. I will try to manually upgrade etcd to see if I can reproduce this issue easily, and I will update my comments later.
@ahrtr I've reproduced this on another cluster and just hit the same issue.
I manually upgraded the etcd pods on each node, one by one, from v3.4.16 to v3.4.20. The etcd pods then crashed and exited as follows.
Error logs:
2022-08-29 06:29:02.702707 E | rafthttp: failed to read 88d5d8ebd854df5a on stream MsgApp v2 (unexpected EOF)
2022-08-29 06:29:02.702723 I | rafthttp: peer 88d5d8ebd854df5a became inactive (message send to peer failed)
2022-08-29 06:29:02.703504 W | rafthttp: lost the TCP streaming connection with peer 88d5d8ebd854df5a (stream Message reader)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x75745b]
goroutine 287 [running]:
go.uber.org/zap.(*Logger).check(0x0, 0x1, 0x10a3ea5, 0x36, 0xc003fb7480)
/root/go/pkg/mod/go.uber.org/[email protected]/logger.go:264 +0x9b
go.uber.org/zap.(*Logger).Warn(0x0, 0x10a3ea5, 0x36, 0xc003fb7480, 0x2, 0x2)
/root/go/pkg/mod/go.uber.org/[email protected]/logger.go:194 +0x45
go.etcd.io/etcd/etcdserver.(*EtcdServer).requestCurrentIndex(0xc000398600, 0xc005210f00, 0xd15682e845891c12, 0x0, 0x0, 0x0)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:805 +0x873
go.etcd.io/etcd/etcdserver.(*EtcdServer).linearizableReadLoop(0xc000398600)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:721 +0x2d6
go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1(0xc000398600, 0xc00011ec20)
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2698 +0x57
created by go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach
/tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2696 +0x1b1
To add something I have observed: when I run the helm upgrade command, which calls the kube-apiserver API internally (and kube-apiserver depends on etcd in the backend), it always triggers the etcd crash.
So I think this issue is not related to the environment, and it is easy to reproduce.
However, etcd v3.4.16 works well, but we still have some CVEs on that version, and that's why we want to upgrade etcd.
I will test etcd v3.4.19, which includes the fixes for those CVEs, tomorrow.
I will update later.
We hit the same problem with etcd version 3.4.20 and it is very easy to reproduce in our environment.
I can confirm that etcd v3.4.19 doesn't have this issue, and we are happy to see that the two CVEs are fixed in that version. It looks like some fix in v3.4.20 introduced the current issue.
However, the error message in my case is not 100% the same as yours, though they are both nil pointer issues. I won't create another issue in case they have the same root cause.
I hope my research can help narrow down the root cause.
Thanks @JohnJAS and @rtheis for the feedback. It seems to be a different issue from this one; I just raised https://github.com/etcd-io/etcd/issues/14402 to track it. Please feel free to deliver a PR for it if you are interested.