
[Bug]: cn crash by panic: runtime error: invalid memory address or nil pointer dereference during stability test on distributed mode

Open aressu1985 opened this issue 1 year ago • 2 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

1.2-dev

Commit ID

b5c2eaaf9209dd5209e749219a89ea3ec0503f74

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
2*PROXY: 3C 6G
- OS type:
- Others:

Actual Behavior

During a stability test in distributed mode, a CN crashed with the following panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x482f3c]

goroutine 1476 [running]:
github.com/matrixorigin/matrixone/pkg/pb/status.(*Session).MarshalToSizedBuffer(0xc394e66840, {0xc52fd2202e, 0x25ad, 0xf2d2})
	/go/src/github.com/matrixorigin/matrixone/pkg/pb/status/status.pb.go:510 +0x1955
github.com/matrixorigin/matrixone/pkg/pb/query.(*ShowProcessListResponse).MarshalToSizedBuffer(0xc323f64930, {0xc52fd2202e, 0x72ec, 0xf2d2})
	/go/src/github.com/matrixorigin/matrixone/pkg/pb/query/query.pb.go:4214 +0x165
github.com/matrixorigin/matrixone/pkg/pb/query.(*Response).MarshalToSizedBuffer(0xc61421ec30, {0xc52fd2202e, 0x72ec, 0xf2d2})
	/go/src/github.com/matrixorigin/matrixone/pkg/pb/query/query.pb.go:4508 +0x2365
github.com/matrixorigin/matrixone/pkg/pb/query.(*Response).MarshalTo(0xc61421ec30, {0xc52fd2202e, 0x72ec, 0xf2d2})
	/go/src/github.com/matrixorigin/matrixone/pkg/pb/query/query.pb.go:4240 +0xae
github.com/matrixorigin/matrixone/pkg/common/morpc.(*baseCodec).writeBody(0xc005f3cd40, 0xc013ecb980, {0x6e7e948, 0xc61421ec30})
	/go/src/github.com/matrixorigin/matrixone/pkg/common/morpc/codec.go:425 +0xb02
github.com/matrixorigin/matrixone/pkg/common/morpc.(*baseCodec).Encode(0xc005f3cd40, {0x6333a20, 0xc394be94a0}, 0xc013ecb980, {0x6e1c620, 0xc013508d88})
	/go/src/github.com/matrixorigin/matrixone/pkg/common/morpc/codec.go:274 +0xa45
github.com/matrixorigin/matrixone/pkg/common/morpc.(*messageCodec).Encode(0xc002e01200, {0x6333a20, 0xc394be94a0}, 0xc013ecb980, {0x6e1c620, 0xc013508d88})
	/go/src/github.com/matrixorigin/matrixone/pkg/common/morpc/codec.go:136 +0x76
github.com/fagongzi/goetty/v2.(*baseIO).Write(0xc013ed4a00, {0x6333a20, 0xc394be94a0}, {0x0, 0x0})
	/go/pkg/mod/github.com/matrixorigin/goetty/[email protected]/session.go:448 +0x102
github.com/matrixorigin/matrixone/pkg/common/morpc.(*server).startWriteLoop.func1({0x6e5fb58, 0xc0107e8780})
	/go/src/github.com/matrixorigin/matrixone/pkg/common/morpc/server.go:381 +0xedb
github.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1()
	/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:277 +0xdd
created by github.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask in goroutine 1447
	/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:272 +0x118

mo-log: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22gXl%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-b5c2eaa-20240511004320%5C%22%7D%20%7C%3D%20%60panic%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221715389822883%22,%22to%22:%221715390599512%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

1. Run a mo cluster with the config given in this issue.
2. Run tpch 10G loop test processes in one independent tenant.
3. Run tpcc long-running test processes with 10 warehouses and 10 terminals in one independent tenant, prepare mode.
4. Run sysbench mixed-case (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, non-prepare mode.
5. Run another sysbench mixed-case (insert/delete/update/select) long-running test process with 75 terminals in one independent tenant, non-prepare mode.

Additional information

No response

aressu1985 avatar May 11 '24 02:05 aressu1985

Delay to 1.2.1. This issue is very strange and occurs very infrequently.

aressu1985 avatar May 11 '24 08:05 aressu1985

This panic should be very unlikely to happen. Let's lower the priority for now and deal with it if it shows up again.

volgariver6 avatar May 11 '24 08:05 volgariver6

No root cause found so far.

volgariver6 avatar May 20 '24 11:05 volgariver6

I'll open a PR today that adds some logging to see which pointer is nil.

volgariver6 avatar May 23 '24 11:05 volgariver6

Update on 5.28:
job link: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9252001069/job/25448704513
mo-log: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22pBr%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-bc98226-20240527165506%5C%22%7D%20%7C%3D%20%60FATAL%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221716839404101%22,%22to%22:%221716842575010%22%7D%7D%7D&schemaVersion=1&orgId=1

aressu1985 avatar May 28 '24 03:05 aressu1985

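I wrote some code to simulate the nil pointer problem. Concurrently reading and writing a string field of a struct can cause a nil pointer dereference, because updating a string in Go is not atomic. I could not reproduce it with a []byte, though, so my current guess is that the string case is what caused the panic. However, I cannot see anything wrong in the code logic. The code is as follows: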

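package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/matrixorigin/matrixone/pkg/pb/status"
)

// Deliberately racy reproduction: one goroutine overwrites the Session
// while another reads its DB string, with no synchronization.
func main() {
	a := &status.Session{
		DB: "db1",
	}
	b := make([]byte, 10)

	var wg sync.WaitGroup

	ctx, cancel := context.WithTimeout(context.Background(), time.Second*20)
	defer cancel()

	wg.Add(2)
	// Writer: keeps replacing the whole struct, alternating DB between "db0" and "db1".
	go func(ctx context.Context, m *status.Session) {
		defer wg.Done()
		i := 1
		for {
			i = 1 - i
			select {
			case <-ctx.Done():
				return
			default:
				*m = status.Session{DB: fmt.Sprintf("db%d", i)}
			}
		}
	}(ctx, a)

	// Reader: copies m.DB while the writer is updating it; a torn string
	// header (pointer and length out of sync) can crash here.
	go func(ctx context.Context, b []byte, m *status.Session) {
		defer wg.Done()
		for {
			select {
			case <-ctx.Done():
				return
			default:
				fmt.Printf("session: %p\n", m)
				copy(b, m.DB)
			}
		}
	}(ctx, b, a)

	wg.Wait()
}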

volgariver6 avatar May 29 '24 13:05 volgariver6

No progress.

volgariver6 avatar Jun 03 '24 10:06 volgariver6

No progress.

volgariver6 avatar Jun 06 '24 11:06 volgariver6

It occurs infrequently and there are no leads at the moment. Delaying the fix to the next release.

aressu1985 avatar Jun 07 '24 10:06 aressu1985

No ideas yet.

volgariver6 avatar Jun 12 '24 12:06 volgariver6

No progress.

volgariver6 avatar Jun 17 '24 12:06 volgariver6

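Although I reproduced a similar panic with standalone code, I could not find anything in the mo code that would cause it. It is a very strange panic.

It has not occurred recently, so let's lower the priority and keep observing.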

volgariver6 avatar Jun 20 '24 02:06 volgariver6

This issue has not occurred again, so it has probably been fixed.

volgariver6 avatar Oct 09 '24 04:10 volgariver6

fixed

aressu1985 avatar Oct 23 '24 04:10 aressu1985