[Bug]: CN crashes with panic: runtime error: invalid memory address or nil pointer dereference during stability test in distributed mode
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch Name
1.2-dev
Commit ID
b5c2eaaf9209dd5209e749219a89ea3ec0503f74
Other Environment Information
- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
2*PROXY: 3C 6G
- OS type:
- Others:
Actual Behavior
During the stability test in distributed mode, a CN crashed with a nil pointer dereference panic:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x482f3c]

goroutine 1476 [running]:
github.com/matrixorigin/matrixone/pkg/pb/status.(*Session).MarshalToSizedBuffer(0xc394e66840, {0xc52fd2202e, 0x25ad, 0xf2d2})
	/go/src/github.com/matrixorigin/matrixone/pkg/pb/status/status.pb.go:510 +0x1955
github.com/matrixorigin/matrixone/pkg/pb/query.(*ShowProcessListResponse).MarshalToSizedBuffer(0xc323f64930, {0xc52fd2202e, 0x72ec, 0xf2d2})
	/go/src/github.com/matrixorigin/matrixone/pkg/pb/query/query.pb.go:4214 +0x165
github.com/matrixorigin/matrixone/pkg/pb/query.(*Response).MarshalToSizedBuffer(0xc61421ec30, {0xc52fd2202e, 0x72ec, 0xf2d2})
	/go/src/github.com/matrixorigin/matrixone/pkg/pb/query/query.pb.go:4508 +0x2365
github.com/matrixorigin/matrixone/pkg/pb/query.(*Response).MarshalTo(0xc61421ec30, {0xc52fd2202e, 0x72ec, 0xf2d2})
	/go/src/github.com/matrixorigin/matrixone/pkg/pb/query/query.pb.go:4240 +0xae
github.com/matrixorigin/matrixone/pkg/common/morpc.(*baseCodec).writeBody(0xc005f3cd40, 0xc013ecb980, {0x6e7e948, 0xc61421ec30})
	/go/src/github.com/matrixorigin/matrixone/pkg/common/morpc/codec.go:425 +0xb02
github.com/matrixorigin/matrixone/pkg/common/morpc.(*baseCodec).Encode(0xc005f3cd40, {0x6333a20, 0xc394be94a0}, 0xc013ecb980, {0x6e1c620, 0xc013508d88})
	/go/src/github.com/matrixorigin/matrixone/pkg/common/morpc/codec.go:274 +0xa45
github.com/matrixorigin/matrixone/pkg/common/morpc.(*messageCodec).Encode(0xc002e01200, {0x6333a20, 0xc394be94a0}, 0xc013ecb980, {0x6e1c620, 0xc013508d88})
	/go/src/github.com/matrixorigin/matrixone/pkg/common/morpc/codec.go:136 +0x76
github.com/fagongzi/goetty/v2.(*baseIO).Write(0xc013ed4a00, {0x6333a20, 0xc394be94a0}, {0x0, 0x0})
	/go/pkg/mod/github.com/matrixorigin/goetty/[email protected]/session.go:448 +0x102
github.com/matrixorigin/matrixone/pkg/common/morpc.(*server).startWriteLoop.func1({0x6e5fb58, 0xc0107e8780})
	/go/src/github.com/matrixorigin/matrixone/pkg/common/morpc/server.go:381 +0xedb
github.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1()
	/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:277 +0xdd
created by github.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask in goroutine 1447
	/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:272 +0x118
mo-log: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22gXl%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-b5c2eaa-20240511004320%5C%22%7D%20%7C%3D%20%60panic%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221715389822883%22,%22to%22:%221715390599512%22%7D%7D%7D&schemaVersion=1&orgId=1
Expected Behavior
No response
Steps to Reproduce
1. Run a mo cluster with the config in this issue.
2. Run tpch 10G loop test processes in one independent tenant.
3. Run tpcc long-running test processes with 10 warehouses and 10 terminals in one independent tenant, prepare mode.
4. Run sysbench mixed-case (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, non-prepare mode.
5. Run another sysbench mixed-case (insert/delete/update/select) long-running test process with 75 terminals in one independent tenant, non-prepare mode.
Additional information
No response
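Delay to 1.2.1; this issue is very strange and occurs at a very low frequency.
This panic is unlikely to happen. Lower the priority for now and handle it later if it shows up again.
No root cause found so far.
I will open a PR today that adds some logging to see which pointer is nil.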
update on 5.28 job link: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9252001069/job/25448704513 mo-log: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22pBr%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-bc98226-20240527165506%5C%22%7D%20%7C%3D%20%60FATAL%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221716839404101%22,%22to%22:%221716842575010%22%7D%7D%7D&schemaVersion=1&orgId=1
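I wrote code to simulate the nil pointer problem: concurrent reads and writes of a string field in a struct can cause a nil pointer dereference, because a string update in Go is not atomic. I could not reproduce it with []byte, so the current guess is that the string case caused the nil pointer, but I do not see a problem in the code logic. The code is as follows: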
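package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/matrixorigin/matrixone/pkg/pb/status"
)

// Deliberately racy: one goroutine overwrites the Session (including its
// string field DB) while another reads DB with no synchronization.
func main() {
	a := &status.Session{
		DB: "db1",
	}
	b := make([]byte, 10)
	var wg sync.WaitGroup
	ctx, cancel := context.WithTimeout(context.Background(), time.Second*20)
	defer cancel()
	wg.Add(2)
	// Writer: keeps replacing the whole struct, alternating DB between "db0" and "db1".
	go func(ctx context.Context, m *status.Session) {
		defer wg.Done()
		i := 1
		for {
			i = 1 - i
			select {
			case <-ctx.Done():
				return
			default:
				*m = status.Session{DB: fmt.Sprintf("db%d", i)}
			}
		}
	}(ctx, a)
	// Reader: copies m.DB while the writer is updating it; a torn read of the
	// string header (pointer and length from different writes) can crash here.
	go func(ctx context.Context, b []byte, m *status.Session) {
		defer wg.Done()
		for {
			select {
			case <-ctx.Done():
				return
			default:
				fmt.Printf("session: %p\n", m)
				copy(b, m.DB)
			}
		}
	}(ctx, b, a)
	wg.Wait()
}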
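For contrast with the racy reproduction above, here is a minimal sketch of the same writer/reader pattern with the string field guarded by a lock, so the reader always sees a consistent pointer/length pair. The guardedSession type, its methods, and runDemo are hypothetical names used only for illustration; they are not part of the mo codebase, and this is not a claim about where the actual race was or how it was fixed.

```go
package main

import (
	"fmt"
	"sync"
)

// guardedSession is a hypothetical stand-in for a struct whose string field
// is written and read from different goroutines.
type guardedSession struct {
	mu sync.RWMutex
	db string
}

// SetDB replaces the string under the write lock, so a reader cannot observe
// a string header whose pointer and length come from different writes.
func (s *guardedSession) SetDB(db string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.db = db
}

// DB returns the current string under the read lock.
func (s *guardedSession) DB() string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.db
}

// runDemo runs one writer that alternates the value between "db0" and "db1"
// and one reader that copies it into a buffer, then returns the final value.
func runDemo(iterations int) string {
	s := &guardedSession{db: "db1"}
	buf := make([]byte, 10) // touched only by the reader goroutine
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < iterations; i++ {
			s.SetDB(fmt.Sprintf("db%d", i%2))
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < iterations; i++ {
			copy(buf, s.DB())
		}
	}()
	wg.Wait()
	return s.DB()
}

func main() {
	fmt.Println(runDemo(100000))
}
```

Running both versions with go run -race should flag the unsynchronized one and stay quiet on this one, which is a cheap way to look for the same pattern in the code path that builds the ShowProcessListResponse.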
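No progress.
No progress.
It occurs infrequently and there are no leads so far; DELAY to the next release.
No ideas.
No progress.
Although I reproduced a similar panic with code, I could not find anything in the mo code that would cause it. A very strange panic.
It has not occurred recently; lower the priority and keep observing.
The issue has not recurred, so it has presumably been fixed.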
fixed