
[Bug]: [date 6.27]tke regression: sysbench1000w random_range reported panic runtime error

Open heni02 opened this issue 1 year ago • 4 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

main

Commit ID

46953a016

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9698558168/job/26788773660 (screenshot: 企业微信截图_045f81cc-763d-42f0-8c7f-74e20ddb9af6)

mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22hhC%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-main-nightly-46953a016-20240627%5C%22%7D%20%7C%3D%20%60panic%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221719498487060%22,%22to%22:%221719541687060%22%7D%7D%7D&schemaVersion=1&orgId=1

{"level":"ERROR","time":"2024/06/28 01:55:58.991991 +0000","caller":"compile/scope.go:171","msg":"error: internal error: panic runtime error: invalid memory address or nil pointer dereference: \nruntime.panicmem\n\t/usr/local/go/src/runtime/panic.go:261\nruntime.sigpanic\n\t/usr/local/go/src/runtime/signal_unix.go:881\ngithub.com/matrixorigin/matrixone/pkg/container/vector.(*Vector).Length\n\t/go/src/github.com/matrixorigin/matrixone/pkg/container/vector/vector.go:143\ngithub.com/matrixorigin/matrixone/pkg/vm/process.(*Process).GetPrepareParamsAt\n\t/go/src/github.com/matrixorigin/matrixone/pkg/vm/process/types.go:449\ngithub.com/matrixorigin/matrixone/pkg/sql/colexec.(*ParamExpressionExecutor).Eval\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/colexec/evalExpression.go:341\ngithub.com/matrixorigin/matrixone/pkg/sql/colexec.(*FunctionExpressionExecutor).Eval\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/colexec/evalExpression.go:572\ngithub.com/matrixorigin/matrixone/pkg/sql/colexec.EvalExpressionOnce\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/colexec/evalExpression.go:242\ngithub.com/matrixorigin/matrixone/pkg/sql/plan.ConstantFold\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/plan/utils.go:1179\ngithub.com/matrixorigin/matrixone/pkg/sql/plan.ConstantFold\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/plan/utils.go:1165\ngithub.com/matrixorigin/matrixone/pkg/sql/plan.ConstantFold\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/plan/utils.go:1165\ngithub.com/matrixorigin/matrixone/pkg/sql/plan.ConstantFold\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/plan/utils.go:1165\ngithub.com/matrixorigin/matrixone/pkg/sql/colexec/filter.(*Argument).Prepare\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/colexec/filter/filter.go:46\ngithub.com/matrixorigin/matrixone/pkg/vm.Prepare\n\t/go/src/github.com/matrixorigin/matrixone/pkg/vm/vm.go:46\ngithub.com/matrixorigin/matrixone/pkg/vm/pipeline.(*Pipeline).Run\n\t/go/src/github.com/matrixorigin/matrixone/pkg/vm/pipeline/pipeline.go:76\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*Scope).Run\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/scope.go:195\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*Scope).MergeRun.func1\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/scope.go:258\ngithub.com/panjf2000/ants/v2.(*goWorker).run.func1\n\t/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695","span":{"trace_id":"0b6d75bb-316a-2e9d-62f6-8140e6a17f5c","span_id":"c68f61e7ad5081b5","kind":"remote"}}

Expected Behavior

No response

Steps to Reproduce

tke regression sysbench1000w random_range 100threads
sysbench --mysql-host=172.16.14.224 --mysql-port=6001 --mysql-user=dump --mysql-password=111  select_random_ranges.lua --mysql-db=sysbench_db --tables=10 --table_size=10000000 --threads=1000 --time=300 --report-interval=10 --range_selects=off --point_selects=1 prepare
sysbench --mysql-host=172.16.14.224 --mysql-port=6001 --mysql-user=dump --mysql-password=111  select_random_ranges.lua --mysql-db=sysbench_db --tables=10 --table_size=10000000 --threads=1000 --time=300 --report-interval=10 --range_selects=off --point_selects=1 run

Additional information

No response

heni02 avatar Jun 28 '24 02:06 heni02

Bisect result: the error was introduced by #17184. cc @aunjgr (screenshot: 企业微信截图_d124db38-2411-476e-9f32-c90c157f99ef)

Bisect job for #17184: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9730396169/job/26853674374

heni02 avatar Jun 30 '24 14:06 heni02

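This bug is a panic caused by process.PrepareParams being a nil pointer while executing a query that joins several prepared '?' parameter expressions with OR. So far we have found that joining only 3 expressions with OR does not panic; with 4, the replacePlan() function in computation_wrapper.go is incorrectly skipped. The panic reproduces reliably with sysbench at 10 million rows (1000W), single table, single thread. At 1 million rows (100W) the plan changes and the panic cannot be reproduced.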
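The failure mode described above can be illustrated with a small, self-contained Go sketch. It is not MatrixOne code: `paramVector` and `safeLength` are hypothetical names, chosen only to mirror the shape of the stack trace, where `(*Vector).Length` is called on a nil pointer and the panic is surfaced as an "internal error" log entry.

```go
package main

import "fmt"

// paramVector is a hypothetical stand-in for the vector that holds the
// values bound to the '?' placeholders of a prepared statement.
type paramVector struct {
	length int
}

// Length reads a field through the receiver, so calling it on a nil
// *paramVector triggers a nil pointer dereference panic.
func (v *paramVector) Length() int { return v.length }

// safeLength calls Length under recover and turns a panic into an error,
// similar in spirit to how the log above reports the panic as an error.
func safeLength(v *paramVector) (n int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("internal error: panic %v", r)
		}
	}()
	return v.Length(), nil
}

func main() {
	// Parameters present: the length is returned normally.
	n, err := safeLength(&paramVector{length: 4})
	fmt.Println(n, err)

	// Parameters missing (nil pointer): the panic is recovered as an error.
	n, err = safeLength(nil)
	fmt.Println(n, err)
}
```

Recovering the panic only changes how the failure is reported. The query still fails unless the parameter vector is actually populated before the expression is evaluated.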

zengyan1 avatar Jul 02 '24 13:07 zengyan1

Steps to reproduce, in a distributed environment, using mo-sysbench:

mysql> create database sysbench_db;

sysbench --mysql-host=127.0.0.1  --mysql-port=6001 --mysql-user=dump --mysql-password=111  select_random_ranges.lua --mysql-db=sysbench_db --tables=1 --table_size=10000000 --threads=1 --time=30 --report-interval=10 --range_selects=off --point_selects=1 prepare

sysbench --mysql-host=127.0.0.1  --mysql-port=6001 --mysql-user=dump --mysql-password=111  select_random_ranges.lua --mysql-db=sysbench_db --tables=1 --table_size=10000000 --threads=1 --time=30 --report-interval=10 --range_selects=off --point_selects=1 run

zengyan1 avatar Jul 03 '24 08:07 zengyan1

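At execute time, the '?' parameters of a prepared statement are stored in process.PrepareParams, but PrepareParams is not serialized. Differences in the optimizer rules may explain why earlier versions did not evaluate expressions containing PrepareParams on a remote CN.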
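A minimal Go sketch of this mechanism follows. It assumes nothing about MatrixOne's real wire format: `Process`, `Vector`, `ShipToRemote`, and `ParamCount` are hypothetical, and a JSON round trip stands in for whatever serialization is used to send a pipeline to another CN. The point is only that a field left out of serialization arrives as nil on the receiving side.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Vector is a hypothetical stand-in for the prepared-statement parameters.
type Vector struct {
	length int
}

// Length dereferences the receiver, so it must not be called on nil.
func (v *Vector) Length() int { return v.length }

// Process is a hypothetical per-query state object. PrepareParams is
// excluded from serialization (json:"-") to model the reported problem.
type Process struct {
	ID            string  `json:"id"`
	PrepareParams *Vector `json:"-"`
}

// ShipToRemote simulates sending the process state to another node by
// serializing it and decoding it into a fresh value.
func ShipToRemote(p *Process) (*Process, error) {
	data, err := json.Marshal(p)
	if err != nil {
		return nil, err
	}
	remote := &Process{}
	if err := json.Unmarshal(data, remote); err != nil {
		return nil, err
	}
	return remote, nil
}

// ParamCount checks for missing parameters before touching the vector.
func ParamCount(p *Process) (int, error) {
	if p.PrepareParams == nil {
		return 0, fmt.Errorf("process %s: prepare params not available", p.ID)
	}
	return p.PrepareParams.Length(), nil
}

func main() {
	local := &Process{ID: "q1", PrepareParams: &Vector{length: 10}}
	fmt.Println(ParamCount(local))

	remote, err := ShipToRemote(local)
	if err != nil {
		fmt.Println("ship failed:", err)
		return
	}
	// The remote copy kept its ID but lost the parameters.
	fmt.Println(ParamCount(remote))
}
```

In this sketch the local process sees all 10 parameters, while the remote copy has a nil `PrepareParams`. Any expression that needs the parameters can therefore only be evaluated safely where they still exist, or after they have been shipped explicitly.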

zengyan1 avatar Jul 03 '24 13:07 zengyan1

This bug was not introduced by PR #17184. The steps below reproduce it on main before that PR was merged. Branch: main. Commit: 04b97d0557a6db20aed2741bcd950efc47c14b65. Environment: TKE cluster with 2 CNs.

1. Connect to mo

create database sysbench_db;

2. Generate data with the mo_sysbench tool

sysbench --mysql-host=127.0.0.1  --mysql-port=6001 --mysql-user=dump --mysql-password=111  select_random_ranges.lua --mysql-db=sysbench_db --tables=1 --table_size=10000000 --threads=1 --time=30 --report-interval=10 --range_selects=off --point_selects=1 prepare

3. Connect to mo again

use sysbench_db;
set session optimizer_hints="blockFilter=2";
prepare sql1 from 'select count(k) from sbtest1 where k between ? and ? or k between ? and ? or k between ? and ? or k between ? and ? or k between ? and ?';
set @a1=215006;
set @a2=215011;
set @a3=214990;
set @a4=214995;
set @a5=215901;
set @a6=215906;
set @a7=215050;
set @a8=215055;
set @a9=214997;
set @a10=215002;
execute sql1 using @a1,@a2,@a3,@a4,@a5,@a6,@a7,@a8,@a9,@a10;

The panic then reproduces: (screenshot: 企业微信截图_e9c21a06-21bf-40a2-ae64-7f76ade50eb7)

zengyan1 avatar Jul 04 '24 09:07 zengyan1

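Once https://github.com/matrixorigin/matrixone/pull/17322 is merged, the sysbench random ranges test will no longer report this problem. However, the underlying issue, that prepare does not support multiple CNs, still needs a root-cause fix.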

badboynt1 avatar Jul 04 '24 09:07 badboynt1

Waiting for the PR to be merged.

zengyan1 avatar Jul 05 '24 10:07 zengyan1

Merged.

zengyan1 avatar Jul 08 '24 08:07 zengyan1

confirm, closed commit: b01bc0969ebb98f5d831d9f79c1c0311b27b9a7a job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9859053435/job/27247836878

heni02 avatar Jul 10 '24 03:07 heni02