[Bug]: dn crashed by oom continuously after stability test on distributed mode for about 24 hours
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch Name
main
Commit ID
42d05c2
Other Environment Information
- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
2*PROXY: 3C 6G
- OS type:
- Others:
Actual Behavior
After running the stability test in distributed mode for about 24 hours, the dn crashed with OOM continuously:
dn-log:
https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%223sB%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-42d05c2-20240430%5C%22,%20pod%3D%5C%22stability-regression-dis-dn-0%5C%22%7D%20%7C%3D%20%60panic%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221714526646000%22,%22to%22:%221714721523000%22%7D%7D%7D&schemaVersion=1&orgId=1
dn memory and cpu usage:
https://shanghai.idc.matrixorigin.cn:30001/d/85a562078cdf77779eaa1add43ccec1e/kubernetes-compute-resources-namespace-pods?orgId=1&var-datasource=prometheus&var-cluster=&var-namespace=mo-nightly-42d05c2-20240430&from=1714521600000&to=1714780799000
Expected Behavior
No response
Steps to Reproduce
1. run a mo cluster with the config in this issue
2. run tpch 10G loop test processes in one independent tenant
3. run tpcc long-running test processes with 10 warehouses and 10 terminals in one independent tenant, prepare mode
4. run sysbench mixed-case (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, non-prepare mode
5. run another sysbench mixed-case (insert/delete/update/select) long-running test process with 75 terminals in one independent tenant, non-prepare mode
Additional information
No response
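Starting from 04/30 20:01:28, no checkpoint was taken for 14 hours. As a result no truncate ran, too many WAL entries accumulated, and replay hit OOM.
The checkpoint could not be taken because prepare compact kept failing:
The reason why no checkpoint was taken is still unknown. It has occurred only once in stability testing so far. An alert for missing checkpoints has now been added, so production can be handled by manual intervention.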
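The missing-checkpoint alert mentioned above can be pictured as a simple staleness check. The sketch below is illustrative only and is not the alert that was actually added: the function name `checkpoint_is_stale`, the one-hour threshold, and the year 2024 (inferred from the namespace `mo-nightly-42d05c2-20240430`) are all assumptions.

```python
from datetime import datetime, timedelta


def checkpoint_is_stale(last_checkpoint: datetime,
                        now: datetime,
                        max_age: timedelta = timedelta(hours=1)) -> bool:
    """Return True when the newest checkpoint is older than max_age.

    max_age is a hypothetical threshold; pick a value that fits the
    cluster's normal checkpoint interval.
    """
    return now - last_checkpoint > max_age


# Timeline from this issue: the last checkpoint was at 04/30 20:01:28
# and nothing followed for 14 hours (year assumed to be 2024).
last_ckp = datetime(2024, 4, 30, 20, 1, 28)
observed_at = last_ckp + timedelta(hours=14)

if checkpoint_is_stale(last_ckp, observed_at):
    print("ALERT: no checkpoint for", observed_at - last_ckp)
```

With a check like this, the gap would be flagged long before the WAL backlog grows large enough to make replay run out of memory.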
Added logging: https://github.com/matrixorigin/matrixone/pull/16820
fixed by #18037
fixed