
[Bug]: dn crashed by oom continuously after stability test on distributed mode for about 24 hours

Open aressu1985 opened this issue 1 year ago • 2 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

main

Commit ID

42d05c2

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
2*PROXY: 3C 6G
- OS type:
- Others:

Actual Behavior

After about 24 hours of stability testing in distributed mode, the dn crashed with OOM repeatedly: image

dn-log: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%223sB%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-42d05c2-20240430%5C%22,%20pod%3D%5C%22stability-regression-dis-dn-0%5C%22%7D%20%7C%3D%20%60panic%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221714526646000%22,%22to%22:%221714721523000%22%7D%7D%7D&schemaVersion=1&orgId=1 image

dn memory and cpu usage:

https://shanghai.idc.matrixorigin.cn:30001/d/85a562078cdf77779eaa1add43ccec1e/kubernetes-compute-resources-namespace-pods?orgId=1&var-datasource=prometheus&var-cluster=&var-namespace=mo-nightly-42d05c2-20240430&from=1714521600000&to=1714780799000

image image

Expected Behavior

No response

Steps to Reproduce

1. Run a mo cluster with the config given in this issue.
2. Run tpch 10G loop test processes in one independent tenant.
3. Run tpcc long-running test processes with 10 warehouses and 10 terminals in one independent tenant, prepare mode.
4. Run sysbench mixed-case (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, non-prepare mode.
5. Run another sysbench mixed-case (insert/delete/update/select) long-running test process with 75 terminals in one independent tenant, non-prepare mode.

Additional information

No response

aressu1985 • May 06 '24 08:05

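Starting at 04/30 20:01:28, no checkpoint was taken for 14 hours. Without a checkpoint, no truncate ran, so too many WAL entries accumulated and the replay ran out of memory (OOM).

The checkpoint (ckp) could not be taken during that time because prepare compact kept failing: image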

jiangxinmeng1 • May 09 '24 03:05

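The reason no checkpoint was taken is still unknown. So far this has happened only once in the stability tests. An alert for missing checkpoints has now been added, so in production the problem can be handled by manual intervention.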

aressu1985 • Jun 07 '24 02:06

Added logging: https://github.com/matrixorigin/matrixone/pull/16820

jiangxinmeng1 • Jun 12 '24 10:06

Fixed by #18037

jiangxinmeng1 • Aug 12 '24 11:08

Fixed

aressu1985 • Aug 19 '24 02:08