[Bug]: [date 11.30] tke regression: dn oom
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Environment
- Version or commit-id (e.g. v0.1.0 or 8b23a93):f8dba6beb64044147e6415497f2107a3fe347603
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/7047941218/job/19201992829
dn oom:
mo log: http://132.232.112.34:30088/explore?panes=%7B%22dVe%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20231130%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-12h%22,%22to%22:%22now%22%7D%7D%7D&schemaVersion=1&orgId=1
profile: dnprofile.tar.gz
Expected Behavior
No response
Steps to Reproduce
No response
Additional information
No response
@LeftHandCold please help on it.
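This issue is hard to reproduce. The symptom is that memory usage suddenly grows by 12G within 5 seconds.
date 12.7: a dn restart also occurred on 1.1-dev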
job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/7129835846/job/19415261479
commit:b324450130a6ddbd95e0a625158e23732f62d2fe
profile:
dnprofile.tar.gz
loki log: http://132.232.112.34:30088/explore?panes=%7B%22xwd%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20231207%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221701967780000%22,%22to%22:%221701968077000%22%7D%7D%7D&schemaVersion=1&orgId=1
Not reproduced.
Still not reproduced.
No progress; working on the cache PR.
maybe fixed by https://github.com/matrixorigin/matrixone/pull/13952
dn oom occurred again
main commit:c333d9e3604d74c408468e2ad77e17868be46f31
job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/7594992779/job/20693071443
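Summary of what is known so far, for future reference:
- All 3 ooms mentioned above occurred during the data import phase (1 million rows) of the Delete scenario, before any Delete was actually executed. For comparison, the Select test has a similar data import phase, so we cannot conclude that data import by itself has some probability of triggering the oom
- The test scenario that runs before Delete, index insert, creates 20 tables and performs batch inserts
- The oom occurred around 01:06:44
- Between 01:06:31 and 01:06:36, a cumulative 17G of memory was allocated
- A global checkpoint was running from 01:06:31 to 01:06:41, and the oom occurred after it completed
- From 01:06:32 until the oom, the data imported for the Delete scenario was being flushed to disk. From 01:06:20 to 01:06:32, there was a fairly concentrated burst of flushes for the 20 tables from the Insert scenario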
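
To make the timeline easier to reason about, here is a minimal Python sketch that derives the allocation rate and the overlap between the allocation burst and the other activities. It uses only the timestamps and the 17G figure listed above; the variable and function names are illustrative and do not come from MatrixOne, and the flush window is assumed to end at the oom time.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def t(s):
    """Parse an HH:MM:SS timestamp from the timeline."""
    return datetime.strptime(s, FMT)


def seconds(start, end):
    """Length of [start, end] in seconds."""
    return (end - start).total_seconds()


def overlap(a, b):
    """Seconds during which windows a and b are both active (0 if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0.0, seconds(start, end))


# Values taken from the summary; names are hypothetical.
OOM_AT = t("01:06:44")
ALLOC_BURST = (t("01:06:31"), t("01:06:36"))   # 17G allocated in this window
GLOBAL_CKP = (t("01:06:31"), t("01:06:41"))    # global checkpoint
INSERT_FLUSH = (t("01:06:20"), t("01:06:32"))  # flushes for the 20 Insert tables
DELETE_FLUSH = (t("01:06:32"), OOM_AT)         # assumed to run until the oom
ALLOC_GB = 17

alloc_rate_gb_per_s = ALLOC_GB / seconds(*ALLOC_BURST)
ckp_end_to_oom_s = seconds(GLOBAL_CKP[1], OOM_AT)
burst_vs_ckp_s = overlap(ALLOC_BURST, GLOBAL_CKP)
burst_vs_insert_flush_s = overlap(ALLOC_BURST, INSERT_FLUSH)
burst_vs_delete_flush_s = overlap(ALLOC_BURST, DELETE_FLUSH)

if __name__ == "__main__":
    print(f"allocation rate: {alloc_rate_gb_per_s:.1f} G/s")
    print(f"checkpoint end to oom: {ckp_end_to_oom_s:.0f} s")
    print(f"burst overlap with global checkpoint: {burst_vs_ckp_s:.0f} s")
    print(f"burst overlap with Insert flush: {burst_vs_insert_flush_s:.0f} s")
    print(f"burst overlap with Delete flush: {burst_vs_delete_flush_s:.0f} s")
```

The numbers only restate the timeline: the whole allocation burst falls inside the global checkpoint window and mostly overlaps the Delete-data flush, which is why both are listed as suspects rather than either one being confirmed as the cause.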
commit: 1.1-dev
job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/7967811542/job/21768582521
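Currently mitigated in two ways: increasing mlimit and reducing the accumulation of in-memory data. Will observe after all PRs are merged.
Observing.
Has not occurred.
It has not recurred in the regression runs so far, so closing for now.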