
[Bug]: [date 11.30] tke regression: dn oom

Open heni02 opened this issue 2 years ago • 12 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93):f8dba6beb64044147e6415497f2107a3fe347603
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/7047941218/job/19201992829

(screenshot: 企业微信截图_f9c8bc1c-c8fc-4e9e-abd0-d65f7005bb38)
(screenshot: 企业微信截图_5eab3395-be94-4e18-8de9-46269aa06dce)

dn oom: (screenshot: 截屏2023-12-01 上午10 21 09的副本)

mo log: http://132.232.112.34:30088/explore?panes=%7B%22dVe%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20231130%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-12h%22,%22to%22:%22now%22%7D%7D%7D&schemaVersion=1&orgId=1

profile: dnprofile.tar.gz

Expected Behavior

No response

Steps to Reproduce

No response

Additional information

No response

heni02 avatar Dec 01 '23 02:12 heni02

@LeftHandCold please help on it.

volgariver6 avatar Dec 01 '23 02:12 volgariver6

This problem is hard to reproduce. The specific symptom is that memory usage suddenly grows by 12G within 5 seconds.

LeftHandCold avatar Dec 06 '23 11:12 LeftHandCold

date 12.7: a dn restart also occurred on 1.1-dev.

(screenshot: 企业微信截图_b2fc69c3-3615-435a-a907-b93028582041) (image)

job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/7129835846/job/19415261479
commit: b324450130a6ddbd95e0a625158e23732f62d2fe

(screenshot: 企业微信截图_8f5eda54-15f6-4396-a771-33b80f7f9c39)

profile: dnprofile.tar.gz

loki log: http://132.232.112.34:30088/explore?panes=%7B%22xwd%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20231207%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221701967780000%22,%22to%22:%221701968077000%22%7D%7D%7D&schemaVersion=1&orgId=1

heni02 avatar Dec 08 '23 02:12 heni02

Not reproduced.

LeftHandCold avatar Dec 28 '23 10:12 LeftHandCold

Still not reproduced.

LeftHandCold avatar Jan 11 '24 10:01 LeftHandCold

No progress; currently working on the cache PR.

LeftHandCold avatar Jan 16 '24 10:01 LeftHandCold

maybe fixed by https://github.com/matrixorigin/matrixone/pull/13952

XuPeng-SH avatar Jan 18 '24 03:01 XuPeng-SH

dn oom occurred again.

main commit: c333d9e3604d74c408468e2ad77e17868be46f31
job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/7594992779/job/20693071443

(screenshot: 企业微信截图_1dd8b656-1717-4127-8548-b9c3a8c4a6fb)
(screenshot: 企业微信截图_4751a9f3-8fe2-43d1-9f8a-1cace2f686c7)

loki: http://175.178.192.213:30088/explore?panes=%7B%22CSG%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240120%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221705799204344%22,%22to%22:%221705799206344%22%7D%7D%7D&schemaVersion=1&orgId=1

heni02 avatar Jan 22 '24 06:01 heni02

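Summarizing what is known so far, for future reference:

  1. All 3 ooms mentioned above occurred during the data import phase of the Delete scenario (1 million rows), before any Delete was actually executed. For comparison, the Select test has a similar data import phase, so we cannot conclude that data import by itself is what triggers the oom.
  2. The test scenario that runs just before Delete, index insert, creates 20 tables and performs batch inserts.
  3. The oom occurred at around 01:06:44.
  4. Between 01:06:31 and 01:06:36, a cumulative 17G of memory was allocated.
  5. A global checkpoint was running during 01:06:31 ~ 01:06:41, and the oom occurred after it completed.
  6. From 01:06:32 until the oom, the data imported for the Delete scenario was being flushed to disk. During 01:06:20 ~ 01:06:32, there was fairly concentrated flushing of the 20 tables from the Insert scenario.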

loki address

aptend avatar Jan 25 '24 10:01 aptend

commit: 1.1-dev
job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/7967811542/job/21768582521

(image)

loki address

aptend avatar Feb 21 '24 08:02 aptend

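For now this is being worked around in two ways: increasing mlimit and reducing in-memory data accumulation. Will observe after all the PRs are merged.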

aptend avatar Mar 05 '24 10:03 aptend

Under observation.

aptend avatar Mar 18 '24 10:03 aptend

Has not occurred.

aptend avatar Mar 26 '24 10:03 aptend

It has not occurred again in the regression runs so far; closing for now.

heni02 avatar Apr 03 '24 02:04 heni02