[Bug]: [date 2.26] tke regression: tpch 1T query caused cn oom

Open heni02 opened this issue 1 year ago • 14 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

main

Commit ID

bbf99e1d9

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

Job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/8050556301/job/21987841021
[WeCom screenshot]

The stream_closed error occurred because the CN ran out of memory and was OOM-killed.
[WeCom screenshot]
http://175.178.192.213:30088/d/cluster-detail-namespaced/cluster-detail-namespaced?orgId=1&var-namespace=mo-nightly-regression-20240226&var-account=All&var-interval=$__auto_interval_interval&var-cluster=.%2A&var-loki=loki

loki log: http://175.178.192.213:30088/explore?panes=%7B%22igz%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240226%5C%22%7D%20%7C%3D%20%60ERROR%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221708967400000%22,%22to%22:%221708967579000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

tke regression tpch1T test

Additional information

No response

heni02 avatar Feb 27 '24 03:02 heni02

Reproduced. The captured profile is below:
[WeCom screenshot]

mo_prof.tar.gz

heni02 avatar Feb 27 '24 09:02 heni02

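Nothing in the profile looks wrong. q18 is inherently the most memory-hungry query, and running q18 on tpch 1T is already very close to the OOM limit; memory usage of a little over 40 GB is reasonable. Meanwhile, the 3-CN 1T run on EKS did not OOM. So either some PR raised memory usage slightly, just enough to reach the OOM threshold, or something changed in the TKE environment that makes OOM easier to trigger.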

badboynt1 avatar Feb 29 '24 00:02 badboynt1

image

image

badboynt1 avatar Feb 29 '24 00:02 badboynt1

date 2.29: the single-node tpch 1T run on machine 128 also hit OOM (a newly added workflow, first run). commit: a9e9f2f4699bba03fb59958d5a9a81c90ec45cb4
https://github.com/matrixorigin/mo-nightly-regression/actions/runs/8097578526/job/22129082442
[WeCom screenshot] [WeCom screenshot]

The profile only covers this time window, so the cause cannot be pinpointed.
[WeCom screenshot]

TOML config on machine 128:
[WeCom screenshot]

service-type = "CN"
data-dir = "./mo-data"

[log]
level = "info"

[cn]
uuid = "dd1dccb4-4d3c-41f8-b482-5251dc7a41bf"
port-base = 18000

[cn.txn]
enable-leak-check = 1
max-active-ages = "2h"

[[fileservice]]
name = "LOCAL"
backend = "DISK"

[[fileservice]]
name = "SHARED"
backend = "DISK"
data-dir = "mo-data/s3"

[fileservice.cache]
memory-capacity = "32GB"
disk-capacity = "8GB"
disk-path = "mo-data/file-service-cache"
disk-min-evict-interval = "7m"
disk-evict-target = 0.8

[[fileservice]]
name = "ETL"
backend = "DISK-ETL"

heni02 avatar Mar 01 '24 07:03 heni02

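For the OOM on the single-node machine 128, I looked at heap usage in the profile: memcache was using 72 GB even though only 32 GB was configured. This is one factor that could lead to the OOM. @nnsgmsone could you take a look at why memory usage is this high?

[WeCom screenshot]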
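
To make the gap between the configured and observed cache size concrete, here is a minimal Python sketch that compares the two figures from this comment (32 GB configured via `memory-capacity`, about 72 GB seen in the heap profile). The helper names `parse_size` and `cache_overshoot` are hypothetical and not part of MatrixOne, and binary units (1 GB = 2^30 bytes) are an assumption.

```python
import re

# Assumption: sizes use binary units (1 GB = 2**30 bytes).
_UNITS = {"B": 1, "KB": 1 << 10, "MB": 1 << 20, "GB": 1 << 30, "TB": 1 << 40}


def parse_size(text: str) -> int:
    """Parse a size string such as "32GB" into a byte count."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", text, re.IGNORECASE)
    if m is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(m.group(1)) * _UNITS[m.group(2).upper()])


def cache_overshoot(configured: str, observed: str) -> float:
    """Return observed / configured; a value above 1.0 means the cap was exceeded."""
    return parse_size(observed) / parse_size(configured)


if __name__ == "__main__":
    # Figures taken from this comment: 32 GB configured, ~72 GB observed.
    ratio = cache_overshoot("32GB", "72GB")
    print(f"memory cache is at {ratio:.2f}x its configured capacity")
```

With the numbers above, the cache sits at 2.25 times its configured capacity, which is far more than rounding or accounting noise would explain.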

badboynt1 avatar Mar 05 '24 02:03 badboynt1

No progress.

nnsgmsone avatar Mar 08 '24 10:03 nnsgmsone

On the single-node machine 128, change the TN's memcache setting back to the default and try again.

badboynt1 avatar Mar 13 '24 07:03 badboynt1

No progress.

nnsgmsone avatar Mar 18 '24 11:03 nnsgmsone

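On the single-node machine 128, OOM no longer occurs after changing the DN's memcache setting. On the TKE 3-CN cluster, memcache cannot be changed for now. No further ideas at the moment.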

badboynt1 avatar Mar 20 '24 09:03 badboynt1

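Even after the cache refactor this can still OOM with some probability; it is most likely caused by the query itself using too much memory.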

nnsgmsone avatar Mar 25 '24 10:03 nnsgmsone

Waiting for the agg refactor or spill to solve this.

badboynt1 avatar Mar 28 '24 10:03 badboynt1

https://github.com/matrixorigin/matrixone/issues/15669 Once this issue is done, memory allocation for tpch q18 will drop significantly.

badboynt1 avatar Apr 29 '24 06:04 badboynt1

date 5.13: the regression hit OOM again. commit: b88a2b8e1
job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9064806413/job/24936147501
[WeCom screenshot] [WeCom screenshot] [WeCom screenshot]
tpch 1T performance also dropped.
[WeCom screenshot]

The regression on the 14th also hit OOM; nitao is manually capturing a profile.
job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9081256748
[image]

mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22Siy%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240514%5C%22%7D%20%7C%3D%20%60stream%20closed%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D,%7B%22refId%22:%22B%22,%22expr%22:%22%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D%7D%5D,%22range%22:%7B%22from%22:%221715760196000%22,%22to%22:%221715760285000%22%7D%7D%7D&schemaVersion=1&orgId=1

Last commit without OOM: 82d0eaa17. Commit with OOM: b88a2b8e1

heni02 avatar May 15 '24 08:05 heni02

date 5.16 and 5.17: Q18 OOM occurred again; it reproduces every time.
job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9129933033/job/25124440112
[image] [image]

heni02 avatar May 18 '24 06:05 heni02

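Judging from the daily tests, 3-CN tpch 1T no longer OOMs, and peak memory is back to the 1.1 level. As for the 4-CN OOM when running after tpcc and sysbench, that is caused by memory not being released and is tracked in a separate issue.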

badboynt1 avatar May 21 '24 01:05 badboynt1

Confirmed, closed. commit: cda0e9073
https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9329068377
[WeCom screenshot]

heni02 avatar Jun 03 '24 10:06 heni02