[Bug]: [date 2.26] tke regression: tpch 1T query caused cn oom
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch Name
main
Commit ID
bbf99e1d9
Other Environment Information
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/8050556301/job/21987841021
The stream_closed error was reported because the CN was OOM-killed.
Grafana dashboard: http://175.178.192.213:30088/d/cluster-detail-namespaced/cluster-detail-namespaced?orgId=1&var-namespace=mo-nightly-regression-20240226&var-account=All&var-interval=$__auto_interval_interval&var-cluster=.%2A&var-loki=loki
loki log: http://175.178.192.213:30088/explore?panes=%7B%22igz%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240226%5C%22%7D%20%7C%3D%20%60ERROR%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221708967400000%22,%22to%22:%221708967579000%22%7D%7D%7D&schemaVersion=1&orgId=1
Expected Behavior
No response
Steps to Reproduce
TKE regression TPCH 1T test
Additional information
No response
Nothing obviously wrong shows up in the profile. Q18 is the most memory-intensive query to begin with, and running Q18 on TPCH 1T is already very close to the OOM edge, so a memory footprint of roughly 40+ GB is reasonable. Meanwhile, the 3-CN 1T run on EKS did not OOM. So either some PR slightly increased memory usage, just enough to reach the OOM threshold, or something changed in the TKE environment that makes OOM easier to trigger.
date 2.29: OOM also occurred on the 128 standalone machine running TPCH 1T (a newly added pipeline, first run). commit: a9e9f2f4699bba03fb59958d5a9a81c90ec45cb4
https://github.com/matrixorigin/mo-nightly-regression/actions/runs/8097578526/job/22129082442
The profile only covers this time window, so the root cause cannot be pinpointed.
toml config on the 128 standalone machine:
```toml
service-type = "CN"
data-dir = "./mo-data"

[log]
level = "info"

[cn]
uuid = "dd1dccb4-4d3c-41f8-b482-5251dc7a41bf"
port-base = 18000

[cn.txn]
enable-leak-check = 1
max-active-ages = "2h"

[[fileservice]]
name = "LOCAL"
backend = "DISK"

[[fileservice]]
name = "SHARED"
backend = "DISK"
data-dir = "mo-data/s3"

[fileservice.cache]
memory-capacity = "32GB"
disk-capacity = "8GB"
disk-path = "mo-data/file-service-cache"
disk-min-evict-interval = "7m"
disk-evict-target = 0.8

[[fileservice]]
name = "ETL"
backend = "DISK-ETL"
```
For the OOM on the 128 standalone machine, the heap usage in the profile shows memcache taking 72 GB even though only 32 GB is configured. This is one factor that could be causing the OOM. @nnsgmsone please take a look at why the memory usage is this high.
No progress.
On the 128 standalone machine, change the TN's memcache config back to the default and try again.
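For reference, a minimal sketch of what that retry could look like, assuming the TN/DN config uses the same `[fileservice.cache]` keys as the CN config shown above. "Changing back to the default" presumably just means dropping the explicit override so the built-in defaults apply; the explicit capacity below is only an illustrative placeholder, not a verified default.

```toml
# Hypothetical sketch only: reuses the [fileservice.cache] keys from the CN config above.
# Option A: delete the whole [fileservice.cache] section so the built-in defaults apply.
# Option B: keep the section but cap the in-memory cache explicitly, e.g.:
[fileservice.cache]
memory-capacity = "8GB"   # illustrative placeholder, not the verified default
disk-capacity = "8GB"
disk-path = "mo-data/file-service-cache"
disk-min-evict-interval = "7m"
disk-evict-target = 0.8
```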
No progress.
On the 128 standalone machine, there is no more OOM after modifying the DN's memcache. On TKE with 3 CNs, the memcache cannot be modified for now. No clear ideas yet.
Even after the cache refactor, there is still a chance of OOM here; it should be caused by the query itself using too much memory.
Waiting for the agg refactor, or spill support, to fix this.
Once https://github.com/matrixorigin/matrixone/issues/15669 is completed, TPCH Q18's memory allocation will drop significantly.
date 5.13: the regression hit OOM again. commit: b88a2b8e1
job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9064806413/job/24936147501
TPCH 1T performance also regressed at the same time.
The regression on the 14th also hit OOM; nitao is manually capturing a profile.
job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9081256748
mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22Siy%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240514%5C%22%7D%20%7C%3D%20%60stream%20closed%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D,%7B%22refId%22:%22B%22,%22expr%22:%22%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D%7D%5D,%22range%22:%7B%22from%22:%221715760196000%22,%22to%22:%221715760285000%22%7D%7D%7D&schemaVersion=1&orgId=1
Last commit without OOM: 82d0eaa17. Commit with OOM: b88a2b8e1.
date 5.16, 5.17: Q18 OOM occurred again, reproducible every time.
job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9129933033/job/25124440112
Based on the daily tests, running TPCH 1T on 3 CNs no longer OOMs, and the memory peak is back to the 1.1 release level. As for the OOM when running on 4 CNs after TPCC and sysbench, that is caused by memory not being released and will be tracked in a separate issue.
Confirmed, closed.
commit: cda0e9073
https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9329068377