
[Bug]: mo killed by OOM or panics with out of memory during stability test in standalone mode

Open aressu1985 opened this issue 2 years ago • 32 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93):dfe1f363cf6f51c970852f15a40b7c2f80e06c33
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

During the stability test in standalone mode, mo was killed by the OOM killer:

[Tue Oct 17 03:47:55 2023] [3311950] 0 3311950 3181 303 65536 0 0 sh
[Tue Oct 17 03:47:55 2023] [3311952] 0 3311952 1391 14 53248 0 0 sshd
[Tue Oct 17 03:47:55 2023] [3311956] 0 3311956 12064 472 106496 0 0 crond
[Tue Oct 17 03:47:55 2023] [3311958] 0 3311958 248 1 20480 0 0 sshd
[Tue Oct 17 03:47:55 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-17883.scope,task=mo-service,pid=2249519,uid=1000
[Tue Oct 17 03:47:55 2023] Out of memory: Killed process 2249519 (mo-service) total-vm:68841996kB, anon-rss:59740952kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:121416kB oom_score_adj:500
[Tue Oct 17 03:47:59 2023] oom_reaper: reaped process 2249519 (mo-service), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Sometimes it instead panics with out of memory:

fatal error: runtime: out of memory

runtime stack:
runtime.throw({0x36a3b85?, 0x0?})
	/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0x7ff389ff7000 sp=0x7ff389ff6fd0 pc=0x4cc13d
runtime.sysMapOS(0xcad4400000, 0x37800000?)

panic log: mo-panic-service.tar.gz

oom profile: profile_oom.tar.gz

oom log: oom_service.tar.gz
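
For reading kernel logs like the one above, here is a minimal sketch (not part of the original report) that pulls the pid, process name, and memory sizes out of the "Killed process" summary line and converts them to GiB. The names `ParseOOMKill`, `OOMKill`, and `KBToGiB` are made up for illustration. The sketch assumes the line format shown in this issue, which may differ across kernel versions.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// oomLine matches the kernel's "Killed process" summary line in the format
// shown in this issue (an assumption; other kernels may print it differently).
var oomLine = regexp.MustCompile(`Killed process (\d+) \(([^)]+)\) total-vm:(\d+)kB, anon-rss:(\d+)kB`)

// OOMKill holds the fields extracted from one OOM-kill summary line.
type OOMKill struct {
	PID       int
	Name      string
	TotalVMKB int64
	AnonRSSKB int64
}

// ParseOOMKill extracts pid, process name and memory sizes from a dmesg line.
// It returns false when the line is not an OOM-kill summary.
func ParseOOMKill(line string) (OOMKill, bool) {
	m := oomLine.FindStringSubmatch(line)
	if m == nil {
		return OOMKill{}, false
	}
	// The regexp only admits digits, so conversion errors are limited to
	// overflow; they are ignored here to keep the sketch short.
	pid, _ := strconv.Atoi(m[1])
	vm, _ := strconv.ParseInt(m[3], 10, 64)
	rss, _ := strconv.ParseInt(m[4], 10, 64)
	return OOMKill{PID: pid, Name: m[2], TotalVMKB: vm, AnonRSSKB: rss}, true
}

// KBToGiB converts kB as printed by the kernel (1 kB = 1024 bytes) to GiB.
func KBToGiB(kb int64) float64 { return float64(kb) / (1024 * 1024) }

func main() {
	line := "[Tue Oct 17 03:47:55 2023] Out of memory: Killed process 2249519 (mo-service) total-vm:68841996kB, anon-rss:59740952kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:121416kB oom_score_adj:500"
	if k, ok := ParseOOMKill(line); ok {
		fmt.Printf("%s (pid %d): anon-rss %.2f GiB, total-vm %.2f GiB\n",
			k.Name, k.PID, KBToGiB(k.AnonRSSKB), KBToGiB(k.TotalVMKB))
	}
}
```

For the line in this report, the anon-rss of 59740952 kB works out to roughly 57 GiB of anonymous resident memory held by mo-service when it was killed.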

Expected Behavior

No response

Steps to Reproduce

Run mo in standalone mode and run the following tests:
1. bvt loop
   git clone mo-tester
   ./run.sh -n -g -p  xxxxx/matrixone/test/distributed/cases -e ddl -t 100

2. tpch loop
   git clone mo-tpch, load tpch 10G data
   ./run.sh -q all -s 10 -t 50000

3. sysbench mixed
   git clone mo-load, and init 10 tables with 100000 rows per table
   ./start.sh -m SYSBENCH -n 10 -s 100000
   ./start.sh -c cases/sysbench/mixed_10_100000/ -d 5000 -g

4. tpcc 10-10 long run
   git clone mo-tpcc, and load tpcc 10 warehouse data, and modify runMins to 5000 in props.mo
   ./runBenchmark.sh props.mo

Additional information

No response

aressu1985 avatar Oct 17 '23 01:10 aressu1985

After a brief analysis, this error is reported by the mmap system call, with error code ENOMEM. Still investigating.

nnsgmsone avatar Oct 17 '23 08:10 nnsgmsone

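Currently mo_tables, mo_database, and mo_columns keep growing and cannot be garbage collected. The OOM after long runs may be related to this; that still needs to be confirmed. In the profile, the btree is already rather large and keeps trending upward. This may be the same problem as issue #11650.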

nnsgmsone avatar Oct 17 '23 09:10 nnsgmsone

This is tied to a 1.1.0 feature, so lowering the priority for now and changing the milestone.

aressu1985 avatar Oct 18 '23 02:10 aressu1985

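Analysis shows that in this environment a large amount of mo's memory cannot be reclaimed by the OS for some unknown reason (Go has already released this memory). Still investigating. Very strange.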

nnsgmsone avatar Oct 18 '23 08:10 nnsgmsone

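After removing the influence of the environment, the following is the real profile: [Screenshot 2023-10-19 10-11-55] Memory usage currently has two main sources. One is prune log, which needs to be fixed. The other is logtail, which has to wait for #11650 to be scheduled.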

nnsgmsone avatar Oct 19 '23 02:10 nnsgmsone

A colleague (泽兄) is looking into the prune log problem; related issue: 12190.

nnsgmsone avatar Oct 24 '23 10:10 nnsgmsone

For the cache problem, the code for the launch case is basically complete.

nnsgmsone avatar Oct 27 '23 10:10 nnsgmsone

Currently prioritizing 12385.

nnsgmsone avatar Nov 01 '23 10:11 nnsgmsone

Today, prioritizing the load problem.

nnsgmsone avatar Nov 07 '23 10:11 nnsgmsone

On leave; not yet handled.

nnsgmsone avatar Nov 10 '23 11:11 nnsgmsone

In progress. The specific problem is https://github.com/matrixorigin/matrixone/issues/12731

nnsgmsone avatar Nov 15 '23 10:11 nnsgmsone

Will merge the PR once testing is done.

nnsgmsone avatar Nov 20 '23 10:11 nnsgmsone

Still working on the morpc-related problems on the mem branch.

nnsgmsone avatar Nov 23 '23 10:11 nnsgmsone

Testing shows that both eks and 129 run fine: https://github.com/matrixorigin/mo-auto-test/actions/runs/7015615926/job/19085534500. The "stream closed" problem on tke is waiting to be reproduced and will then be fixed.

nnsgmsone avatar Nov 28 '23 10:11 nnsgmsone

Investigating the tpcc performance regression on the branch.

nnsgmsone avatar Dec 01 '23 10:12 nnsgmsone

Today, fixing #13219 and adding more metrics.

nnsgmsone avatar Dec 06 '23 10:12 nnsgmsone

No progress.

nnsgmsone avatar Dec 08 '23 10:12 nnsgmsone

No progress.

nnsgmsone avatar Dec 13 '23 10:12 nnsgmsone

Waiting on https://github.com/matrixorigin/matrixone/issues/12532

nnsgmsone avatar Dec 20 '23 10:12 nnsgmsone

Waiting on https://github.com/matrixorigin/matrixone/issues/12532

nnsgmsone avatar Dec 25 '23 10:12 nnsgmsone

Waiting on https://github.com/matrixorigin/matrixone/issues/12532

nnsgmsone avatar Dec 28 '23 10:12 nnsgmsone

Discussing https://github.com/matrixorigin/matrixone/issues/12532 with colleagues from the storage team.

nnsgmsone avatar Jan 03 '24 10:01 nnsgmsone

The memory problem is waiting on #12532.

nnsgmsone avatar Jan 08 '24 10:01 nnsgmsone

The memory problem is waiting on https://github.com/matrixorigin/matrixone/issues/12532

nnsgmsone avatar Jan 25 '24 10:01 nnsgmsone

The memory problem is waiting on https://github.com/matrixorigin/matrixone/issues/12532

nnsgmsone avatar Jan 30 '24 10:01 nnsgmsone

The memory problem is waiting on https://github.com/matrixorigin/matrixone/issues/12532

nnsgmsone avatar Feb 02 '24 10:02 nnsgmsone

No progress.

nnsgmsone avatar Feb 21 '24 13:02 nnsgmsone

No progress.

nnsgmsone avatar Feb 26 '24 10:02 nnsgmsone

Waiting for the PR to be merged.

nnsgmsone avatar Feb 29 '24 10:02 nnsgmsone

Waiting for the PR to be merged.

nnsgmsone avatar Mar 05 '24 10:03 nnsgmsone