[Bug]: mo killed by OOM or panics with out-of-memory during stability test in standalone mode
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Environment
- Version or commit-id (e.g. v0.1.0 or 8b23a93): dfe1f363cf6f51c970852f15a40b7c2f80e06c33
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
During the stability test in standalone mode, mo was killed by the OOM killer:

```
[Tue Oct 17 03:47:55 2023] [3311950] 0 3311950 3181 303 65536 0 0 sh
[Tue Oct 17 03:47:55 2023] [3311952] 0 3311952 1391 14 53248 0 0 sshd
[Tue Oct 17 03:47:55 2023] [3311956] 0 3311956 12064 472 106496 0 0 crond
[Tue Oct 17 03:47:55 2023] [3311958] 0 3311958 248 1 20480 0 0 sshd
[Tue Oct 17 03:47:55 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-17883.scope,task=mo-service,pid=2249519,uid=1000
[Tue Oct 17 03:47:55 2023] Out of memory: Killed process 2249519 (mo-service) total-vm:68841996kB, anon-rss:59740952kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:121416kB oom_score_adj:500
[Tue Oct 17 03:47:59 2023] oom_reaper: reaped process 2249519 (mo-service), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```

Or sometimes mo panics with an out-of-memory fatal error:

```
fatal error: runtime: out of memory

runtime stack:
runtime.throw({0x36a3b85?, 0x0?})
	/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0x7ff389ff7000 sp=0x7ff389ff6fd0 pc=0x4cc13d
runtime.sysMapOS(0xcad4400000, 0x37800000?)
```
panic log: mo-panic-service.tar.gz
oom profile: profile_oom.tar.gz
oom log: oom_service.tar.gz
Expected Behavior
No response
Steps to Reproduce
Run mo-tester against a standalone mo and run the following tests:
1. bvt loop: `git clone mo-tester`, then
```
./run.sh -n -g -p xxxxx/matrixone/test/distributed/cases -e ddl -t 100
```
2. tpch loop: `git clone mo-tpch`, load the tpch 10G data, then
```
./run.sh -q all -s 10 -t 50000
```
3. sysbench mixed: `git clone mo-load`, init 10-tables-100000-per-table, then
```
./start.sh -m SYSBENCH -n 10 -s 100000
./start.sh -c cases/sysbench/mixed_10_100000/ -d 5000 -g
```
4. tpcc 10-10 long run: `git clone mo-tpcc`, load the tpcc 10-warehouse data, change runMins to 5000 in props.mo, then
```
./runBenchmark.sh props.mo
```
Additional information
No response
After a preliminary analysis, this error is reported by the mmap system call, with errno ENOMEM. Still locating the root cause.
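For reference, a minimal diagnostic sketch (a standalone Go program, not mo code) that prints the numbers needed to tell the usual causes of an mmap ENOMEM apart: hitting `vm.max_map_count`, hitting the overcommit limit (`vm.overcommit_memory=2`), or genuine memory exhaustion. `runtime.sysMapOS` in the stack above is the runtime mapping more heap via mmap, so these limits apply to it directly.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
)

// countLines counts entries in a /proc file, e.g. the mappings in /proc/self/maps.
func countLines(path string) int {
	f, err := os.Open(path)
	if err != nil {
		return -1
	}
	defer f.Close()
	n := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		n++
	}
	return n
}

// readSysctl returns the value of a sysctl exposed under /proc/sys.
func readSysctl(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		return "?"
	}
	return strings.TrimSpace(string(b))
}

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("mappings in /proc/self/maps: %d (vm.max_map_count=%s)\n",
		countLines("/proc/self/maps"), readSysctl("/proc/sys/vm/max_map_count"))
	fmt.Printf("vm.overcommit_memory=%s\n", readSysctl("/proc/sys/vm/overcommit_memory"))
	fmt.Printf("Go heap: Sys=%d HeapInuse=%d HeapReleased=%d\n",
		ms.Sys, ms.HeapInuse, ms.HeapReleased)
}
```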
Right now mo_tables, mo_database, and mo_columns keep growing and cannot be GC'd. The OOM after long runs may be related to this; to be confirmed. In the profile, the btree is already rather large and keeps trending up. This may be the same problem as issue #11650.
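To track that growth over time, one option (a sketch, assuming a process without a pprof endpoint already wired in; mo-service may already expose one) is to register `net/http/pprof` and diff periodic heap snapshots:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Take snapshots with: curl http://localhost:6060/debug/pprof/heap > heap.pb.gz
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

Snapshots can then be compared with `go tool pprof -base heap-old.pb.gz heap-new.pb.gz` to see which allocation sites (e.g. the btree) keep growing.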
This is tied to a 1.1.0 FEATURE; downgrading the priority for now and updating the milestone.
From the analysis, on this environment a large amount of mo's memory cannot be reclaimed by the OS for unknown reasons, even though Go has already freed it. Still investigating; very strange.
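One quick check (a hedged sketch, not mo code): if Go reports memory as released (`HeapReleased`) but RSS stays high, the pages may have been returned with MADV_FREE, which the kernel reclaims only under memory pressure (the default in Go 1.12–1.15; since Go 1.16 the default is MADV_DONTNEED, also forceable with `GODEBUG=madvdontneed=1`, so RSS should normally drop promptly). Forcing an eager return makes the two cases easy to distinguish:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/debug"
	"strings"
)

// rss reads the VmRSS line from /proc/self/status (Linux only).
func rss() string {
	b, err := os.ReadFile("/proc/self/status")
	if err != nil {
		return "?"
	}
	for _, line := range strings.Split(string(b), "\n") {
		if strings.HasPrefix(line, "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:"))
		}
	}
	return "?"
}

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("before: HeapReleased=%d B, RSS=%s\n", ms.HeapReleased, rss())

	debug.FreeOSMemory() // run a GC and return as much memory to the OS as possible

	runtime.ReadMemStats(&ms)
	fmt.Printf("after:  HeapReleased=%d B, RSS=%s\n", ms.HeapReleased, rss())
}
```

If RSS drops only after `FreeOSMemory`, the memory was genuinely free; if it never drops, something outside the Go heap (cgo allocations, mmapped files, cache) is holding it.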
After removing the environmental factors, what follows is the real profile:
The memory usage currently has two main sources: one is the prune log, which needs a fix; the other is logtail, which has to wait for the plan for #11650.
The prune log problem is being investigated by 泽兄; related to issue #12190.
For the cache problem, the code for the launch scenario is mostly done.
Prioritizing #12385.
Today, prioritizing the load problem.
On vacation; not handled yet.
In progress; the specific problem is https://github.com/matrixorigin/matrixone/issues/12731.
Waiting for testing, then the PR will be merged.
Continuing to work on the morpc-related problems on the mem branch.
Tests show both eks and 129 can run: https://github.com/matrixorigin/mo-auto-test/actions/runs/7015615926/job/19085534500. Waiting to reproduce the tke "stream closed" error, then will fix it.
Locating the tpcc performance regression on that branch.
Today: fixing #13219 and adding more metrics.
No progress.
No progress.
Waiting on https://github.com/matrixorigin/matrixone/issues/12532
Waiting on https://github.com/matrixorigin/matrixone/issues/12532
Waiting on https://github.com/matrixorigin/matrixone/issues/12532
Discussing https://github.com/matrixorigin/matrixone/issues/12532 with the storage colleagues.
The memory problem is waiting on #12532.
The memory problem is waiting on https://github.com/matrixorigin/matrixone/issues/12532.
The memory problem is waiting on https://github.com/matrixorigin/matrixone/issues/12532.
The memory problem is waiting on https://github.com/matrixorigin/matrixone/issues/12532.
No progress.
No progress.
Waiting for the PR to be merged.
Waiting for the PR to be merged.