[Bug]: tke regression :sysbench bulk_insert 1000threads test cn and dn oom
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch Name
main
Commit ID
6e2ff3c8c4e142151f089e37167a4e79d115638a
Other Environment Information
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
测试环境:128服务器,64c/256G 测试场景:sysbench bulk_insert批量数据插入测试,并发1000 测试结果:sysbench bulk_insert thread 1000在运行到300s快结束时内存涨到98%,一直不释放(bulk_insert 100threads时测试正常,没有oom情况)
[github@mo-srv-128 mo-sysbench]$ sysbench --mysql-host=127.0.0.1 --mysql-port=6001 --mysql-user=dump --mysql-password=111 --db-ps-mode=disable bulk_insert.lua --mysql-db=sysbench_db --tables=10 --table_size=1000000 --threads=1000 --time=300 --report-interval=10 --range_selects=off --point_selects=1 run sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options: Number of threads: 1000 Report intermediate results every 10 second(s) Initializing random number generator from current time
Initializing worker threads...
Threads started!
[ 10s ] thds: 1000 tps: 3985006.44 qps: 42.80 (r/w/o: 0.00/42.80/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 20s ] thds: 1000 tps: 3997982.11 qps: 71.83 (r/w/o: 0.00/71.83/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 30s ] thds: 1000 tps: 2917418.92 qps: 81.10 (r/w/o: 0.00/81.10/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 40s ] thds: 1000 tps: 3812933.18 qps: 111.68 (r/w/o: 0.00/111.68/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 50s ] thds: 1000 tps: 2113941.85 qps: 64.52 (r/w/o: 0.00/64.52/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 60s ] thds: 1000 tps: 1579320.65 qps: 48.20 (r/w/o: 0.00/48.20/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 70s ] thds: 1000 tps: 976428.32 qps: 29.80 (r/w/o: 0.00/29.80/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 80s ] thds: 1000 tps: 1500685.87 qps: 45.80 (r/w/o: 0.00/45.80/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 90s ] thds: 1000 tps: 1124284.81 qps: 34.40 (r/w/o: 0.00/34.40/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 100s ] thds: 1000 tps: 1480610.13 qps: 45.10 (r/w/o: 0.00/45.10/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 110s ] thds: 1000 tps: 848639.94 qps: 25.90 (r/w/o: 0.00/25.90/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 120s ] thds: 1000 tps: 1359978.78 qps: 41.60 (r/w/o: 0.00/41.60/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 130s ] thds: 1000 tps: 1199031.96 qps: 36.50 (r/w/o: 0.00/36.50/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 140s ] thds: 1000 tps: 445625.09 qps: 13.60 (r/w/o: 0.00/13.60/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 150s ] thds: 1000 tps: 1058343.24 qps: 32.30 (r/w/o: 0.00/32.30/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 160s ] thds: 1000 tps: 458678.91 qps: 14.00 (r/w/o: 0.00/14.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 170s ] thds: 1000 tps: 943657.11 qps: 28.80 (r/w/o: 0.00/28.80/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 180s ] thds: 1000 tps: 792994.45 qps: 24.20 (r/w/o: 0.00/24.20/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 190s ] thds: 1000 tps: 1266270.01 qps: 38.90 (r/w/o: 0.00/38.90/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 200s ] thds: 1000 tps: 1223911.06 qps: 37.10 (r/w/o: 0.00/37.10/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 210s ] thds: 1000 tps: 1218959.57 qps: 37.20 (r/w/o: 0.00/37.20/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 220s ] thds: 1000 tps: 1369619.70 qps: 41.80 (r/w/o: 0.00/41.80/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 230s ] thds: 1000 tps: 1100615.59 qps: 33.60 (r/w/o: 0.00/33.60/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 240s ] thds: 1000 tps: 993125.51 qps: 30.30 (r/w/o: 0.00/30.30/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 250s ] thds: 1000 tps: 1389278.58 qps: 42.40 (r/w/o: 0.00/42.40/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 260s ] thds: 1000 tps: 1376159.42 qps: 42.00 (r/w/o: 0.00/42.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 270s ] thds: 1000 tps: 946944.23 qps: 28.90 (r/w/o: 0.00/28.90/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 280s ] thds: 1000 tps: 1363067.80 qps: 41.60 (r/w/o: 0.00/41.60/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 290s ] thds: 1000 tps: 1310638.75 qps: 40.00 (r/w/o: 0.00/40.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 300s ] thds: 1000 tps: 1310640.10 qps: 40.00 (r/w/o: 0.00/40.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 310s ] thds: 1000 tps: 18.00 qps: 18.00 (r/w/o: 0.00/18.00/0.00) lat (ms,95%): 30469.90 err/s: 0.00 reconn/s: 0.00 [ 320s ] thds: 1000 tps: 27.80 qps: 27.80 (r/w/o: 0.00/27.80/0.00) lat (ms,95%): 33946.25 err/s: 0.00 reconn/s: 0.00 [ 330s ] thds: 1000 tps: 17.30 qps: 17.30 (r/w/o: 0.00/17.30/0.00) lat (ms,95%): 38506.38 err/s: 0.00 reconn/s: 0.00 [ 340s ] thds: 1000 tps: 15.60 qps: 15.60 (r/w/o: 0.00/15.60/0.00) lat (ms,95%): 41382.17 err/s: 0.00 reconn/s: 0.00 [ 350s ] thds: 1000 tps: 2.40 qps: 2.40 (r/w/o: 0.00/2.40/0.00) lat (ms,95%): 54214.49 err/s: 0.00 reconn/s: 0.00 [ 360s ] thds: 1000 tps: 2.50 qps: 2.50 (r/w/o: 0.00/2.50/0.00) lat (ms,95%): 63752.41 err/s: 0.00 reconn/s: 0.00 [ 370s ] thds: 1000 tps: 1.10 qps: 1.10 (r/w/o: 0.00/1.10/0.00) lat (ms,95%): 71026.01 err/s: 0.00 reconn/s: 0.00 [ 380s ] thds: 1000 tps: 1.30 qps: 1.30 (r/w/o: 0.00/1.30/0.00) lat (ms,95%): 83521.57 err/s: 0.00 reconn/s: 0.00 [ 390s ] thds: 1000 tps: 1.80 qps: 1.80 (r/w/o: 0.00/1.80/0.00) lat (ms,95%): 89759.24 err/s: 0.00 reconn/s: 0.00 [ 400s ] thds: 1000 tps: 0.30 qps: 0.30 (r/w/o: 0.00/0.30/0.00) lat (ms,95%): 96462.77 err/s: 0.00 reconn/s: 0.00 [ 410s ] thds: 1000 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 420s ] thds: 1000 tps: 0.10 qps: 0.10 (r/w/o: 0.00/0.10/0.00) lat (ms,95%): 100000.00 err/s: 0.00 reconn/s: 0.00 [ 430s ] thds: 1000 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 440s ] thds: 1000 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 451s ] thds: 1000 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 460s ] thds: 1000 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00 [ 470s ] thds: 1000 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
运行到300s后mo-service资源使用情况,cpu基本占满,导致服务器无法连接登录,测试了两次都是一样的现象
运行到400s时内存使用情况:
[root@mo-srv-128 ~]# free -m
total used free shared buff/cache available
Mem: 257392 256336 632 20 423 781
Swap: 19999 19999 0
[root@mo-srv-128 ~]# free -m
total used free shared buff/cache available
Mem: 257392 256364 609 17 418 758
Swap: 19999 19999 0
sbtest表schema:
CREATE TABLE sbtest1 (
id INT NOT NULL,
k INT DEFAULT '0',
PRIMARY KEY (id)
)
mo toml配置: toml.tar.gz `cn: service-type = "CN" data-dir = "./mo-data"
[log] level = "info"
[cn] uuid = "dd1dccb4-4d3c-41f8-b482-5251dc7a41bf" port-base = 18000
[cn.txn] enable-leak-check = 1 max-active-ages = "2h"
[[fileservice]] name = "LOCAL" backend = "DISK"
[[fileservice]] name = "SHARED" backend = "DISK" data-dir = "mo-data/s3"
[fileservice.cache] memory-capacity = "32GB" disk-capacity = "8GB" disk-path = "mo-data/file-service-cache" disk-min-evict-interval = "7m" disk-evict-target = 0.8
[[fileservice]] name = "ETL" backend = "DISK-ETL" `
Expected Behavior
No response
Steps to Reproduce
1.使用mo-sysbench测试工具:https://github.com/matrixorigin/mo-sysbench.git
2. cd sysbench执行以下命令
sysbench bulk_insert 命令:
sysbench --mysql-host=127.0.0.1 --mysql-port=6001 --mysql-user=dump --mysql-password=111 --db-ps-mode=disable bulk_insert.lua --mysql-db=sysbench_db --tables=10 --table_size=1000000 --threads=1000 --time=300 --report-interval=10 --range_selects=off --point_selects=1 prepare
sysbench --mysql-host=127.0.0.1 --mysql-port=6001 --mysql-user=dump --mysql-password=111 --db-ps-mode=disable bulk_insert.lua --mysql-db=sysbench_db --tables=10 --table_size=1000000 --threads=1000 --time=300 --report-interval=10 --range_selects=off --point_selects=1 run
Additional information
No response
profile: mo_profile0329.tar.gz
第三次测试:(打开gc日志)
mo sysbench bulk_insert 1000threads 测试运行大概到380s时被oom kill,执行tps结果和第一次基本差不多
运行到380s时内存情况:
mo log: mo_log2.tar.gz
profile: mo_profile_bulk.tar.gz
last gc log: gc 58 @523.221s 2%: 0.25+4656+0.31 ms clock, 16+235/74447/169465+20 ms cpu, 232580->236203->147538 MB, 247313 MB goal, 18 MB stacks, 0 MB globals, 64 P
处理其他 s-1
处理其他 s-1
没有进展,在做其他issue
与 tn 相关占用巨大内存的地方有两个:
- dequeuePreparing -> MutIndex 建立内存索引
- HandleRowsInsert
初步看来主要原因是由于 flush 不及时导致,没有及时清理内存中的数据。
增快 flush 频率,可以避免 OOM
在做其他 issue
#https://github.com/matrixorigin/matrixone/pull/15580 merge 后,oom 应该可以避免 @heni02 后面会进一步减少 tn 的内存消耗
自测中
SQL statistics:
queries performed:
read: 0
write: 28595
other: 0
total: 28595
transactions: 917734260 (981824.22 per sec.)
queries: 28595 (30.59 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 934.7219s
total number of events: 917734260
Latency (ms):
min: 0.00
avg: 1.00
max: 55095.98
95th percentile: 0.00
sum: 913347614.78
Threads fairness:
events (avg/stddev): 917734.2600/34714.55
execution time (avg/stddev): 913.3476/9.71
自测 15分钟已经没有再出现 OOM.
@heni02
今晚回归测试验证
main commit:2eeef9246a17f1dfe7265f911976caa68626defc
job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9019462365/job/24803638728
昨晚回归测试该流程时段内存基本占满了,但没有oom
三分钟后内存释放完
三分钟期间,连接mo失败报错
mo log and profile plz contact me
在看其他issue
处理其他 issue
处理其他 issue
@w-zr please continue to investigate this issue
分布式环境也存在这个问题:
job url:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9211125528/job/25354293997
commit:e4c504a606585cb373a5bb7c452e804469716610
date5.26回归,bulk_insert导致dn oom
job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9244059586/job/25434572970
https://grafana.ci.matrixorigin.cn/d/cluster-detail-namespaced/cluster-detail-namespaced?orgId=1&var-namespace=mo-nightly-regression-20240526&var-account=All&var-interval=$__auto_interval_interval&var-cluster=.%2A&var-loki=loki&from=1716734826666&to=1716775847600
2次oom时间
profile: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22CU2%22:%7B%22datasource%22:%22pyroscope%22,%22queries%22:%5B%7B%22groupBy%22:%5B%5D,%22labelSelector%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240526%5C%22,pod%3D%5C%22nightly-regression-dis-dn-0%5C%22%7D%22,%22queryType%22:%22both%22,%22refId%22:%22A%22,%22profileTypeId%22:%22process_cpu:cpu:nanoseconds:cpu:nanoseconds%22,%22datasource%22:%7B%22type%22:%22grafana-pyroscope-datasource%22,%22uid%22:%22pyroscope%22%7D%7D%5D,%22range%22:%7B%22from%22:%221716771420000%22,%22to%22:%221716771480000%22%7D%7D%7D&schemaVersion=1&orgId=1
两次OOM前的内存情况。
正在尝试在129上复现
还未复现
目前在查其他内存相关问题
目前没有进展
现在问题的现象是oom,修改该问题单标题
目前没有进展
目前没有进展
目前没有进展
目前没有进展
目前没有进展