[enhancement](memory) Jemalloc performance optimization and compatibility with MemTracker
Proposed changes
Issue Number: close #xxx
Problem summary
- Jemalloc was previously compiled together with arrow; it is now compiled separately.
- Modify the default parameters of jemalloc to achieve better performance and lower memory usage. This will significantly improve multi-threading and high concurrency memory performance.
- Make jemalloc compatible with MemTracker, so that the query mem tracker value is consistent with the one reported under tcmalloc.
Test commit: Wed Aug 31, 1a198b3777ba0c8eb7c2ff31f226a816bcd4a472. It does not include #12436 (Optimize tcmalloc performance), so the tcmalloc results may be lower than with the latest code.
refer to: https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md https://jemalloc.net/jemalloc.3.html https://github.com/jemalloc/jemalloc/issues/1621 https://github.com/jemalloc/jemalloc/wiki/Getting-Started
Performance Verification Results
1. single mysql client sequential execution
Each value is the sum, over all queries, of the per-query average across multiple executions.
- Clickbench:
| allocator config | sum of time(s) | peak mem (MB) |
|---|---|---|
| tcmalloc | 210 | 11927 |
| jemalloc default conf | 203.17 | 20777 |
| jemalloc optimize conf | 194.45 | 13594 |
- SSB:
| allocator config | sum of time(s) | peak mem (MB) |
|---|---|---|
| tcmalloc | 15.845 | 832 |
| jemalloc default conf | 15.052 | 2920 |
| jemalloc optimize conf | 13.828 | 2091 |
According to the flame graph, when SQL is submitted from a single mysql client most of the time is not spent in memory allocation, so the performance improvement is smaller.
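To make the trade-off in the two tables above easier to read, the relative change against tcmalloc can be recomputed as below. This is only a sketch: the numbers are copied from the tables, while the helper name `relative_to_baseline` and the dictionary layout are illustrative and not part of this PR.

```python
# Values copied from the Clickbench and SSB tables: (sum of time in s, peak mem).
RESULTS = {
    "clickbench": {
        "tcmalloc": (210.0, 11927),
        "jemalloc default conf": (203.17, 20777),
        "jemalloc optimize conf": (194.45, 13594),
    },
    "ssb": {
        "tcmalloc": (15.845, 832),
        "jemalloc default conf": (15.052, 2920),
        "jemalloc optimize conf": (13.828, 2091),
    },
}


def relative_to_baseline(rows, baseline="tcmalloc"):
    """Return {config: (time change %, peak mem change %)} versus the baseline row."""
    base_time, base_mem = rows[baseline]
    changes = {}
    for name, (time_s, peak_mem) in rows.items():
        if name == baseline:
            continue
        changes[name] = (
            round((time_s - base_time) / base_time * 100, 1),
            round((peak_mem - base_mem) / base_mem * 100, 1),
        )
    return changes


if __name__ == "__main__":
    for bench, rows in RESULTS.items():
        for name, (d_time, d_mem) in relative_to_baseline(rows).items():
            print(f"{bench:10s} {name:24s} time {d_time:+.1f}%  peak mem {d_mem:+.1f}%")
```

A negative time change means the configuration is faster than tcmalloc; a positive memory change means it peaks higher.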
2. jmeter stress test, disable page cache and chunk allocator and mem pool
Results are taken from the second stress test run of each SQL: jemalloc has a cold start, so the first run populates its caches more aggressively, and runs become faster from the second one on.
- Clickbench, only q13 + q14
| allocator config | sum of time(s) |
|---|---|
| tcmalloc | 82611 |
| jemalloc default conf | 72688 |
| jemalloc optimize conf | 42506 |
The performance of jemalloc is nearly doubled.
- SSB, only q1.1 - q3.4
| allocator config | sum of time(s) |
|---|---|
| tcmalloc | 76760 |
| jemalloc default conf | 73326 |
| jemalloc optimize conf | 57297 |
the performance of jemalloc is improved by 25%.
3. jmeter stress test, enable page cache and chunk allocator and mem pool (default conf)
- Clickbench, only q13 + q14
| allocator config | sum of time(s) |
|---|---|
| tcmalloc | 73616 |
| jemalloc default conf | 62220 |
| jemalloc optimize conf | 39565 |
the performance of jemalloc is improved by 46%.
- SSB, only q1.1 - q3.4
| allocator config | sum of time(s) |
|---|---|
| tcmalloc | 53709 |
| jemalloc default conf | 47790 |
| jemalloc optimize conf | 43297 |
the performance of jemalloc is improved by 19%.
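The percentages quoted for stress tests 2 and 3 can be checked against the tables with a small helper. This is a sketch for the reader: `improvement` is a hypothetical name, and it reports both the reduction in summed time and the equivalent speedup factor, since "doubled" and "improved by N%" describe the same data on two different scales.

```python
def improvement(baseline, candidate):
    """Return (time reduction in %, speedup factor) of candidate versus baseline."""
    if baseline <= 0 or candidate <= 0:
        raise ValueError("timings must be positive")
    reduction_pct = (baseline - candidate) / baseline * 100
    speedup = baseline / candidate
    return round(reduction_pct, 1), round(speedup, 2)


# (tcmalloc, jemalloc optimize conf) pairs copied from the tables above.
CASES = {
    "test 2, Clickbench q13 + q14": (82611, 42506),
    "test 2, SSB q1.1 - q3.4": (76760, 57297),
    "test 3, Clickbench q13 + q14": (73616, 39565),
    "test 3, SSB q1.1 - q3.4": (53709, 43297),
}

if __name__ == "__main__":
    for label, (tcmalloc_time, jemalloc_time) in CASES.items():
        pct, factor = improvement(tcmalloc_time, jemalloc_time)
        print(f"{label}: {pct}% less time, {factor}x speedup")
```

Read this way, "doubled" in test 2 corresponds to a speedup factor just under 2, while the 25%, 46% and 19% figures are reductions in summed time.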
4. jmeter stress test, disable page cache, enable chunk allocator and mem pool
- Clickbench, only q13 + q14
| allocator config | sum of time(s) |
|---|---|
| tcmalloc | 74949 |
| jemalloc default conf | 61335 |
| jemalloc optimize conf | TODO |
5. jmeter stress test, v1.1.1 vs master
sql:
with e as (select b.Title,b.measure1,b.measure2 from (select a.Title, sum(case when a.PageCharset = 'windows-1251;charset' then 1 else 0 end) as measure1,sum(case when a.RefererHash in('-296158784638538920', '-6389909303817027441') then cast(a.JavaEnable AS Double) else 0 end) as measure2 from hits a where CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND Title <> '' group by 1) b) , f as (select avg(e.measure2) as avg1,avg(e.measure1) as avg2,var_samp(e.measure2) as variance1,var_samp(e.measure1) as variance2 from e),g as (select sum((e.measure1-f.avg2)*(e.measure2-f.avg1)) as covariance from e,f) select f.avg2,f.variance1,f.variance2,g.covariance as covariance from f,g;
v1.1.1:
| allocator config | jmeter thread=10 | jmeter thread=20 | jmeter thread=30 |
|---|---|---|---|
| tcmalloc | 1095 | 1818 | 2870 |
| jemalloc default conf | 581 | 1171 | 1802 |
master:
| allocator config | jmeter thread=8 |
|---|---|
| tcmalloc | 573 |
| jemalloc default conf | 502 |
- On v1.1.1, jemalloc improves performance by 55% to nearly double. The more jmeter threads there are, the larger the share of bottlenecks outside memory allocation, and the smaller the improvement.
- On master, jemalloc improves performance by 20%; the gain is smaller because master already does a lot of memory reuse.
6. jmeter stress test, try replacing ChunkAllocator with jemalloc in vec
- Clickbench, only q13 + q14
| allocator config | sum of time(s) |
|---|---|
| jemalloc lg_tcache_max:16 + ChunkAllocator | 39565 |
| jemalloc lg_tcache_max:26 | 40158 |
With jemalloc serving size < 4K and ChunkAllocator serving 4K < size < 64M, performance is still slightly better than with jemalloc alone. TODO: more testing and tuning, with the aim of replacing ChunkAllocator.
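The size split described above can be sketched as a simple dispatch rule. This is an illustration only, not the Doris code path: `pick_allocator` and the returned labels are hypothetical, and the handling of a request of exactly 4K, or of 64M and above, is an assumption because the text does not specify it. The sketch also shows that `lg_tcache_max` is a base-2 logarithm, so `lg_tcache_max:26` lets jemalloc's thread cache cover sizes up to 64M, the same upper bound as the ChunkAllocator range.

```python
KIB = 1024
MIB = 1024 * KIB

SMALL_LIMIT = 4 * KIB    # below this, the text routes requests to jemalloc
CHUNK_LIMIT = 64 * MIB   # between 4K and this, the text routes to ChunkAllocator


def tcache_max_bytes(lg_tcache_max):
    """Convert jemalloc's lg_tcache_max (log base 2) into a size in bytes."""
    return 1 << lg_tcache_max


def pick_allocator(size):
    """Return a label for the allocator that would serve a request of `size` bytes."""
    if size < 0:
        raise ValueError("size must be non-negative")
    if size < SMALL_LIMIT:
        return "jemalloc"
    if size < CHUNK_LIMIT:
        # Assumption: a request of exactly 4K is treated as part of this range.
        return "chunk_allocator"
    # Assumption: the text does not say what serves requests of 64M and above.
    return "unspecified"


if __name__ == "__main__":
    print(tcache_max_bytes(16) // KIB, "KiB tcache limit with lg_tcache_max:16")
    print(tcache_max_bytes(26) // MIB, "MiB tcache limit with lg_tcache_max:26")
    for size in (512, 8 * KIB, 32 * MIB):
        print(size, "->", pick_allocator(size))
```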
Performance Verification Reproduce
1. single mysql client sequential execution
be.conf add
`enable_tcmalloc_hook=false`
`disable_storage_page_cache=true`
`disable_mem_pools=true`
`chunk_reserved_bytes_limit=1`
- Clickbench
vim tools/clickbench-tools/run-clickbench-queries.sh
pre_set "set global parallel_fragment_exec_instance_num=1;"
pre_set "set global exec_mem_limit=20G;"
sh tools/clickbench-tools/run-clickbench-queries.sh
- SSB
vim tools/ssb-tools/bin/run-ssb-queries.sh
pre_set "set global parallel_fragment_exec_instance_num=1;"
sh tools/ssb-tools/bin/run-ssb-queries.sh
2. jmeter stress test, disable page cache and chunk allocator and mem pool
set global parallel_fragment_exec_instance_num=1;
be.conf add
`enable_tcmalloc_hook=false`
`disable_storage_page_cache=true`
`disable_mem_pools=true`
`chunk_reserved_bytes_limit=1`
jmeter conf
- Clickbench
<stringProp name="ThreadGroup.num_threads">30</stringProp>
<stringProp name="ThreadGroup.ramp_time">1</stringProp>
<boolProp name="ThreadGroup.scheduler">true</boolProp>
<stringProp name="ThreadGroup.duration">100</stringProp>
<stringProp name="ThreadGroup.delay">0</stringProp>
- SSB
<stringProp name="ThreadGroup.num_threads">10</stringProp>
<stringProp name="ThreadGroup.ramp_time">1</stringProp>
<boolProp name="ThreadGroup.scheduler">true</boolProp>
<stringProp name="ThreadGroup.duration">30</stringProp>
<stringProp name="ThreadGroup.delay">0</stringProp>
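The two jmeter configurations above differ only in thread count and duration, so the ThreadGroup properties can be generated rather than edited by hand. A minimal sketch using the Python standard library follows; `thread_group_props` is a hypothetical helper, and it emits only the five properties listed here, not a complete .jmx test plan.

```python
import xml.etree.ElementTree as ET


def thread_group_props(num_threads, duration_s, ramp_time_s=1, delay_s=0):
    """Render the ThreadGroup properties shown above as XML lines."""
    props = [
        ("stringProp", "ThreadGroup.num_threads", str(num_threads)),
        ("stringProp", "ThreadGroup.ramp_time", str(ramp_time_s)),
        ("boolProp", "ThreadGroup.scheduler", "true"),
        ("stringProp", "ThreadGroup.duration", str(duration_s)),
        ("stringProp", "ThreadGroup.delay", str(delay_s)),
    ]
    lines = []
    for tag, name, value in props:
        element = ET.Element(tag, {"name": name})
        element.text = value
        lines.append(ET.tostring(element, encoding="unicode"))
    return lines


if __name__ == "__main__":
    print("Clickbench:")
    print("\n".join(thread_group_props(num_threads=30, duration_s=100)))
    print("SSB:")
    print("\n".join(thread_group_props(num_threads=10, duration_s=30)))
```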
3. jmeter stress test, enable page cache and chunk allocator and mem pool
set global parallel_fragment_exec_instance_num=1;
be.conf add
`enable_tcmalloc_hook=false`
4. jmeter stress test, disable page cache, enable chunk allocator and mem pool
set global parallel_fragment_exec_instance_num=1;
be.conf add
`enable_tcmalloc_hook=false`
`disable_storage_page_cache=true`
5. jmeter stress test, v1.1.1 vs master
set global parallel_fragment_exec_instance_num=1;
be.conf add
`enable_tcmalloc_hook=false`
`disable_storage_page_cache=true`
`disable_mem_pools=true`
`chunk_reserved_bytes_limit=1`
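The five reproduce scenarios differ only in which overrides are added to be.conf, which the sketch below summarizes. The option names and values are the ones listed above; the grouping constants and the `be_conf_fragment` helper are illustrative. The text does not say which of `disable_mem_pools` and `chunk_reserved_bytes_limit` controls the mem pool and which the chunk allocator, so the two are kept together here.

```python
BASE = ["enable_tcmalloc_hook=false"]
DISABLE_PAGE_CACHE = ["disable_storage_page_cache=true"]
# Kept as one group: together they turn off the chunk allocator and mem pool.
DISABLE_POOL_AND_CHUNK_ALLOCATOR = [
    "disable_mem_pools=true",
    "chunk_reserved_bytes_limit=1",
]

# Scenario number -> be.conf lines to add, following the numbered list above.
SCENARIOS = {
    1: BASE + DISABLE_PAGE_CACHE + DISABLE_POOL_AND_CHUNK_ALLOCATOR,
    2: BASE + DISABLE_PAGE_CACHE + DISABLE_POOL_AND_CHUNK_ALLOCATOR,
    3: BASE,
    4: BASE + DISABLE_PAGE_CACHE,
    5: BASE + DISABLE_PAGE_CACHE + DISABLE_POOL_AND_CHUNK_ALLOCATOR,
}


def be_conf_fragment(scenario):
    """Return the text to append to be.conf for the given scenario number."""
    if scenario not in SCENARIOS:
        raise KeyError(f"unknown scenario: {scenario}")
    return "\n".join(SCENARIOS[scenario]) + "\n"


if __name__ == "__main__":
    for number in sorted(SCENARIOS):
        print(f"# scenario {number}")
        print(be_conf_fragment(number))
```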
Performance Verification Data
Checklist(Required)
- Does it affect the original behavior:
- [x] Yes
- [ ] No
- [ ] I don't know
- Has unit tests been added:
- [ ] Yes
- [ ] No
- [x] No Need
- Has document been added or modified:
- [ ] Yes
- [x] No
- [ ] No Need
- Does it need to update dependencies:
- [x] Yes
- [ ] No
- Are there any changes that cannot be rolled back:
- [ ] Yes (If Yes, please explain WHY)
- [x] No
Both ClickHouse and PingCAP have switched to jemalloc. https://github.com/ClickHouse/ClickHouse/pull/2773 https://github.com/pingcap/tiflash/pull/424
I will refer to more later