
[enhancement](memory) Jemalloc performance optimization and compatibility with MemTracker

Open xinyiZzz opened this issue 2 years ago • 1 comments

Proposed changes

Issue Number: close #xxx

Problem summary

  1. Previously jemalloc was compiled together with arrow; it is now compiled separately.
  2. Modify the default parameters of jemalloc to achieve better performance and lower memory usage. This significantly improves memory performance under multi-threading and high concurrency.
  3. Make jemalloc compatible with MemTracker, so that the query mem tracker value is consistent with the value under tcmalloc.

Test commit: 1a198b3777ba0c8eb7c2ff31f226a816bcd4a472 (Wed Aug 31). It does not include #12436 (Optimize tcmalloc performance), so the tcmalloc results may be worse than with the latest code.

Refer to:

    • https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md
    • https://jemalloc.net/jemalloc.3.html
    • https://github.com/jemalloc/jemalloc/issues/1621
    • https://github.com/jemalloc/jemalloc/wiki/Getting-Started

Performance Verification Results

1. Single mysql client, sequential execution

Each value is the sum, over all queries, of the average time across multiple executions of each query.

  1. Clickbench:

| | sum of time(s) | peak mem(M) |
| --- | --- | --- |
| tcmalloc | 210 | 11927 |
| jemalloc default conf | 203.17 | 20777 |
| jemalloc optimize conf | 194.45 | 13594 |

  2. SSB:

| | sum of time(s) | peak mem(M) |
| --- | --- | --- |
| tcmalloc | 15.845 | 832 |
| jemalloc default conf | 15.052 | 2920 |
| jemalloc optimize conf | 13.828 | 2091 |

According to the flame graph, when SQL is submitted from a single mysql client most of the time is not spent on memory operations, so the performance improvement is smaller.
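The trade-off in the two tables above is easier to see as ratios against the tcmalloc row. The following is a minimal sketch that only re-derives numbers from the tables; the function name and output layout are illustrative and not part of the PR.

```python
# Totals copied from the single-client tables: (sum of time in s, peak mem in M).
RESULTS = {
    "Clickbench": {
        "tcmalloc": (210.0, 11927),
        "jemalloc default conf": (203.17, 20777),
        "jemalloc optimize conf": (194.45, 13594),
    },
    "SSB": {
        "tcmalloc": (15.845, 832),
        "jemalloc default conf": (15.052, 2920),
        "jemalloc optimize conf": (13.828, 2091),
    },
}


def relative_to_tcmalloc(benchmark):
    """Return {config: (time change in %, peak mem ratio)} against the tcmalloc row."""
    base_time, base_mem = RESULTS[benchmark]["tcmalloc"]
    out = {}
    for name, (time_s, mem_m) in RESULTS[benchmark].items():
        time_change = (time_s - base_time) / base_time * 100.0
        out[name] = (round(time_change, 1), round(mem_m / base_mem, 2))
    return out


for bench in RESULTS:
    for name, (dt, mem) in relative_to_tcmalloc(bench).items():
        print(f"{bench:10s} {name:24s} time {dt:+6.1f}%  peak mem x{mem:.2f}")
```

Read this way, the optimized configuration is faster than tcmalloc in both benchmarks and uses less peak memory than the jemalloc defaults, while its peak memory stays above that of tcmalloc.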

2. jmeter stress test, disable page cache and chunk allocator and mem pool

For each SQL, the result of the second stress test run is used. jemalloc has a cold start: it caches more aggressively during the first run, and only from the second run does it start to get faster.

  1. Clickbench, only q13 + q14

| | sum of time(s) |
| --- | --- |
| tcmalloc | 82611 |
| jemalloc default conf | 72688 |
| jemalloc optimize conf | 42506 |

The performance of jemalloc is doubled.

  2. SSB, only q1.1 - q3.4

| | sum of time(s) |
| --- | --- |
| tcmalloc | 76760 |
| jemalloc default conf | 73326 |
| jemalloc optimize conf | 57297 |

The performance of jemalloc is improved by 25%.

3. jmeter stress test, enable page cache and chunk allocator and mem pool (default conf)

  1. Clickbench, only q13 + q14

| | sum of time(s) |
| --- | --- |
| tcmalloc | 73616 |
| jemalloc default conf | 62220 |
| jemalloc optimize conf | 39565 |

The performance of jemalloc is improved by 46%.

  2. SSB, only q1.1 - q3.4

| | sum of time(s) |
| --- | --- |
| tcmalloc | 53709 |
| jemalloc default conf | 47790 |
| jemalloc optimize conf | 43297 |

The performance of jemalloc is improved by 19%.
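The improvement figures quoted for stress tests 2 and 3 are not defined explicitly. A plausible reading, assumed here, is the reduction in total time of "jemalloc optimize conf" relative to tcmalloc. The sketch below recomputes them from the totals reported in those tests; the helper names are illustrative.

```python
def time_reduction_pct(baseline, candidate):
    """Percentage by which the candidate's total time is lower than the baseline's."""
    return (baseline - candidate) / baseline * 100.0


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


# (tcmalloc total, jemalloc optimize conf total), copied from stress tests 2 and 3.
RUNS = {
    "test 2, Clickbench q13 + q14": (82611, 42506),
    "test 2, SSB q1.1 - q3.4": (76760, 57297),
    "test 3, Clickbench q13 + q14": (73616, 39565),
    "test 3, SSB q1.1 - q3.4": (53709, 43297),
}

for label, (tc, je) in RUNS.items():
    print(f"{label:30s} -{time_reduction_pct(tc, je):.1f}% time, x{speedup(tc, je):.2f}")
```

Under this reading the 25%, 46% and 19% figures are time reductions, and "doubled" for the first Clickbench run corresponds to a speedup of about 1.94x.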

4. jmeter stress test, disable page cache, enable chunk allocator and mem pool

  1. Clickbench, only q13 + q14

| | sum of time(s) |
| --- | --- |
| tcmalloc | 74949 |
| jemalloc default conf | 61335 |
| jemalloc optimize conf | TODO |

5. jmeter stress test, v1.1.1 vs master

sql:

    with e as (
        select b.Title, b.measure1, b.measure2
        from (
            select a.Title,
                   sum(case when a.PageCharset = 'windows-1251;charset' then 1 else 0 end) as measure1,
                   sum(case when a.RefererHash in ('-296158784638538920', '-6389909303817027441') then cast(a.JavaEnable AS Double) else 0 end) as measure2
            from hits a
            where CounterID = 62
              AND EventDate >= '2013-07-01'
              AND EventDate <= '2013-07-31'
              AND DontCountHits = 0
              AND IsRefresh = 0
              AND Title <> ''
            group by 1
        ) b
    ),
    f as (
        select avg(e.measure2) as avg1,
               avg(e.measure1) as avg2,
               var_samp(e.measure2) as variance1,
               var_samp(e.measure1) as variance2
        from e
    ),
    g as (
        select sum((e.measure1 - f.avg2) * (e.measure2 - f.avg1)) as covariance
        from e, f
    )
    select f.avg2, f.variance1, f.variance2, g.covariance as covariance
    from f, g;

v1.1.1:

| | jmeter thread=10 | jmeter thread=20 | jmeter thread=30 |
| --- | --- | --- | --- |
| tcmalloc | 1095 | 1818 | 2870 |
| jemalloc default conf | 581 | 1171 | 1802 |

master:

| | jmeter thread=8 |
| --- | --- |
| tcmalloc | 573 |
| jemalloc default conf | 502 |

  1. On v1.1.1, jemalloc improves performance by 55% to 2x. The more jmeter threads there are, the greater the bottleneck outside of memory, and the smaller the performance improvement.
  2. On master, jemalloc improves performance by 20%; the gain is smaller because master already does a lot of memory reuse.

6. jmeter stress test, try replacing ChunkAllocator with jemalloc in vec

  1. Clickbench, only q13 + q14

| | sum of time(s) |
| --- | --- |
| jemalloc lg_tcache_max:16 + ChunkAllocator | 39565 |
| jemalloc lg_tcache_max:26 | 40158 |

With jemalloc serving sizes < 4K and ChunkAllocator serving 4K < size < 64M, performance is still slightly better than with jemalloc alone. TODO: more testing and tuning, with the goal of replacing ChunkAllocator.
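As background for the two rows above: according to the jemalloc manual, `lg_tcache_max` is the base-2 logarithm of the largest size class cached in the thread cache. The small sketch below converts the two values used in this test into byte sizes; the helper names are illustrative.

```python
def tcache_max_bytes(lg_tcache_max):
    """Largest size class (in bytes) held in the thread cache for a given lg_tcache_max."""
    return 1 << lg_tcache_max


def human(n):
    """Format a byte count that is an exact multiple of KiB or MiB."""
    for unit, size in (("MiB", 1 << 20), ("KiB", 1 << 10)):
        if n >= size and n % size == 0:
            return f"{n // size} {unit}"
    return f"{n} B"


for lg in (16, 26):
    print(f"lg_tcache_max:{lg} -> {human(tcache_max_bytes(lg))}")
```

So `lg_tcache_max:26` raises the thread-cache limit to 64 MiB, which lines up with the 64M upper bound of the range that ChunkAllocator serves in the other configuration.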

Performance Verification Reproduce

1. Single mysql client, sequential execution

Add to be.conf:

          `enable_tcmalloc_hook=false`
          `disable_storage_page_cache=true`
          `disable_mem_pools=true`
          `chunk_reserved_bytes_limit=1`
  1. Clickbench
vim tools/clickbench-tools/run-clickbench-queries.sh
    pre_set "set global parallel_fragment_exec_instance_num=1;"
    pre_set "set global exec_mem_limit=20G;"
sh tools/clickbench-tools/run-clickbench-queries.sh
  2. SSB
vim tools/ssb-tools/bin/run-ssb-queries.sh
    pre_set "set global parallel_fragment_exec_instance_num=1;"
sh tools/ssb-tools/bin/run-ssb-queries.sh

2. jmeter stress test, disable page cache and chunk allocator and mem pool

set global parallel_fragment_exec_instance_num=1;

Add to be.conf:

          `enable_tcmalloc_hook=false`
          `disable_storage_page_cache=true`
          `disable_mem_pools=true`
          `chunk_reserved_bytes_limit=1`

jmeter conf

  1. Clickbench
        <stringProp name="ThreadGroup.num_threads">30</stringProp>
        <stringProp name="ThreadGroup.ramp_time">1</stringProp>
        <boolProp name="ThreadGroup.scheduler">true</boolProp>
        <stringProp name="ThreadGroup.duration">100</stringProp>
        <stringProp name="ThreadGroup.delay">0</stringProp>
  2. SSB
        <stringProp name="ThreadGroup.num_threads">10</stringProp>
        <stringProp name="ThreadGroup.ramp_time">1</stringProp>
        <boolProp name="ThreadGroup.scheduler">true</boolProp>
        <stringProp name="ThreadGroup.duration">30</stringProp>
        <stringProp name="ThreadGroup.delay">0</stringProp>
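The two thread-group fragments above differ only in thread count and duration. As a convenience sketch (not part of the PR), they can be generated with Python's standard `xml.etree.ElementTree`; the function name and its defaults are assumptions, and the output still has to be placed inside a complete JMeter test plan.

```python
import xml.etree.ElementTree as ET


def thread_group_props(num_threads, duration_s, ramp_time_s=1, delay_s=0):
    """Return the ThreadGroup property lines used in the stress tests as XML strings."""
    props = [
        ("stringProp", "ThreadGroup.num_threads", str(num_threads)),
        ("stringProp", "ThreadGroup.ramp_time", str(ramp_time_s)),
        ("boolProp", "ThreadGroup.scheduler", "true"),
        ("stringProp", "ThreadGroup.duration", str(duration_s)),
        ("stringProp", "ThreadGroup.delay", str(delay_s)),
    ]
    lines = []
    for tag, name, value in props:
        element = ET.Element(tag, {"name": name})
        element.text = value
        lines.append(ET.tostring(element, encoding="unicode"))
    return lines


# Clickbench: 30 threads for 100 s; SSB: 10 threads for 30 s (values from the conf above).
for label, threads, duration in (("Clickbench", 30, 100), ("SSB", 10, 30)):
    print(label)
    for line in thread_group_props(threads, duration):
        print("    " + line)
```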

3. jmeter stress test, enable page cache and chunk allocator and mem pool

set global parallel_fragment_exec_instance_num=1;

Add to be.conf:

          `enable_tcmalloc_hook=false`

4. jmeter stress test, disable page cache, enable chunk allocator and mem pool

set global parallel_fragment_exec_instance_num=1;

Add to be.conf:

          `enable_tcmalloc_hook=false`
          `disable_storage_page_cache=true`

5. jmeter stress test, v1.1.1 vs master

set global parallel_fragment_exec_instance_num=1;

Add to be.conf:

          `enable_tcmalloc_hook=false`
          `disable_storage_page_cache=true`
          `disable_mem_pools=true`
          `chunk_reserved_bytes_limit=1`
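The five reproduce steps differ only in which be.conf lines they add. The sketch below collects those overrides in one place so the scenarios can be compared side by side; the dictionary layout and function name are illustrative, and the keys and values are copied from the steps above.

```python
# be.conf overrides per reproduce step, grouped by what they switch off.
COMMON = {"enable_tcmalloc_hook": "false"}
NO_PAGE_CACHE = {"disable_storage_page_cache": "true"}
NO_POOLS = {"disable_mem_pools": "true", "chunk_reserved_bytes_limit": "1"}

SCENARIOS = {
    1: {**COMMON, **NO_PAGE_CACHE, **NO_POOLS},  # single mysql client
    2: {**COMMON, **NO_PAGE_CACHE, **NO_POOLS},  # page cache, chunk allocator, mem pool off
    3: {**COMMON},                               # everything left enabled
    4: {**COMMON, **NO_PAGE_CACHE},              # only page cache off
    5: {**COMMON, **NO_PAGE_CACHE, **NO_POOLS},  # v1.1.1 vs master
}


def be_conf_lines(scenario):
    """Return the key=value lines to add to be.conf for a reproduce step."""
    return [f"{key}={value}" for key, value in SCENARIOS[scenario].items()]


for step in sorted(SCENARIOS):
    print(f"step {step}: " + ", ".join(be_conf_lines(step)))
```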

Performance Verification Data

tcmalloc_vs_jemalloc.zip

Checklist(Required)

  1. Does it affect the original behavior:
    • [x] Yes
    • [ ] No
    • [ ] I don't know
  2. Has unit tests been added:
    • [ ] Yes
    • [ ] No
    • [x] No Need
  3. Has document been added or modified:
    • [ ] Yes
    • [x] No
    • [ ] No Need
  4. Does it need to update dependencies:
    • [x] Yes
    • [ ] No
  5. Are there any changes that cannot be rolled back:
    • [ ] Yes (If Yes, please explain WHY)
    • [x] No

xinyiZzz, Sep 08 '22 20:09

Both ClickHouse and PingCAP have switched to jemalloc: https://github.com/ClickHouse/ClickHouse/pull/2773 https://github.com/pingcap/tiflash/pull/424

yiguolei, Sep 23 '22 03:09

Both ClickHouse and PingCAP have switched to jemalloc. ClickHouse/ClickHouse#2773 pingcap/tiflash#424

I will look into more references later.

xinyiZzz, Sep 26 '22 16:09