Avoid frequent memory allocation/deallocation by memory pool

Open jserv opened this issue 7 years ago • 16 comments

The current PoW internals call malloc and free frequently, which hurts performance. Using a memory pool is a common technique to speed up allocation and keep execution time consistent.

I have a preliminary memory pool implementation: https://github.com/jserv/dcurl/tree/memory-pool NOTE: we may have to deal with thread-safety issues, and we should check out existing implementations such as philip-wernersbach/memory-pool-allocator.
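
For illustration only, here is a minimal sketch of a fixed-size block pool in C. It is not the code from the memory-pool branch. The names pool_t, pool_init, pool_alloc and pool_free, as well as the block size and count, are hypothetical. A mutex guards the free list as one simple way to address the thread-safety concern; a per-thread pool would avoid the lock entirely.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define POOL_BLOCK_SIZE 256 /* hypothetical: bytes per block */
#define POOL_BLOCK_COUNT 64 /* hypothetical: number of blocks */

/* A block is either linked into the free list or handed out as payload. */
typedef union pool_block {
    union pool_block *next;              /* valid while the block is free */
    unsigned char data[POOL_BLOCK_SIZE]; /* valid while the block is in use */
    max_align_t align;                   /* keeps every block suitably aligned */
} pool_block_t;

typedef struct {
    pool_block_t blocks[POOL_BLOCK_COUNT]; /* storage reserved up front */
    pool_block_t *free_list;               /* singly linked list of free blocks */
    pthread_mutex_t lock;                  /* protects free_list */
} pool_t;

void pool_init(pool_t *p)
{
    assert(p != NULL);
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCK_COUNT; i++) {
        p->blocks[i].next = p->free_list;
        p->free_list = &p->blocks[i];
    }
    pthread_mutex_init(&p->lock, NULL);
}

/* Returns one block, or NULL if the request is too large or the pool is empty. */
void *pool_alloc(pool_t *p, size_t size)
{
    if (size > POOL_BLOCK_SIZE)
        return NULL;
    pthread_mutex_lock(&p->lock);
    pool_block_t *b = p->free_list;
    if (b)
        p->free_list = b->next;
    pthread_mutex_unlock(&p->lock);
    return b;
}

/* Pushes a block obtained from pool_alloc back onto the free list. */
void pool_free(pool_t *p, void *ptr)
{
    if (!ptr)
        return;
    pool_block_t *b = ptr;
    pthread_mutex_lock(&p->lock);
    b->next = p->free_list;
    p->free_list = b;
    pthread_mutex_unlock(&p->lock);
}
```

Both pool_alloc and pool_free do a constant amount of work, which is what makes the timing more predictable than a general-purpose allocator. The cost is that every request consumes a full block regardless of its size.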

jserv avatar Mar 18 '18 20:03 jserv

After applying enable-rdtsc.patch, I got the following time-stamp numbers:

*** Validating build/test_trinary ***
=== trits_from_trytes: 42320 ===
=== trytes_from_trits: 5103 ===

*** Validating build/test_curl ***
=== trits_from_trytes: 220208 ===
=== trytes_from_trits: 3171 ===

*** Validating build/test_pow_sse ***
=== trits_from_trytes: 76903 ===
=== trits_from_trytes: 32099 ===
=== trytes_from_trits: 2245 ===
=== trits_from_trytes: 33132 ===
=== trytes_from_trits: 2674 ===
=== trits_from_trytes: 2651 ===

jserv avatar Mar 18 '18 22:03 jserv

To illustrate the impact of the memory allocator, TCMalloc is used for comparison. The test environment is an Intel Xeon E5 class server running Ubuntu Linux 17.04.

First, prepare TCMalloc: $ sudo apt install libtcmalloc-minimal4.

  • without TCMalloc
$ make check
*** Validating build/test_trinary ***
=== trits_from_trytes: 5460 ===
=== trytes_from_trits: 4286 ===

*** Validating build/test_curl ***
=== trits_from_trytes: 120820 ===
=== trytes_from_trits: 3940 ===

*** Validating build/test_pow_sse ***
=== trits_from_trytes: 68535 ===
=== trits_from_trytes: 61490 ===
=== trytes_from_trits: 1221 ===
=== trits_from_trytes: 31277 ===
=== trytes_from_trits: 1617 ===
=== trits_from_trytes: 2668 ===
  • with TCMalloc
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 make check
*** Validating build/test_trinary ***
=== trits_from_trytes: 29244 ===
=== trytes_from_trits: 3920 ===

*** Validating build/test_curl ***
=== trits_from_trytes: 108290 ===
=== trytes_from_trits: 2940 ===

*** Validating build/test_pow_sse ***
=== trits_from_trytes: 79500 ===
=== trits_from_trytes: 57566 ===
=== trytes_from_trits: 3028 ===
=== trits_from_trytes: 31777 ===
=== trytes_from_trits: 1654 ===
=== trits_from_trytes: 1980 ===

dcurl would benefit from a pre-allocated memory pool, especially when its size is tuned for the trytes/trits representations.

jserv avatar Mar 19 '18 12:03 jserv

It is worth mentioning that the amount of heap memory each PoW task allocates is fixed. Maybe we can implement a special memory pool for allocating trytes and trits.

furuame avatar Mar 19 '18 19:03 furuame

In our scenario, the memory each PoW task (thread) uses is "fixed" and the variables can be reused. I think we can declare all the variables in advance rather than allocating them from a memory pool every time.
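
As a rough sketch of what "declare all the variables in advance" could look like, the following C fragment keeps a fixed-size trit buffer inside a per-task workspace and converts trytes into it without calling malloc. Everything here is hypothetical and not dcurl's API: pow_workspace_t, ws_trits_from_trytes and WS_MAX_TRYTES are made-up names, and the buffer size and tryte alphabet are assumptions about the usual IOTA encoding (2673 trytes per transaction, balanced ternary, least significant trit first).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WS_MAX_TRYTES 2673 /* assumed: trytes in one transaction */
#define WS_MAX_TRITS (WS_MAX_TRYTES * 3)

/* One instance per PoW task (thread); reused across runs, never freed. */
typedef struct {
    int8_t trits[WS_MAX_TRITS];
} pow_workspace_t;

/* Assumed tryte alphabet: '9' is 0, 'A'..'M' are 1..13, 'N'..'Z' are -13..-1. */
static const char TRYTE_ALPHABET[] = "9ABCDEFGHIJKLMNOPQRSTUVWXYZ";

/* Converts len trytes into ws->trits.
 * Returns the number of trits written, or -1 on oversized or invalid input. */
int ws_trits_from_trytes(pow_workspace_t *ws, const char *trytes, size_t len)
{
    assert(ws != NULL && trytes != NULL);
    if (len > WS_MAX_TRYTES)
        return -1;
    for (size_t i = 0; i < len; i++) {
        /* Reject NUL explicitly: strchr would match the terminator. */
        const char *pos = trytes[i] ? strchr(TRYTE_ALPHABET, trytes[i]) : NULL;
        if (!pos)
            return -1;
        int v = (int) (pos - TRYTE_ALPHABET);
        if (v > 13)
            v -= 27; /* map 14..26 to -13..-1 */
        for (int j = 0; j < 3; j++) {
            int r = ((v % 3) + 3) % 3;    /* remainder in 0..2 */
            int trit = (r == 2) ? -1 : r; /* balanced digit in -1..1 */
            ws->trits[i * 3 + j] = (int8_t) trit;
            v = (v - trit) / 3;
        }
    }
    return (int) (len * 3);
}
```

Each thread would own one pow_workspace_t (for example as a static or thread-local object), so no synchronization is needed and the allocation cost disappears from the hot path.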

furuame avatar Aug 07 '18 15:08 furuame

The heaptrack tool reports information about dynamic memory allocation, such as:

  • allocation times
  • allocation bytes
  • memory leak

[Screenshot: 2018-10-12 10-08-27]

This information gives us a blueprint for the memory pool design and helps us determine the size of the memory pool.

marktwtn avatar Oct 12 '18 02:10 marktwtn

Dynamic memory allocation tends to be non-deterministic. Is it possible to eliminate the existing dynamic allocation inside dcurl?

jserv avatar Oct 13 '18 15:10 jserv

Dynamic memory allocation tends to be non-deterministic. Is it possible to eliminate the existing dynamic allocation inside dcurl?

Yes, we can reduce the dynamic allocation to a single call, or even use a statically declared char array as the memory pool.
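
A minimal sketch of the "declared char array as a memory pool" idea, assuming a simple bump allocator that is reset after each PoW run. The names arena_buf, arena_alloc and arena_reset and the 64 KiB size are hypothetical. This version is not thread-safe; it assumes one arena per thread or external locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (64 * 1024) /* hypothetical pool size in bytes */

/* Statically declared backing store; no malloc is ever called. */
static _Alignas(max_align_t) unsigned char arena_buf[ARENA_SIZE];
static size_t arena_used; /* offset of the first unused byte */

/* Hands out the next size bytes, aligned for any object type.
 * Returns NULL for a zero-size request or when the arena is exhausted. */
void *arena_alloc(size_t size)
{
    const size_t align = _Alignof(max_align_t);
    size_t start = (arena_used + align - 1) & ~(align - 1);
    if (size == 0 || start > ARENA_SIZE || size > ARENA_SIZE - start)
        return NULL;
    arena_used = start + size;
    return &arena_buf[start];
}

/* Releases everything at once, e.g. after one PoW task completes. */
void arena_reset(void)
{
    arena_used = 0;
}
```

Individual frees are not supported in this scheme. That fits a workload where every buffer lives for the duration of one PoW task, but it would not suit allocations with mixed lifetimes.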

marktwtn avatar Oct 15 '18 03:10 marktwtn

I have implemented a memory pool mechanism and integrated it into the SSE implementation of dcurl.

Here are the problems:

  1. Experiment result: I ran test-pow, executing the PoW 100 times. The execution time shows little difference. The time stamp difference for allocating memory in the trits_from_trytes and trytes_from_trits functions may even be worse.

    To address this: (1) use perf or gprof to analyze the memory pool code and improve its performance; (2) run the program multiple times to see the execution time distribution.

  2. Allocation size: Take SSE as an example. Most allocation sizes are fixed. However, some allocation sizes depend on the maximum thread count and the maximum core count. I left these allocations unchanged.

marktwtn avatar Oct 22 '18 05:10 marktwtn

  1. Experiment result: Forget about the execution time; it is not related to the memory pool. I used rdtsc to read the time stamp counter difference for each memory allocation.

    Sample points 0 ~ 150: [figure: timestamp0-150]

    Sample points 2000 ~ 2150: [figure: timestamp2000-2150]

    The graphs show the time stamp difference for allocating memory with malloc and with the memory pool while running the PoW 100 times. The memory pool looks better than dynamic allocation. However, there is a strange peak in the memory pool results. It happens when allocating 16 bytes right after the PoW finishes. I am still looking for the cause of this behavior.

    The previous comment reported a worse result; that was caused by reading the time stamp counter at the wrong line of the source code.

  2. Allocation size: As noted in the previous comment, some allocations depend on the maximum thread count and core count. If these numbers can be determined in advance, there is no problem at all.

marktwtn avatar Oct 25 '18 01:10 marktwtn

rdtsc is not accurate for SMP.

jserv avatar Oct 25 '18 07:10 jserv

rdtsc is not accurate for SMP.

However, even if I use the clock_gettime function to measure the time difference, the result is still the same.
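
For reference, a sketch of how per-allocation latency could be sampled with clock_gettime. This is not the measurement code used in the thread. The helper names now_ns, sample_malloc_ns and max_sample are hypothetical, and CLOCK_MONOTONIC is an assumed choice of clock.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime in strict modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ull + (uint64_t) ts.tv_nsec;
}

/* Records the elapsed time of n malloc(size) calls into samples[]. */
void sample_malloc_ns(size_t size, uint64_t *samples, size_t n)
{
    assert(samples != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t t0 = now_ns();
        void *p = malloc(size);
        uint64_t t1 = now_ns();
        if (p)
            *(volatile unsigned char *) p = 0; /* keep the call from being optimized out */
        free(p);
        samples[i] = t1 - t0;
    }
}

/* Largest sample, useful for spotting isolated peaks. Returns 0 if n is 0. */
uint64_t max_sample(const uint64_t *samples, size_t n)
{
    uint64_t max = 0;
    for (size_t i = 0; i < n; i++)
        if (samples[i] > max)
            max = samples[i];
    return max;
}
```

The same loop can wrap a pool allocator instead of malloc to compare the two distributions. Each sample also includes the cost of the clock_gettime calls themselves, which can dominate when the allocation is very cheap.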

marktwtn avatar Oct 25 '18 09:10 marktwtn

When I used analysis tools such as perf, I found that the PoW part took most of the computation time. Therefore, it was hard to see the behaviour of other functions such as memory pool allocation.

However, the suggestion to empty the PoW function did not work properly, since the time stamp counter difference of each memory allocation is somehow affected by the PoW function.

marktwtn avatar Nov 02 '18 08:11 marktwtn

Ouch! It is a pity. I look forward to the migration to other memory allocators.

jserv avatar Nov 02 '18 10:11 jserv

Since rdtsc can be affected by out-of-order execution and variable CPU clock frequency, the measurement now uses the clock_gettime function instead.

The following charts come from running on different hardware, with and without the function transformXXX() commented out.

  • My desktop: with transformXXX() [figure: cpu-desktop-nano]; without transformXXX() [figure: cpu-desktop-nano-notrans]

  • My laptop: with transformXXX() [figure: cpu-laptop-nano]; without transformXXX() [figure: cpu-laptop-nano-notrans]

  • node.deviceproof.org (with DCURL_CPU_NUM=3): with transformXXX() [figure: cpu-device-nano-3core]; without transformXXX() [figure: cpu-device-nano-notrans-3core]

Question and conclusion: Commenting out the important function transformXXX() in the PoW does reduce the impact on memory allocation time. However, the reason is not clear (I guess it is caused by the cache). The memory allocation time is still not stable, which means either the memory allocator is not good enough or something else in dcurl affects it.

Keep investigating.

marktwtn avatar Nov 07 '18 18:11 marktwtn

After #95 is resolved, we can continue the memory pool work.

jserv avatar Jan 31 '19 06:01 jserv

Cc. @JulianATA

jserv avatar Nov 04 '19 16:11 jserv