dcurl
Avoid frequent memory allocation/deallocation by memory pool
The current PoW internals call malloc and free frequently, which hurts performance. A memory pool is a common technique for speeding up allocation and keeping execution time consistent.
I have written a preliminary memory pool: https://github.com/jserv/dcurl/tree/memory-pool NOTE: we might have to deal with thread-safety issues, and we should check out existing implementations such as philip-wernersbach/memory-pool-allocator.
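To make the idea concrete, here is a minimal sketch of a fixed-size block pool guarded by a mutex. It is not the code from the memory-pool branch; the names mem_pool_t, pool_init, pool_alloc, pool_free and pool_destroy are hypothetical, and a single pthread mutex is only one assumed way to handle the thread-safety concern.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size block pool; all names are illustrative only. */
typedef struct pool_node {
    struct pool_node *next;
} pool_node_t;

typedef struct {
    void *arena;            /* one upfront allocation backing every block */
    pool_node_t *free_list; /* singly linked list of free blocks */
    size_t block_size;      /* rounded-up size of each block */
    pthread_mutex_t lock;   /* serializes access from multiple threads */
} mem_pool_t;

int pool_init(mem_pool_t *p, size_t block_size, size_t count)
{
    size_t align = _Alignof(max_align_t);
    if (block_size == 0 || count == 0)
        return -1;
    /* Round up so each block is aligned and can hold a free-list node. */
    block_size = (block_size + align - 1) / align * align;
    if (count > (size_t) -1 / block_size)
        return -1; /* size computation would overflow */
    p->arena = malloc(block_size * count);
    if (!p->arena)
        return -1;
    p->free_list = NULL;
    p->block_size = block_size;
    /* Thread every block onto the free list. */
    for (size_t i = 0; i < count; i++) {
        pool_node_t *n = (pool_node_t *) ((char *) p->arena + i * block_size);
        n->next = p->free_list;
        p->free_list = n;
    }
    pthread_mutex_init(&p->lock, NULL);
    return 0;
}

/* Pop one block; returns NULL when the pool is exhausted. */
void *pool_alloc(mem_pool_t *p)
{
    pthread_mutex_lock(&p->lock);
    pool_node_t *n = p->free_list;
    if (n)
        p->free_list = n->next;
    pthread_mutex_unlock(&p->lock);
    return n;
}

/* Push a block obtained from pool_alloc back onto the free list. */
void pool_free(mem_pool_t *p, void *ptr)
{
    if (!ptr)
        return;
    pool_node_t *n = ptr;
    pthread_mutex_lock(&p->lock);
    n->next = p->free_list;
    p->free_list = n;
    pthread_mutex_unlock(&p->lock);
}

void pool_destroy(mem_pool_t *p)
{
    pthread_mutex_destroy(&p->lock);
    free(p->arena);
    p->arena = NULL;
    p->free_list = NULL;
}
```

Allocation and release are a constant-time pointer swap, which is where the consistent execution time would come from. The lock is still a contention point, so a per-thread pool might be a better fit if each PoW thread only touches its own buffers.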
After applying enable-rdtsc.patch, I got the following time-stamp numbers:
*** Validating build/test_trinary ***
=== trits_from_trytes: 42320 ===
=== trytes_from_trits: 5103 ===
*** Validating build/test_curl ***
=== trits_from_trytes: 220208 ===
=== trytes_from_trits: 3171 ===
*** Validating build/test_pow_sse ***
=== trits_from_trytes: 76903 ===
=== trits_from_trytes: 32099 ===
=== trytes_from_trits: 2245 ===
=== trits_from_trytes: 33132 ===
=== trytes_from_trits: 2674 ===
=== trits_from_trytes: 2651 ===
To illustrate the memory impact, TCMalloc is used for comparison. The test environment is an Intel Xeon E5 class server running Ubuntu Linux 17.04.
First, prepare TCMalloc: $ sudo apt install libtcmalloc-minimal4.
- without TCMalloc
$ make check
*** Validating build/test_trinary ***
=== trits_from_trytes: 5460 ===
=== trytes_from_trits: 4286 ===
*** Validating build/test_curl ***
=== trits_from_trytes: 120820 ===
=== trytes_from_trits: 3940 ===
*** Validating build/test_pow_sse ***
=== trits_from_trytes: 68535 ===
=== trits_from_trytes: 61490 ===
=== trytes_from_trits: 1221 ===
=== trits_from_trytes: 31277 ===
=== trytes_from_trits: 1617 ===
=== trits_from_trytes: 2668 ===
- with TCMalloc
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 make check
*** Validating build/test_trinary ***
=== trits_from_trytes: 29244 ===
=== trytes_from_trits: 3920 ===
*** Validating build/test_curl ***
=== trits_from_trytes: 108290 ===
=== trytes_from_trits: 2940 ===
*** Validating build/test_pow_sse ***
=== trits_from_trytes: 79500 ===
=== trits_from_trytes: 57566 ===
=== trytes_from_trits: 3028 ===
=== trits_from_trytes: 31777 ===
=== trytes_from_trits: 1654 ===
=== trits_from_trytes: 1980 ===
dcurl would benefit from a pre-allocated memory pool, especially when its size is tuned for the trytes/trits representations.
It is worth mentioning that the heap memory each PoW task allocates is fixed in size. Maybe we can implement a special memory pool for allocating trytes and trits.
In our scenario, the memory each PoW task (thread) uses is "fixed" and the variables can be reused. I think we can declare all the variables in advance rather than allocating them from a memory pool every time.
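As a rough sketch of declaring everything in advance, each worker could own one statically allocated context that it reuses across PoW runs. The names pow_ctx_t, pow_ctx_get and MAX_WORKERS are hypothetical, and the buffer sizes assume a 2673-tryte transaction with 3 trits per tryte; dcurl's real buffers may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes: 2673 trytes per transaction, 3 trits per tryte. */
#define TX_TRYTES 2673
#define TX_TRITS (TX_TRYTES * 3)

/* Hypothetical cap on concurrent PoW workers. */
#define MAX_WORKERS 8

/* All scratch buffers one PoW task needs, sized at compile time. */
typedef struct {
    int8_t trits[TX_TRITS];
    char trytes[TX_TRYTES + 1]; /* +1 for a terminating NUL */
} pow_ctx_t;

/* One context per worker, reserved for the lifetime of the process. */
static pow_ctx_t ctx_table[MAX_WORKERS];

/* Return the cleared context owned by a worker, or NULL if out of range. */
pow_ctx_t *pow_ctx_get(unsigned worker_id)
{
    if (worker_id >= MAX_WORKERS)
        return NULL;
    pow_ctx_t *c = &ctx_table[worker_id];
    memset(c, 0, sizeof(*c));
    return c;
}
```

Because each worker only touches its own slot, no locking is needed, and no allocator call remains on the PoW path.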
The tool heaptrack can show information about dynamic memory allocation, such as:
- number of allocations
- bytes allocated
- memory leaks

This information gives us a blueprint for the memory pool design and helps us determine the size of the memory pool.
Dynamic memory allocation tends to be non-deterministic. Is it possible to eliminate the existing dynamic allocation inside dcurl?
Yes, we can reduce the dynamic allocation to a single call, or even use a declared char array as a memory pool.
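A declared array used as a pool can be as simple as a bump allocator, sketched below. The names arena_alloc and arena_reset and the 64 KiB size are assumptions for illustration. This version is not thread-safe and cannot free individual blocks, so it only fits a pattern where everything is released at once after a PoW run.

```c
#include <stddef.h>

/* Hypothetical arena size; a real value would come from profiling. */
#define ARENA_SIZE (64 * 1024)

/* Statically declared storage, aligned for any object type. */
static _Alignas(max_align_t) unsigned char arena[ARENA_SIZE];
static size_t arena_used;

/* Hand out the next aligned chunk, or NULL when the arena is full. */
void *arena_alloc(size_t size)
{
    size_t align = _Alignof(max_align_t);
    size_t rounded = (size + align - 1) / align * align;
    if (rounded < size || rounded > ARENA_SIZE - arena_used)
        return NULL; /* overflow or not enough room left */
    void *p = arena + arena_used;
    arena_used += rounded;
    return p;
}

/* Release every allocation at once, e.g. after one PoW run. */
void arena_reset(void)
{
    arena_used = 0;
}
```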
I have implemented a memory pool mechanism and integrated it into the SSE implementation of dcurl.
Here are the problems:
- Experiment result: I ran test-pow, executing the PoW 100 times. The execution time does not differ much. The time stamp difference of allocating memory in the trits_from_trytes and trytes_from_trits functions may be even worse.
  To solve the problem: (1) use perf or gprof to analyze the memory pool code and improve the performance; (2) run the program multiple times to see the execution time distribution.
- Allocation size: Take SSE as an example. Most allocation sizes are fixed. However, some allocation sizes depend on the maximum thread number and the maximum core number. I left these memory allocations unchanged.
- Experiment result: Forget about the execution time; it is not related to the memory pool. I use rdtsc to read the time stamp counter difference of each memory allocation.

  [Figure: sample points 0 ~ 150]

  [Figure: sample points 2000 ~ 2150]

  The graphs show the time stamp difference of allocating memory with the malloc function and with the memory pool when running PoW 100 times. The memory pool looks better than dynamic memory allocation. However, there is a strange peak for the memory pool. It happens when allocating 16 bytes right after the PoW finishes. I am still looking for the cause of this behavior. The previous comment said the result was worse; that was caused by reading the time stamp counter at the wrong line of the source code.
- Allocation size: As noted in the previous comment, some allocations depend on the maximum thread and core numbers. If these numbers can be determined in advance, there would be no problem at all.
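One possible way to pin these numbers down is to query the online core count once at startup and size the pool from it. This is only a sketch: worker_slots, pool_bytes and FALLBACK_WORKERS are hypothetical names, and _SC_NPROCESSORS_ONLN is a widely available extension (glibc, macOS, the BSDs) rather than a POSIX requirement.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical fallback when the core count cannot be queried. */
#define FALLBACK_WORKERS 1

/* Number of worker slots to reserve, taken from the online core count. */
size_t worker_slots(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (size_t) n : FALLBACK_WORKERS;
}

/* Total pool size if every worker needs per_worker_bytes of scratch space. */
size_t pool_bytes(size_t per_worker_bytes)
{
    return worker_slots() * per_worker_bytes;
}
```

With the count fixed at startup, the thread-dependent allocations could be sized once and served from the pool like the fixed-size ones.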
rdtsc is not accurate for SMP.
However, even when I use the clock_gettime function to measure the time difference, the result is still the same.
When I used an analysis tool such as perf, I found that the PoW part took most of the computation time.
Therefore, it was hard to see the behaviour of other functions such as the memory pool allocation.
However, the suggestion to empty the PoW function did not work properly,
since the time stamp counter difference of each memory allocation is somehow affected by the PoW function.
Ouch! It is a pity. I look forward to the migration to other memory allocators.
Since rdtsc can be affected by out-of-order execution and variable CPU clock frequency,
the measurement now uses the function clock_gettime instead.
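A measurement of this kind might look like the sketch below. It is not the instrumentation used for the charts; the helper names now_ns and time_malloc_ns are hypothetical, and CLOCK_MONOTONIC is an assumed clock choice.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* expose clock_gettime under strict C modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static inline uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
}

/* Time a single malloc call of the given size, in nanoseconds. */
uint64_t time_malloc_ns(size_t size)
{
    uint64_t start = now_ns();
    void *p = malloc(size);
    uint64_t end = now_ns();
    /* Touch the block so the compiler cannot drop the malloc/free pair. */
    if (p)
        ((volatile char *) p)[0] = 0;
    free(p);
    return end - start;
}
```

A single sample is noisy, so collecting many samples and looking at the distribution, as the charts do, is more informative than any one value.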
The following charts come from running on different hardware, with and without the specific function transformXXX() commented out.
- My desktop
  [Chart: with transformXXX()]
  [Chart: without transformXXX()]
- My laptop
  [Chart: with transformXXX()]
  [Chart: without transformXXX()]
- node.deviceproof.org (with DCURL_CPU_NUM=3)
  [Chart: with transformXXX()]
  [Chart: without transformXXX()]
Questions and conclusions:
Commenting out the important function transformXXX() in PoW does reduce the impact on memory allocation.
However, the reason is not clear. (I guess it is caused by the cache.)
Also, the memory allocation time is not stable, which means either the memory allocator is not good enough or something else in dcurl is interfering.
I will keep investigating.
After #95 is resolved, we can continue the memory pool work.
Cc. @JulianATA