Perf[BMQ,MQB]: stable GC history preventing allocations
Problem overview
Consider a cluster processing 100k PUT msgs/s at its limit, meaning it can barely process this rate of PUTs without NACKs, and producers generate messages at exactly this rate.
Consider a rollover or any other lag causing the QUEUE dispatcher thread to stall for 100ms.
After this time, there will be 10k pending PUT messages waiting in the QUEUE dispatcher queue:
[Put1]...[Put10000]
Now, consider a GC called before this block of PUTs. It cleared some of the queue history, but not everything, and we also scheduled another GC call to the QUEUE dispatcher:
[Put1]...[Put10000][gcHistory]
To call this gcHistory, we first need to process all the PUTs ahead of it (they are earlier in the queue).
So we process 10k PUTs and add them to the queue history. Since in this scenario producers generate 100k msgs/s and the cluster can process 100k msgs/s and no more, by the time we process Put1-Put10000 there will be another 10k PUTs in the queue:
[gcHistory][Put10001]...[Put20000]
So when we finally call gcHistory, we will remove at most 1k msgs from the history (k_GC_MESSAGES_BATCH_SIZE = 1000). But in the same window we will add 10k new items to the queue history and get another 10k PUTs in the QUEUE dispatcher queue, delaying the next gcHistory.
And this process will continue: we will constantly allocate new items in the queue history items pool, and we will eventually upsize the number of buckets in the ordered hashmap, causing a performance lag and a full rehash of the hashtable.
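As a rough back-of-envelope check of this pacing mismatch, the toy loop below replays the scenario above (100k PUTs/s, a 1000-item gc batch). The 100ms gc cycle is an assumption made only for this illustration; this is not broker code:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

int main()
{
    // Rates and batch size from the scenario above; the 100ms gc cycle is
    // an assumption made only for this illustration.
    const std::int64_t k_PUT_RATE      = 100000;  // PUTs per second
    const std::int64_t k_GC_BATCH_SIZE = 1000;    // k_GC_MESSAGES_BATCH_SIZE
    const std::int64_t k_GC_PER_SECOND = 10;      // one gcHistory per ~100ms of backlog

    std::int64_t history = 0;  // number of items currently in queue history
    for (int second = 0; second < 300; ++second) {            // 5-minute run
        for (int cycle = 0; cycle < k_GC_PER_SECOND; ++cycle) {
            history += k_PUT_RATE / k_GC_PER_SECOND;           // every PUT is inserted
            history -= std::min(history, k_GC_BATCH_SIZE);     // gc removes at most one batch
        }
    }
    // Net growth is ~9k items per cycle, i.e. ~27M items after 5 minutes at
    // 100k msgs/s -- the same order of magnitude as observed below at 80k msgs/s.
    std::cout << "history items after 5 minutes: " << history << '\n';
    return 0;
}
```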
~20 million items in queue history after a 5-minute run at 80k msgs/s:
3GB allocated for queue history (Handles):
:::::::::: :::::::::: ALLOCATORS >> Last snapshot was 1.00 m ago.
| Bytes Allocated| -delta-| Max Bytes Allocated| Allocations| -delta-| Deallocations| -delta-
---------------------------------+----------------+--------+--------------------+------------+--------+--------------+--------
allocators | 7,424,269,408| 86,800| 7,424,336,192| 106,779,796| 315,271| 103,623,032| 315,254
*direct* | 1,584| | 1,584| 10| | 1|
Application | 7,423,728,624| 87,632| 7,423,748,064| 84,002,540| 243,085| 80,846,206| 243,063
*direct* | 9,600| | 11,440| 1,570| | 1,544|
ClusterCatalog | 3,426,265,408| 528| 3,426,279,424| 17,366,654| 39,631| 17,365,735| 39,630
*direct* | 3,776| | 3,776| 21| | 3|
cl6_dc3 | 3,426,261,632| 528| 3,426,275,648| 17,366,633| 39,631| 17,365,732| 39,630
*direct* | 1,019,184| | 1,056,640| 4,229,079| | 4,228,883|
StorageManager | 3,382,069,840| 528| 3,382,083,680| 13,137,171| 39,631| 13,136,714| 39,630
*direct* | 7,920| | 7,920| 41| | 12|
Partition1 | 3,361,992,272| 528| 3,362,006,112| 13,136,758| 39,631| 13,136,583| 39,630
*direct* | 5,824| 528| 19,664| 13,136,547| 39,631| 13,136,522| 39,630
QueueStorage | 3,163,421,248| | 3,163,421,408| 139| | 40|
*direct* | 2,096| | 2,256| 15| | 4|
Handles | 3,059,921,280| | 3,059,921,280| 63| | 20|
Alternative 1
Set k_GC_MESSAGES_BATCH_SIZE = 1000000:
Alternative 2
Insert to history is the only operation that can increase the history size, and, if we know that GC is required, we can enforce GC on any insert operation. So the workflow is (a sketch follows the list):
- We exec the scheduled gcHistory and find out that we cleaned 1000 items from history, but there are more. We set the d_requireGC flag for the history.
- Next, we have a PUT event. We do insert to history on PUT. insert checks the d_requireGC flag and execs gc for another batch of items.
- This process continues for one or more PUTs, but eventually we reach the end of gc, when there are no more items to gc. We unset the d_requireGC flag, so the following PUTs do not cause GC.
- The situation remains the same until the next scheduler call possibly sets d_requireGC again.
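Here is a minimal sketch of this flag-driven flow. The class, its container, and the per-insert batch value are stand-ins for illustration; only d_requireGC, gcHistory, and the batch-size constants come from the discussion above:

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

class QueueHistorySketch {
  private:
    // Items in insertion order (oldest first): (expiry time, message GUID).
    std::deque<std::pair<long long, std::string> > d_items;
    bool                                           d_requireGC;

    // Batch used by the scheduled gcHistory (value from the issue) and a
    // hypothetical, smaller per-insert batch (value is an assumption).
    static constexpr std::size_t k_GC_MESSAGES_BATCH_SIZE        = 1000;
    static constexpr std::size_t k_INSERT_GC_MESSAGES_BATCH_SIZE = 10;

    // Remove up to 'batchSize' expired items from the front; return 'true'
    // if expired items still remain afterwards.
    bool gc(long long now, std::size_t batchSize)
    {
        std::size_t removed = 0;
        while (!d_items.empty() && d_items.front().first <= now &&
               removed < batchSize) {
            d_items.pop_front();
            ++removed;
        }
        return !d_items.empty() && d_items.front().first <= now;
    }

  public:
    QueueHistorySketch()
    : d_requireGC(false)
    {
    }

    // Scheduled gcHistory event, executed on the QUEUE dispatcher thread.
    void onGcHistory(long long now)
    {
        // If one batch was not enough, remember that more gc is required.
        d_requireGC = gc(now, k_GC_MESSAGES_BATCH_SIZE);
    }

    // Called for every PUT, on the same dispatcher thread.
    void insert(const std::string& guid, long long expiry, long long now)
    {
        if (d_requireGC) {
            // Bounded extra work on the critical path, until the backlog of
            // expired items is drained; then the flag is cleared.
            d_requireGC = gc(now, k_INSERT_GC_MESSAGES_BATCH_SIZE);
        }
        d_items.emplace_back(expiry, guid);
    }
};
```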
So we pay some CPU time on PUT operations to do the GC, but at the same time:
- We save CPU time on allocating elements in the history items pool.
- We save CPU on rehashing the history hashmap when it grows to a bigger scale.
- We occupy less memory, and the memory growth is no longer uncontrollable.
- We save some CPU on enqueuing the gcHistory callback to the same dispatcher queue.
Comparison
The 80k msgs/s priority stress test shows the same results as a 90k msgs/s test with the proposed change. This means that with this change we can actually hold up better at higher throughputs.
Before, performing gc on history was always sort of best-effort and done in the background. I think I agree on approach 1 not being ideal: it doesn't get at the core of the problem, which is the mismatched pacing of PUTs and gc; gc is still too periodic. Approach 2 is interesting, I hadn't considered moving gc into the critical path.
With approach 2, if (and only if) background gc can't keep history limited to the configured time, the broker starts doing gc on the critical data path when processing incoming PUTs. When doing that gc, k_INSERT_GC_MESSAGES_BATCH_SIZE controls how aggressive the gc is, and since a PUT is only one message, any value > 1 will eventually reduce history to the desired size?
I'm curious if you've got any measurements on the gc time, do we have a rough number for how long that takes? I suspect it's pretty cheap. It'd be interesting to see if a smaller batch of gc on PUTs helps (in terms of latency maybe?) at all.
> I hadn't considered moving gc into the critical path.

Actually, we already do gc from the same QUEUE dispatcher thread, just more rarely.

> and since a PUT is only one message, any value > 1 will eventually reduce history to the desired size?

That is correct: each PUT inserts one item while gc removes up to k_INSERT_GC_MESSAGES_BATCH_SIZE, so any batch size greater than 1 shrinks the expired backlog on every PUT. However, there should also be some overhead of calling gc, making some initial checks, going into the loop, etc. Need to check if it's inlined or not.

> I'm curious if you've got any measurements on the gc time, do we have a rough number for how long that takes? I suspect it's pretty cheap.
Yes, I think it is cheap, but I will also add benchmarks.
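For what it's worth, a rough microbenchmark along these lines might look like the sketch below. It uses std::unordered_map as a stand-in for the actual history container, so the absolute numbers would only be indicative:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_map>

int main()
{
    // Fill a map with 1M entries to emulate a large, mostly-expired history.
    std::unordered_map<std::uint64_t, std::uint64_t> history;
    history.reserve(1000000);
    for (std::uint64_t i = 0; i < 1000000; ++i) {
        history.emplace(i, i);
    }

    const std::size_t k_BATCH = 1000;  // one gc batch, as in the issue

    const auto  start   = std::chrono::steady_clock::now();
    std::size_t removed = 0;
    for (auto it = history.begin(); it != history.end() && removed < k_BATCH;) {
        it = history.erase(it);  // emulate gc of one expired item
        ++removed;
    }
    const auto end = std::chrono::steady_clock::now();

    std::cout << "erased " << removed << " items in "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start)
                     .count()
              << " us\n";
    return 0;
}
```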