Perf[BMQ,MQB]: stable GC history preventing allocations
Problem overview
Consider a cluster processing 100k PUT msgs/s at its limit, meaning it can barely process this rate of PUTs without NACKs, and producers generate messages at exactly this rate.
Consider a rollover or any other lag causing the QUEUE dispatcher thread to stall for 100ms.
After this time, there will be 10k pending PUT messages waiting in the QUEUE dispatcher queue:
[Put1]...[Put10000]
Now, consider a GC called before this block of PUTs. It cleared some of the queue history, but not everything, and we also scheduled another GC call to the QUEUE dispatcher:
[Put1]...[Put10000][gcHistory]
To call this gcHistory, we first need to process all the PUTs ahead of it (they are earlier in the queue).
So we process 10k PUTs and add them to the queue history. Since in this scenario producers generate 100k msgs/s and the cluster can process 100k msgs/s and no more, by the time we process Put1-Put10000 there will be another 10k PUTs in the queue:
[gcHistory][Put10001]...[Put20000]
So when we finally call gcHistory, we will remove at most 1k msgs from the history (k_GC_MESSAGES_BATCH_SIZE = 1000). But in the same window we will add 10k new items to the queue history and get another 10k PUTs in the QUEUE dispatcher queue, delaying the next gcHistory.
And this process will continue: we will constantly allocate new items in the queue history items pool, and we will eventually upsize the number of buckets in the ordered hashmap, causing a performance lag and a full rehash of the hashtable.
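As a rough back-of-envelope check of this pacing mismatch, the toy loop below replays the scenario above (100k PUTs/s, a 1000-item gc batch). The 100ms gc cycle is an assumption made only for this illustration; this is not broker code:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

int main()
{
    // Rates and batch size from the scenario above; the 100ms gc cycle is
    // an assumption made only for this illustration.
    const std::int64_t k_PUT_RATE      = 100000;  // PUTs per second
    const std::int64_t k_GC_BATCH_SIZE = 1000;    // k_GC_MESSAGES_BATCH_SIZE
    const std::int64_t k_GC_PER_SECOND = 10;      // one gcHistory per ~100ms of backlog

    std::int64_t history = 0;  // number of items currently in queue history
    for (int second = 0; second < 300; ++second) {            // 5-minute run
        for (int cycle = 0; cycle < k_GC_PER_SECOND; ++cycle) {
            history += k_PUT_RATE / k_GC_PER_SECOND;           // every PUT is inserted
            history -= std::min(history, k_GC_BATCH_SIZE);     // gc removes at most one batch
        }
    }
    // Net growth is ~9k items per cycle, i.e. ~27M items after 5 minutes at
    // 100k msgs/s -- the same order of magnitude as observed below at 80k msgs/s.
    std::cout << "history items after 5 minutes: " << history << '\n';
    return 0;
}
```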
~20 million items in queue history after a 5-minute run at 80k msgs/s:
3GB allocated for queue history (Handles):
:::::::::: :::::::::: ALLOCATORS >> Last snapshot was 1.00 m ago.
| Bytes Allocated| -delta-| Max Bytes Allocated| Allocations| -delta-| Deallocations| -delta-
---------------------------------+----------------+--------+--------------------+------------+--------+--------------+--------
allocators | 7,424,269,408| 86,800| 7,424,336,192| 106,779,796| 315,271| 103,623,032| 315,254
*direct* | 1,584| | 1,584| 10| | 1|
Application | 7,423,728,624| 87,632| 7,423,748,064| 84,002,540| 243,085| 80,846,206| 243,063
*direct* | 9,600| | 11,440| 1,570| | 1,544|
ClusterCatalog | 3,426,265,408| 528| 3,426,279,424| 17,366,654| 39,631| 17,365,735| 39,630
*direct* | 3,776| | 3,776| 21| | 3|
cl6_dc3 | 3,426,261,632| 528| 3,426,275,648| 17,366,633| 39,631| 17,365,732| 39,630
*direct* | 1,019,184| | 1,056,640| 4,229,079| | 4,228,883|
StorageManager | 3,382,069,840| 528| 3,382,083,680| 13,137,171| 39,631| 13,136,714| 39,630
*direct* | 7,920| | 7,920| 41| | 12|
Partition1 | 3,361,992,272| 528| 3,362,006,112| 13,136,758| 39,631| 13,136,583| 39,630
*direct* | 5,824| 528| 19,664| 13,136,547| 39,631| 13,136,522| 39,630
QueueStorage | 3,163,421,248| | 3,163,421,408| 139| | 40|
*direct* | 2,096| | 2,256| 15| | 4|
Handles | 3,059,921,280| | 3,059,921,280| 63| | 20|
Alternative 1
Set k_GC_MESSAGES_BATCH_SIZE = 1000000:
Alternative 2
Insert to history is the only operation that can increase the history size, and, if we know that GC is required, we can enforce GC on any insert operation. So the workflow is (a sketch follows the list):
- We exec the scheduled gcHistory and find out that we cleaned 1000 items from history, but there are more. We set the d_requireGC flag for the history.
- Next, we have a PUT event. We do insert to history on PUT. insert checks the d_requireGC flag and execs gc for another batch of items.
- This process continues for one or more PUTs, but eventually we reach the end of gc, when there are no more items to gc. We unset the d_requireGC flag, so the following PUTs do not cause GC.
- The situation remains the same until the next scheduler call possibly sets d_requireGC again.
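Here is a minimal sketch of this flag-driven flow. The class, its container, and the per-insert batch value are stand-ins for illustration; only d_requireGC, gcHistory, and the batch-size constants come from the discussion above:

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

class QueueHistorySketch {
  private:
    // Items in insertion order (oldest first): (expiry time, message GUID).
    std::deque<std::pair<long long, std::string> > d_items;
    bool                                           d_requireGC;

    // Batch used by the scheduled gcHistory (value from the issue) and a
    // hypothetical, smaller per-insert batch (value is an assumption).
    static constexpr std::size_t k_GC_MESSAGES_BATCH_SIZE        = 1000;
    static constexpr std::size_t k_INSERT_GC_MESSAGES_BATCH_SIZE = 10;

    // Remove up to 'batchSize' expired items from the front; return 'true'
    // if expired items still remain afterwards.
    bool gc(long long now, std::size_t batchSize)
    {
        std::size_t removed = 0;
        while (!d_items.empty() && d_items.front().first <= now &&
               removed < batchSize) {
            d_items.pop_front();
            ++removed;
        }
        return !d_items.empty() && d_items.front().first <= now;
    }

  public:
    QueueHistorySketch()
    : d_requireGC(false)
    {
    }

    // Scheduled gcHistory event, executed on the QUEUE dispatcher thread.
    void onGcHistory(long long now)
    {
        // If one batch was not enough, remember that more gc is required.
        d_requireGC = gc(now, k_GC_MESSAGES_BATCH_SIZE);
    }

    // Called for every PUT, on the same dispatcher thread.
    void insert(const std::string& guid, long long expiry, long long now)
    {
        if (d_requireGC) {
            // Bounded extra work on the critical path, until the backlog of
            // expired items is drained; then the flag is cleared.
            d_requireGC = gc(now, k_INSERT_GC_MESSAGES_BATCH_SIZE);
        }
        d_items.emplace_back(expiry, guid);
    }
};
```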
So we pay some CPU time on PUT operations to do the GC, but at the same time:
- We save CPU time on allocating elements in the history items pool.
- We save CPU on rehashing the history hashmap when it grows to a bigger scale.
- We occupy less memory, and the memory growth is no longer uncontrollable.
- We save some CPU on enqueuing the gcHistory callback to the same dispatcher queue.
Comparison
The 80k msgs/s priority stress test shows the same results as a 90k msgs/s test with the proposed change. This means that with this change we can actually hold up better at higher throughputs.
Before, performing gc on history was always sort of best-effort and done in the background. I think I agree on approach 1 not being ideal: it doesn't get at the core of the problem, which is the mismatched pacing of PUTs and gc; gc is still too periodic. Approach 2 is interesting, I hadn't considered moving gc into the critical path.
With approach 2, if (and only if) background gc can't keep history limited to the configured time, the broker starts doing gc on the critical data path when processing incoming PUTs. When doing that gc, k_INSERT_GC_MESSAGES_BATCH_SIZE controls how aggressive the gc is, and since a PUT is only one message, any value > 1 will eventually reduce history to the desired size?
I'm curious if you've got any measurements on the gc time, do we have a rough number for how long that takes? I suspect it's pretty cheap. It'd be interesting to see if a smaller batch of gc on PUTs helps (in terms of latency maybe?) at all.
> I hadn't considered moving gc into the critical path.

Actually, we already do gc from the same QUEUE dispatcher thread, just more rarely.

> and since a PUT is only one message, any value > 1 will eventually reduce history to the desired size?

That is correct: each PUT inserts one item while gc removes up to k_INSERT_GC_MESSAGES_BATCH_SIZE, so any batch size greater than 1 shrinks the expired backlog on every PUT. However, there should also be some overhead of calling gc, making some initial checks, going into the loop, etc. Need to check if it's inlined or not.

> I'm curious if you've got any measurements on the gc time, do we have a rough number for how long that takes? I suspect it's pretty cheap.
Yes, I think it is cheap, but I will also add benchmarks.
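For what it's worth, a rough microbenchmark along these lines might look like the sketch below. It uses std::unordered_map as a stand-in for the actual history container, so the absolute numbers would only be indicative:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_map>

int main()
{
    // Fill a map with 1M entries to emulate a large, mostly-expired history.
    std::unordered_map<std::uint64_t, std::uint64_t> history;
    history.reserve(1000000);
    for (std::uint64_t i = 0; i < 1000000; ++i) {
        history.emplace(i, i);
    }

    const std::size_t k_BATCH = 1000;  // one gc batch, as in the issue

    const auto  start   = std::chrono::steady_clock::now();
    std::size_t removed = 0;
    for (auto it = history.begin(); it != history.end() && removed < k_BATCH;) {
        it = history.erase(it);  // emulate gc of one expired item
        ++removed;
    }
    const auto end = std::chrono::steady_clock::now();

    std::cout << "erased " << removed << " items in "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start)
                     .count()
              << " us\n";
    return 0;
}
```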