Feat[MQB]: Enhance queue consumption monitor alarm log with additional details
When a queue starts to fill up, it is valuable to see information about which AppIds are impacted, and information about the messages in the queue. Especially in the case of subscriptions (which we are enabling for everyone now), messages that match no subscription expression will build up in the put aside list. To help make this situation clearer to operators and users (what apps are impacted, why are messages building up, how old is the head of the queue for each app, etc), we can log more information when the watermark alarm is triggered:
- Sizes of put-aside list and redelivery list for each app for that queue;
- Oldest message's timestamp in put aside list & its message properties;
- Number of unconfirmed messages;
- Total size of messages for each app (but seems it is already done by
storage()->capacityMeter()->printShortSummary());
This is to help debug why a message doesn't match a subscription.
Alarm log looks like this:
ERROR mqbblp_rootqueueengine.cpp:1765 ALARM [QUEUE_STUCK] Queue 'bmq://bmq.test.mem.fanout/my1?id=foo' Messages [current: 2 / 1,000], Bytes [current: 110 B / 1.00 MB], max idle time 20.00 s appears to be stuck. It currently has 1 consumers.
1. bmqtool.tsk:[email protected]~localhost:37430
Handle Parameters .....: [ uri = "bmq://bmq.test.mem.fanout/my1" qId = 0 subIdInfo = NULL flags = 2 readCount = 1 writeCount = 0 adminCount = 0 ]
Number of unconfirmed messages .....: 1
UnconfirmedMonitors ....:
STATE_NORMAL [Messages (STATE_NORMAL): 1 (820 - 820 - 1,024), Bytes (STATE_NORMAL): 55 B (25.60 MB - 25.60 MB - 32.00 MB)]
STATE_NORMAL [Messages (STATE_NORMAL): 0 (820 - 820 - 1,024), Bytes (STATE_NORMAL): 0 B (25.60 MB - 25.60 MB - 32.00 MB)]
For appId: foo
Put aside list size: 1
Redelivery list size: 0
Number of messages: 2
Number of bytes: 110
Consumer subscription expressions:
x == 2
y == 4
y == 3
x == 1
Oldest message in a 'Put aside' list:
GUID Size Timestamp (UTC)
4000000000033E07AD91F4D9C7E5D709 55 B 24SEP2024_09:09:06.777472+0000
Message Properties: [ sample_str (STRING) = "foo bar" x (INT32) = 10 ]
10 oldest messages in the queue:
Printing 2 message(s) [0-1 / 2] (total: 110 B)
GUID Size Timestamp (UTC)
0: 4000000000033E07AD91F4D9C7E5D709 55 B 24SEP2024_09:09:06.777471+0000
1: 4000010000344E57D5C6F4D9C7E5D709 55 B 24SEP2024_09:12:37.504283+0000
Current head of the queue:
GUID Size Timestamp (UTC)
4000000000033E07AD91F4D9C7E5D709 55 B 24SEP2024_09:09:06.777471+0000
Implementation details:
- Log alarm logic is moved from
QueueConsumptionMonitorintoRootQueueEngineclass, where more data is available; - Callback is passed to
QueueConsumptionMonitorand called in case of alarm to log alarm data;
Regarding Maybe, we can print Storage::numMessages and Storage::numBytes as well?. CapacityMeter::printShortSummary() already prints numMessages and numBytes, e.g.
Messages [current: 2 / 1,000], Bytes [current: 66 B / 1.00 MB]
Aren't they the same as Storage::numMessages() and Storage::numBytes? In my test they printed the same values. Are there any possible scenarios when they will differ?
Regarding Can we get rid of QueueEngineUtil_AppState::head() and QueueConsumptionMonitor::SubStreamInfo::d_headCb now? Now we print oldest message from put aside list, but what happens if we have empty put aside list and not empty redelivery list? For example, there are 4 messages, and I limited to print only 3 oldest messages. Without printing head we don't know what is the last message in queue:
ERROR mqbblp_rootqueueengine.cpp:1766 ALARM [QUEUE_STUCK] Queue 'bmq://bmq.test.mem.fanout/my1?id=foo' Messages [current: 4 / 1,000], Bytes [current: 44 B / 1.00 MB], max idle time 20.00 s appears to be stuck. It currently has 0 consumers.
Put aside list size: 0
Redelivery list size: 3
3 oldest messages in the queue:
Printing 3 message(s) [0-2 / 4] (total: 33 B)
GUID Size Timestamp (UTC)
0: 400000000011896EDA51C2376057798E 11 B 17SEP2024_12:23:02.458416+0000
1: 400001000012F52671B0C2376057798E 11 B 17SEP2024_12:23:08.559553+0000
2: 4000020000132AB6F3D6C2376057798E 11 B 17SEP2024_12:23:09.458243+0000
Current head of the queue:
GUID Size Timestamp (UTC)
40000300001A6876353BC2376057798E 11 B 17SEP2024_12:23:40.559572+0000
Is head info valuable in this scenario?
Regarding Is QueueConsumptionMonitor::onTransitionToIdle "level triggered" (vs "edge triggered")? - QueueConsumptionMonitor::onTransitionToIdle is edge triggered, so onTransitionToIdle is called only once when state is transitioned from Active to Idle.
Aren't they the same as Storage::numMessages() and Storage::numBytes? In my test they
Looking at the FileBackedStorage::numMessages they are the same if asking for the queue stats and not appId stats.
It may make sense to print the stats for the App (vs the entire queue)
Aren't they the same as Storage::numMessages() and Storage::numBytes? In my test they
Looking at the
FileBackedStorage::numMessagesthey are the same if asking for the queue stats and not appId stats.
Thanks, make sense, so I will add Storage::numMessages and Storage::numBytes per appId (appKey).
Regarding
Can we get rid of QueueEngineUtil_AppState::head() and QueueConsumptionMonitor::SubStreamInfo::d_headCb now?
d_headCb is QueueEngineUtil_AppState::head() which is owned by the QueueEngine. So, there seems to be no need to pass it to RootQueueEngine::logAlarmCb.
QueueEngineUtil_AppState::head() should have answers about empty list(s) scenarios. We may want to revisit the logic though. Currently, we print put-aside if not empty, and the storage otherwise.
QueueConsumptionMonitor::onTransitionToIdleis edge triggered, soonTransitionToIdleis called only once when state is transitioned fromActivetoIdle.
Ok. This can flap, we often see a lot of subsequent logs. We are increasing the size of what's logged, so we may want to throttle the logging.
@678098 thank you, applied your suggestions.