activemq-artemis ARTEMIS-4579 Add the *FirstMessage* API for scheduled messages

Alerting on issues with messages not being received properly for a period of time is an uneasy task. We use the getFirstMessageAge() command to trigger alerts in Zabbix, and it works as long as there are no consumers. But this approach fails when there are consumers repeatedly failing to receive a message. That message is getting scheduled for redelivery over and over, and even though there still is an old message in the queue to be reported, it's no longer visible via getFirstMessage*() API.

The goal here is to add a set of functions working with messages scheduled for delivery:

getFirstScheduledMessageAsJSON()
getFirstScheduledMessageTimestamp()
getFirstScheduledMessageAge()

It may be not the most effective approach but it's quite a convenient one, especially when monitoring a wide set of queues, each with its own set of alerts.

Jan 19 '24 05:01 jsmucr

my only concern is storing a reference outside the data structure for the first message. treeMap.first() is there for right that purpose.

on the implementation for first(), this is just returning:

   public E first() {
        return m.firstKey();
    }

Which means you can always just return first().

I'm not sure if you need to check for empty or not.. as you would need to double check on that.

Jan 22 '24 19:01 clebertsuconic

Thanks for the review @clebertsuconic, I'll process your suggestions as the first thing tomorrow morning.

Jan 22 '24 20:01 jsmucr

For awhile I've actually discouraged folks from using the "first message" metrics. I discussed this on the ActiveMQ users list not long ago:

The getFirstMessageAge operation is actually fairly "heavy" and not generally recommended. Furthermore, the age of the first message isn't meaningful in and of itself in this scenario because if the consumerCount is 0 then by definition no messages can be stuck. A robust stuck-message detection mechanism must, at the very least, verify that consumerCount > 0. Also, instead of using the age of the first message I recommend inspecting messagesAcknowledged over time. For example, if the consumerCount > 0 and messagesAcknowledged remains unchanged for 60 seconds then messages (or more likely consumers) may be stuck. If you're using Prometheus then I believe you can use a range vector selector for this kind of operation.

At this point I'm against adding "first message" metrics for scheduled messages because it will be a relatively "heavy" operation due to the synchronized block. A lot of JMX monitoring tools will simply poll queue MBeans which means this new management method may be invoked a lot, especially on a broker with lots of queues. Over the last few years we've seen an increasing number of deployments with many thousands of queues. This is one reason we implemented (and generally recommend using) pluggable metrics which should provide a lighter footprint than JMX and open the door for easier integration with tools that specialize in graphing and alerting (e.g. Prometheus & Grafana).

Would it be possible for you to use existing metrics to solve your problem rather than implementing this new management method?

Jan 23 '24 04:01 jbertram

@jbertram Only by very complicated workarounds. I need to find out if there's an old message in a queue that's been there for longer than a specified interval, and I'm not allowed to use DLQs for that, because it means a manual intervention on the broker if processing fails too many times. I know this method may be heavy but so is our workflow here, which is pretty much based on wasting resources in exchange for usability and easy maintenance. :-) Our Zabbix instance checks each queue about once a minute, so it's unlikely to cause any performance hit. I understand your point of view, I really do, but I also find Artemis to be quite an advanced tool all by itself, and that if I slow it down by using a method that is heavy on resources, it's entirely my fault as an operator. Perhaps a page of documentation would be enough to get rid of responsibility here.

Jan 23 '24 05:01 jsmucr

I've added a MessageReferenceImpl leak test.

Jan 23 '24 09:01 jsmucr

I need to find out if there's an old message in a queue that's been there for longer than a specified interval, and I'm not allowed to use DLQs for that, because it means a manual intervention on the broker if processing fails too many times.

I'd really like to focus on exactly why you need to know if there's an old message in a queue that's been there for longer than a specified interval?

As noted in my previous comment, in order to have a robust detection of stalled consumers (a.k.a. stuck messages) you already need to look at multiple metrics. If you're relying solely on a metric like firstMessageAge you're liable to get false positives. It's just not a good solution. I've long considered deprecating these metrics because folks tend to misunderstand and misuse them.

Keep in mind that we have slow consumer detection built into the broker.

I know this method may be heavy but so is our workflow here, which is pretty much based on wasting resources in exchange for usability and easy maintenance.

The problem, as I see it, is that by implementing this method you're going to force this wasting of resources on other users who likely don't want it. As noted previously, it's very common for JMX monitoring tools to fetch the values of every attribute on a given MBean. By adding this method you're implicitly impacting that use-case. Lots of folks use JMX monitoring tools that scan MBeans more than once per minute.

Jan 23 '24 17:01 jbertram

I'm also concerned about adding weight here.

I would be okay on adding a method to peek at the first message.. but adding extra processing weight for those who don't need this.. I'm kind of concerned also.

Can't you get this by listScheduledMessages on the control?

or... also I would be okay on adding a metric for number of scheduled messages... (I think it would make sense).. as that would just get the counter directly. and then you can do your processing on listing all the messages if certain metrics apply.

Also, you could tell a lot by the number of redeliveries on the system. you should probably add a warning on your system for the redeliveries counter.

Jan 23 '24 17:01 clebertsuconic

or... also I would be okay on adding a metric for number of scheduled messages...

We already have that. It's called scheduledCount.

Jan 23 '24 17:01 jbertram

Right! I looked at the callers on the TreeMap.. I forgot we have the metrics in place.

Jan 23 '24 17:01 clebertsuconic

Okay, perhaps I'm not clear enough on what I'm trying to achieve here. And I understand your concerns, so I'll be glad for just another simple solution of my problem if you can think of any.

Here's the use case:

There's a SLA for every file we run through the broker. It's crucial to know there's a message running around for 10 minutes while it was supposed to leave our system 5 minutes ago. Then there's a five minute trigger which puts us on guard but it's quite common for individual consumers to run into issues with remote customer services (such as connection or gateway timeouts), and it's likely for a warning to disappear before we hit the 10-minute alert.
There's a 24/7 first level support without any access to underlying systems except for those alerts. These people wake us up at night based on what an alert says. If it says 5 minutes, they wait (because it's yellow), once it hits 10 minutes, we already know there's something to take a look at.
In order to figure out which message is the oldest one that cannot be delivered, we need to look (all sleepy and grumpy) at its metadata. Currently, this basically means to pause the queue, because the GUI doesn't display messages-in-flight. The time is running out, so a GUI is a must. Unless, of course, there's an attribute, which directly shows the metadata and allows one to act quickly.

Jan 23 '24 19:01 jsmucr

And as I was composing my previous post, I realized that maybe it's not the idea but the approach that we couldn't agree upon!

Can't you get this by listScheduledMessages on the control?

That's pretty close actually. In order to be able to monitor the entire cluster via Zabbix I've created sort of a JMX proxy which collects and exposes data from all nodes at once. That proxy is also capable of calling methods, and exposing their results as custom JMX attributes.

So basically, the issue you have with the *FirstMessage* API is that it's a set of attributes, not a set of operations, right?

Jan 23 '24 19:01 jsmucr

@jbertram if this was an operation instead of an attribute, then you wouldn't have the issue of scans from JMX providers, right?

Jan 23 '24 21:01 clebertsuconic

I've replaced the attributes with this new peekFirstScheduledMessage() operation. Maybe I could add peekFirstMessage() too, or perhaps rename both to avoid confusion: peekOldestMessage(), peekOldestScheduledMessage()

Jan 24 '24 07:01 jsmucr

Currently, this basically means to pause the queue, because the GUI doesn't display messages-in-flight.

There is a management method which shows details of in-flight messages. It's called listDeliveringMessagesAsJSON. That operation, combined with listScheduledMessagesAsJSON, should give you everything you need, right?

Jan 29 '24 21:01 jbertram

...if this was an operation instead of an attribute, then you wouldn't have the issue of scans from JMX providers, right?

@clebertsuconic, that would be much better, but I still feel like there's a better solution here.

Jan 29 '24 21:01 jbertram

There is a management method which shows details of in-flight messages. It's called listDeliveringMessagesAsJSON. That operation, combined with listScheduledMessagesAsJSON, should give you everything you need, right?

It should but there are two things I don't like about this approach:

The output requires some postprocessing, and therefore it is fairly unusable from the GUI.
There's nothing that stops the broker from transforming and exporting significant amount of data when all I need is one number.

I still feel like there's a better solution here

Again, I'm all ears. The bigger picture is that fast response time is a must, or I'm stopping a car assembly line (literally), so every help is much appreciated. :-)

Jan 30 '24 05:01 jsmucr

After thinking about it more I haven't come up with any good alternatives. I'm OK with merging this now.

Feb 17 '24 01:02 jbertram