activemq-artemis icon indicating copy to clipboard operation
activemq-artemis copied to clipboard

ARTEMIS-5773: augment cleanup of caches

Open scottmf opened this issue 1 month ago • 7 comments

Added extra cleanup of PagingManagerImpl.stores and PostOfficeImpl.duplicateIDCache kicked off from PostOfficeImpl.reapAddresses()

scottmf avatar Nov 20 '25 21:11 scottmf

I appreciate that the modifications to the reaper resolve the issue, but this is kind of like bailing water out of a boat with a hole in it. I would much rather just plug the hole. In other words, we need to determine where exactly these resources are leaking and fix that.

If you can work up a real reproducer (i.e. not something using mocks), even if it uses standalone brokers, I can take it from there.

It's worth noting that duplicate detection is on by default for cluster-connections and lots of folks are using a cluster. If this was simple to reproduce I imagine this problem would have been reported long ago so my guess is that there's something fairly unique about your use-case.

jbertram avatar Nov 21 '25 19:11 jbertram

In any event, you've got a handful of CheckStyle issues:

Error:  src/main/java/org/apache/activemq/artemis/core/postoffice/impl/PostOfficeImpl.java:[1591,17] (indentation) Indentation: 'try' child has incorrect indentation level 16, expected level should be 15.
Error:  src/main/java/org/apache/activemq/artemis/core/postoffice/impl/PostOfficeImpl.java:[1592,17] (indentation) Indentation: 'try' child has incorrect indentation level 16, expected level should be 15.
Error:  src/main/java/org/apache/activemq/artemis/core/postoffice/impl/PostOfficeImpl.java:[1593,17] (indentation) Indentation: 'try' child has incorrect indentation level 16, expected level should be 15.
Error:  src/main/java/org/apache/activemq/artemis/core/postoffice/impl/PostOfficeImpl.java:[1595,17] (indentation) Indentation: 'catch' child has incorrect indentation level 16, expected level should be 15.
Error:  src/test/java/org/apache/activemq/artemis/core/paging/impl/PagingStoreCleanupTest.java:[32,47] (imports) AvoidStarImport: Using the '.*' form of import should be avoided - org.junit.jupiter.api.Assertions.*.
Error:  src/test/java/org/apache/activemq/artemis/core/paging/impl/PagingStoreCleanupTest.java:[33,34] (imports) AvoidStarImport: Using the '.*' form of import should be avoided - org.mockito.Mockito.*.
Error:  src/test/java/org/apache/activemq/artemis/core/postoffice/impl/DuplicateIDCacheCleanupTest.java:[43,47] (imports) AvoidStarImport: Using the '.*' form of import should be avoided - org.junit.jupiter.api.Assertions.*.
Error:  src/test/java/org/apache/activemq/artemis/core/postoffice/impl/DuplicateIDCacheCleanupTest.java:[44,34] (imports) AvoidStarImport: Using the '.*' form of import should be avoided - org.mockito.Mockito.*.

For what it's worth, you can run this check locally using mvn checkstyle:check or the dev profile.

jbertram avatar Nov 21 '25 20:11 jbertram

I appreciate that the modifications to the reaper resolve the issue, but this is kind of like bailing water out of a boat with a hole in it. I would much rather just plug the hole. In other words, we need to determine where exactly these resources are leaking and fix that.

If you can work up a real reproducer (i.e. not something using mocks), even if it uses standalone brokers, I can take it from there.

It's worth noting that duplicate detection is on by default for cluster-connections and lots of folks are using a cluster. If this was simple to reproduce I imagine this problem would have been reported long ago so my guess is that there's something fairly unique about your use-case.

Yes, understood.

Over the past day i've been digging into this actually from another angle. I see that none of the addresses in those caches are auto-deleted as they should be and that is a bit baffling. I am cleaning the BRIDGE caches, but these caches seem like they should also be cleaned up.

While looking into this, i noticed we have a durable queue that is subscribed to the same address that all of these other queues are using - publish/#. I think this could be the core of my problem and the timing makes sense as to why we never encountered this until about 6 months ago as this is just about the time it was added. The queue is added from an external service so it wasn't clear until now.

Could you comment on that?

scottmf avatar Nov 21 '25 21:11 scottmf

Just to be uber clear, these are a sample of the queues. The issue is that the temporary queues aren't being cleaned up, and neither are their bridge counterparts. Could it be that fundamentally I've just his an edge case where amq should not clean these up, or is this the bug? (or am i off-track?)

(This is from jmx) non-durable / temporary queues:

        {
            "messageCount": "0",
            "messagesExpired": "0",
            "routingType": "MULTICAST",
            "address": "publish/#",
            "maxConsumers": "-1",
            "groupRebalance": "false",
            "consumerCount": "1",
            "temporary": "true",
            "groupBuckets": "-1",
            "enabled": "true",
            "autoDelete": "false",
            "scheduledCount": "0",
            "persistedPause": "false",
            "messagesAdded": "9529",
            "internalQueue": "false",
            "delayBeforeDispatch": "-1",
            "groupRebalancePauseDispatch": "false",
            "directDeliver": "false",
            "name": "67e6f22a-273c-44dd-ae82-7ba2a92c2a64",
            "durable": "false",
            "autoCreated": "false",
            "purgeOnNoConsumers": "false",
            "deliveringCount": "0",
            "ringSize": "-1",
            "messagesAcked": "9529",
            "exclusive": "false",
            "messagesKilled": "0",
            "paused": "false",
            "lastValueKey": "",
            "id": "60129556963",
            "filter": "",
            "consumersBeforeDispatch": "0",
            "groupFirstKey": "",
            "user": "brokerSystemUser",
            "lastValue": "false"
        },
        {
            "messageCount": "0",
            "messagesExpired": "0",
            "routingType": "MULTICAST",
            "address": "publish/#",
            "maxConsumers": "-1",
            "groupRebalance": "false",
            "consumerCount": "1",
            "temporary": "true",
            "groupBuckets": "-1",
            "enabled": "true",
            "autoDelete": "false",
            "scheduledCount": "0",
            "persistedPause": "false",
            "messagesAdded": "9529",
            "internalQueue": "false",
            "delayBeforeDispatch": "-1",
            "groupRebalancePauseDispatch": "false",
            "directDeliver": "false",
            "name": "d3c7a0a1-cac3-474f-893b-8d6530571ec7",
            "durable": "false",
            "autoCreated": "false",
            "purgeOnNoConsumers": "false",
            "deliveringCount": "0",
            "ringSize": "-1",
            "messagesAcked": "9529",
            "exclusive": "false",
            "messagesKilled": "0",
            "paused": "false",
            "lastValueKey": "",
            "id": "60129556972",
            "filter": "",
            "consumersBeforeDispatch": "0",
            "groupFirstKey": "",
            "user": "brokerSystemUser",
            "lastValue": "false"
        },

Here is the durable queue (temporary: false)

        {
            "messageCount": "0",
            "messagesExpired": "0",
            "routingType": "MULTICAST",
            "address": "publish/#",
            "maxConsumers": "-1",
            "groupRebalance": "false",
            "consumerCount": "2",
            "temporary": "false",
            "groupBuckets": "-1",
            "enabled": "true",
            "autoDelete": "false",
            "scheduledCount": "0",
            "persistedPause": "false",
            "messagesAdded": "4910",
            "internalQueue": "false",
            "delayBeforeDispatch": "-1",
            "groupRebalancePauseDispatch": "false",
            "directDeliver": "false",
            "name": "ebs-group.publish/#",
            "durable": "true",
            "autoCreated": "false",
            "purgeOnNoConsumers": "false",
            "deliveringCount": "0",
            "ringSize": "-1",
            "messagesAcked": "4910",
            "exclusive": "false",
            "messagesKilled": "0",
            "paused": "false",
            "lastValueKey": "",
            "id": "60129569335",
            "filter": "NOT ((AMQAddress = 'activemq.management') OR (AMQAddress = 'activemq.notifications'))",
            "consumersBeforeDispatch": "0",
            "groupFirstKey": "",
            "user": "user",
            "lastValue": "false"
        },

scottmf avatar Nov 22 '25 00:11 scottmf

I would suggest making this a draft and working through the issue some more until it can be determined how to best address the underlying problem.

tabish121 avatar Nov 26 '25 17:11 tabish121

I would suggest making this a draft and working through the issue some more until it can be determined how to best address the underlying problem.

Sounds good. I am actively working on reproducing this in a simplified form factor. I will make this a DRAFT until I have something more concrete in place to show the issue.

scottmf avatar Nov 26 '25 18:11 scottmf

@jbertram I just pushed an additional update that changes SimpleAddressManager.addressWasUsed(). It fixes the second issue I referenced in the repro. The repro appears to be completely fixed when using this change. All unit tests pass in the build, but it can definitely use some scrutiny.

In poking around the changes in this area, the issue that I'm seeing seems very similar to ARTEMIS-4162..

scottmf avatar Dec 02 '25 06:12 scottmf

I will run a few tests on this...

clebertsuconic avatar Dec 16 '25 19:12 clebertsuconic

This is a legitimate issue. From what I've seen, removal of the duplicate cache is only triggered when a binding on the address is removed. Removing the address won't do it.

I've got a relevant commit on https://github.com/jbertram/artemis/tree/ARTEMIS-5773.

jbertram avatar Dec 16 '25 19:12 jbertram

Just a note on this. We have been testing this on our scale systems and see that it is indeed stable. Memory and storage are not bloating and the system has been running smoothly for 1-2 weeks. If you require any code changes I’d be happy to get it tested on our end to verify.

scottmf avatar Dec 16 '25 21:12 scottmf

@jbertram perhaps we add both things...

is your fix also removing DuplicateCache and other things?

clebertsuconic avatar Dec 16 '25 23:12 clebertsuconic

The fix removes the duplicate cache and changes the criteria a little bit on address removal. If you could give that a second look I think that would be good

scottmf avatar Dec 17 '25 00:12 scottmf

I mixed your branch with Justin's on a CI, and the following tests are failing. I will have to take a close look

Run / Test / 4 / org.apache.activemq.artemis.tests.integration.client.AutoDeleteAddressTest.testAutoDeleteAddressWithWildcardSubscriptionAndUsageCheck Run / Test / 0 / org.apache.activemq.artemis.tests.integration.cluster.distribution.NettyOneWayTwoNodeClusterTest.testWildcardSubscription Run / Test / 0 / org.apache.activemq.artemis.tests.integration.cluster.distribution.OnewayTwoNodeClusterTest.testWildcardSubscription Run / Test / 3 / org.apache.activemq.artemis.tests.integration.client.WildCardRoutingTest.testWildcardRoutingIncrementsMessageCount Run / Test / 3 / org.apache.activemq.artemis.tests.integration.cluster.failover.PageCleanupWhileReplicaCatchupTest.testPageCleanup Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testDuplicateCachePersisted2() persistentCache=true Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testDuplicateCachePersisted2() persistentCache=false Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testPersistXA1() persistentCache=true Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testDuplicateCachePersisted() persistentCache=true Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testDuplicateCachePersisted() persistentCache=false Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testPersistTransactional() persistentCache=true Run / Test / 2 / org.apache.activemq.artemis.tests.integration.mqtt5.spec.controlpackets.PublishTests.testRetainHandlingOne Run / Test / 1 / org.apache.activemq.artemis.tests.integration.amqp.connect.AMQPBridgeToQueueTest.testBridgeRecoversLinkAfterFirstReceiverFailsWithResourceNotFound Run / Test / 1 / org.apache.activemq.artemis.tests.integration.cluster.distribution.SimpleSymmetricClusterTest.testX

clebertsuconic avatar Dec 17 '25 15:12 clebertsuconic

This was replaced by #6153.

jbertram avatar Dec 19 '25 04:12 jbertram