ARTEMIS-5773: augment cleanup of caches
Added extra cleanup of PagingManagerImpl.stores and PostOfficeImpl.duplicateIDCache kicked off from PostOfficeImpl.reapAddresses()
I appreciate that the modifications to the reaper resolve the issue, but this is kind of like bailing water out of a boat with a hole in it. I would much rather just plug the hole. In other words, we need to determine where exactly these resources are leaking and fix that.
If you can work up a real reproducer (i.e. not something using mocks), even if it uses standalone brokers, I can take it from there.
It's worth noting that duplicate detection is on by default for cluster-connections and lots of folks are using a cluster. If this was simple to reproduce I imagine this problem would have been reported long ago so my guess is that there's something fairly unique about your use-case.
In any event, you've got a handful of CheckStyle issues:
Error: src/main/java/org/apache/activemq/artemis/core/postoffice/impl/PostOfficeImpl.java:[1591,17] (indentation) Indentation: 'try' child has incorrect indentation level 16, expected level should be 15.
Error: src/main/java/org/apache/activemq/artemis/core/postoffice/impl/PostOfficeImpl.java:[1592,17] (indentation) Indentation: 'try' child has incorrect indentation level 16, expected level should be 15.
Error: src/main/java/org/apache/activemq/artemis/core/postoffice/impl/PostOfficeImpl.java:[1593,17] (indentation) Indentation: 'try' child has incorrect indentation level 16, expected level should be 15.
Error: src/main/java/org/apache/activemq/artemis/core/postoffice/impl/PostOfficeImpl.java:[1595,17] (indentation) Indentation: 'catch' child has incorrect indentation level 16, expected level should be 15.
Error: src/test/java/org/apache/activemq/artemis/core/paging/impl/PagingStoreCleanupTest.java:[32,47] (imports) AvoidStarImport: Using the '.*' form of import should be avoided - org.junit.jupiter.api.Assertions.*.
Error: src/test/java/org/apache/activemq/artemis/core/paging/impl/PagingStoreCleanupTest.java:[33,34] (imports) AvoidStarImport: Using the '.*' form of import should be avoided - org.mockito.Mockito.*.
Error: src/test/java/org/apache/activemq/artemis/core/postoffice/impl/DuplicateIDCacheCleanupTest.java:[43,47] (imports) AvoidStarImport: Using the '.*' form of import should be avoided - org.junit.jupiter.api.Assertions.*.
Error: src/test/java/org/apache/activemq/artemis/core/postoffice/impl/DuplicateIDCacheCleanupTest.java:[44,34] (imports) AvoidStarImport: Using the '.*' form of import should be avoided - org.mockito.Mockito.*.
For what it's worth, you can run this check locally using mvn checkstyle:check or the dev profile.
I appreciate that the modifications to the reaper resolve the issue, but this is kind of like bailing water out of a boat with a hole in it. I would much rather just plug the hole. In other words, we need to determine where exactly these resources are leaking and fix that.
If you can work up a real reproducer (i.e. not something using mocks), even if it uses standalone brokers, I can take it from there.
It's worth noting that duplicate detection is on by default for cluster-connections and lots of folks are using a cluster. If this was simple to reproduce I imagine this problem would have been reported long ago so my guess is that there's something fairly unique about your use-case.
Yes, understood.
Over the past day i've been digging into this actually from another angle. I see that none of the addresses in those caches are auto-deleted as they should be and that is a bit baffling. I am cleaning the BRIDGE caches, but these caches seem like they should also be cleaned up.
While looking into this, i noticed we have a durable queue that is subscribed to the same address that all of these other queues are using - publish/#. I think this could be the core of my problem and the timing makes sense as to why we never encountered this until about 6 months ago as this is just about the time it was added. The queue is added from an external service so it wasn't clear until now.
Could you comment on that?
Just to be uber clear, these are a sample of the queues. The issue is that the temporary queues aren't being cleaned up, and neither are their bridge counterparts. Could it be that fundamentally I've just his an edge case where amq should not clean these up, or is this the bug? (or am i off-track?)
(This is from jmx) non-durable / temporary queues:
{
"messageCount": "0",
"messagesExpired": "0",
"routingType": "MULTICAST",
"address": "publish/#",
"maxConsumers": "-1",
"groupRebalance": "false",
"consumerCount": "1",
"temporary": "true",
"groupBuckets": "-1",
"enabled": "true",
"autoDelete": "false",
"scheduledCount": "0",
"persistedPause": "false",
"messagesAdded": "9529",
"internalQueue": "false",
"delayBeforeDispatch": "-1",
"groupRebalancePauseDispatch": "false",
"directDeliver": "false",
"name": "67e6f22a-273c-44dd-ae82-7ba2a92c2a64",
"durable": "false",
"autoCreated": "false",
"purgeOnNoConsumers": "false",
"deliveringCount": "0",
"ringSize": "-1",
"messagesAcked": "9529",
"exclusive": "false",
"messagesKilled": "0",
"paused": "false",
"lastValueKey": "",
"id": "60129556963",
"filter": "",
"consumersBeforeDispatch": "0",
"groupFirstKey": "",
"user": "brokerSystemUser",
"lastValue": "false"
},
{
"messageCount": "0",
"messagesExpired": "0",
"routingType": "MULTICAST",
"address": "publish/#",
"maxConsumers": "-1",
"groupRebalance": "false",
"consumerCount": "1",
"temporary": "true",
"groupBuckets": "-1",
"enabled": "true",
"autoDelete": "false",
"scheduledCount": "0",
"persistedPause": "false",
"messagesAdded": "9529",
"internalQueue": "false",
"delayBeforeDispatch": "-1",
"groupRebalancePauseDispatch": "false",
"directDeliver": "false",
"name": "d3c7a0a1-cac3-474f-893b-8d6530571ec7",
"durable": "false",
"autoCreated": "false",
"purgeOnNoConsumers": "false",
"deliveringCount": "0",
"ringSize": "-1",
"messagesAcked": "9529",
"exclusive": "false",
"messagesKilled": "0",
"paused": "false",
"lastValueKey": "",
"id": "60129556972",
"filter": "",
"consumersBeforeDispatch": "0",
"groupFirstKey": "",
"user": "brokerSystemUser",
"lastValue": "false"
},
Here is the durable queue (temporary: false)
{
"messageCount": "0",
"messagesExpired": "0",
"routingType": "MULTICAST",
"address": "publish/#",
"maxConsumers": "-1",
"groupRebalance": "false",
"consumerCount": "2",
"temporary": "false",
"groupBuckets": "-1",
"enabled": "true",
"autoDelete": "false",
"scheduledCount": "0",
"persistedPause": "false",
"messagesAdded": "4910",
"internalQueue": "false",
"delayBeforeDispatch": "-1",
"groupRebalancePauseDispatch": "false",
"directDeliver": "false",
"name": "ebs-group.publish/#",
"durable": "true",
"autoCreated": "false",
"purgeOnNoConsumers": "false",
"deliveringCount": "0",
"ringSize": "-1",
"messagesAcked": "4910",
"exclusive": "false",
"messagesKilled": "0",
"paused": "false",
"lastValueKey": "",
"id": "60129569335",
"filter": "NOT ((AMQAddress = 'activemq.management') OR (AMQAddress = 'activemq.notifications'))",
"consumersBeforeDispatch": "0",
"groupFirstKey": "",
"user": "user",
"lastValue": "false"
},
I would suggest making this a draft and working through the issue some more until it can be determined how to best address the underlying problem.
I would suggest making this a draft and working through the issue some more until it can be determined how to best address the underlying problem.
Sounds good. I am actively working on reproducing this in a simplified form factor. I will make this a DRAFT until I have something more concrete in place to show the issue.
@jbertram I just pushed an additional update that changes SimpleAddressManager.addressWasUsed(). It fixes the second issue I referenced in the repro. The repro appears to be completely fixed when using this change. All unit tests pass in the build, but it can definitely use some scrutiny.
In poking around the changes in this area, the issue that I'm seeing seems very similar to ARTEMIS-4162..
I will run a few tests on this...
This is a legitimate issue. From what I've seen, removal of the duplicate cache is only triggered when a binding on the address is removed. Removing the address won't do it.
I've got a relevant commit on https://github.com/jbertram/artemis/tree/ARTEMIS-5773.
Just a note on this. We have been testing this on our scale systems and see that it is indeed stable. Memory and storage are not bloating and the system has been running smoothly for 1-2 weeks. If you require any code changes I’d be happy to get it tested on our end to verify.
@jbertram perhaps we add both things...
is your fix also removing DuplicateCache and other things?
The fix removes the duplicate cache and changes the criteria a little bit on address removal. If you could give that a second look I think that would be good
I mixed your branch with Justin's on a CI, and the following tests are failing. I will have to take a close look
Run / Test / 4 / org.apache.activemq.artemis.tests.integration.client.AutoDeleteAddressTest.testAutoDeleteAddressWithWildcardSubscriptionAndUsageCheck Run / Test / 0 / org.apache.activemq.artemis.tests.integration.cluster.distribution.NettyOneWayTwoNodeClusterTest.testWildcardSubscription Run / Test / 0 / org.apache.activemq.artemis.tests.integration.cluster.distribution.OnewayTwoNodeClusterTest.testWildcardSubscription Run / Test / 3 / org.apache.activemq.artemis.tests.integration.client.WildCardRoutingTest.testWildcardRoutingIncrementsMessageCount Run / Test / 3 / org.apache.activemq.artemis.tests.integration.cluster.failover.PageCleanupWhileReplicaCatchupTest.testPageCleanup Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testDuplicateCachePersisted2() persistentCache=true Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testDuplicateCachePersisted2() persistentCache=false Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testPersistXA1() persistentCache=true Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testDuplicateCachePersisted() persistentCache=true Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testDuplicateCachePersisted() persistentCache=false Run / Test / 2 / org.apache.activemq.artemis.tests.integration.DuplicateDetectionTest.testPersistTransactional() persistentCache=true Run / Test / 2 / org.apache.activemq.artemis.tests.integration.mqtt5.spec.controlpackets.PublishTests.testRetainHandlingOne Run / Test / 1 / org.apache.activemq.artemis.tests.integration.amqp.connect.AMQPBridgeToQueueTest.testBridgeRecoversLinkAfterFirstReceiverFailsWithResourceNotFound Run / Test / 1 / org.apache.activemq.artemis.tests.integration.cluster.distribution.SimpleSymmetricClusterTest.testX
This was replaced by #6153.