JAVA-3051: Memory leak
We explored several alternatives for unit testing this memory leak fix but ultimately decided not to test it, as we could not find a reliable way to test the garbage collector's behavior.
The unit test we wrote and considered:
https://github.com/apache/cassandra-java-driver/blob/cb61eb8c34d1dd7417c03fec95d6e13e9185d537/core/src/test/java/com/datastax/oss/driver/internal/core/metadata/LoadBalancingPolicyWrapperMemoryLeakTest.java#L118-L142
This test:
- creates two `DefaultNode` instances
- creates two `WeakReference`s pointing to the nodes, just to probe their existence later
- initializes the policy
- clears all the strong references
- requests garbage collection
- verifies that the nodes are collected
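For reference, a minimal sketch of this pattern (not the actual test linked above, which exercises `DefaultNode` and the load balancing policy; here a plain `Object` and Awaitility stand in for the real fixtures):

```java
import static org.awaitility.Awaitility.await;

import java.lang.ref.WeakReference;
import java.util.concurrent.TimeUnit;
import org.junit.Test;

public class WeakReferenceCollectionSketchTest {

  @Test
  public void should_release_object_once_strong_references_are_cleared() {
    // Stand-in for the DefaultNode instances used in the real test.
    Object node = new Object();
    WeakReference<Object> weakReference = new WeakReference<>(node);

    // Drop the only strong reference so the object becomes eligible for collection.
    node = null;

    // System.gc() is only a hint; poll until the referent is actually cleared.
    await()
        .atMost(10, TimeUnit.SECONDS)
        .until(
            () -> {
              System.gc();
              return weakReference.get() == null;
            });
  }
}
```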
We checked that:
- In my local environment (Zulu 8.72.0.17-CA-macos-aarch64), this test succeeds, and if I revert the change from `WeakHashMap` back to the strong `HashMap`, this test fails.
- Before all the strong references are cleared, inspecting the memory shows that the objects referring to `weakNode2` are:
  a. `weakNode2` and `weakReference2`
  b. `InterceptedInvocation`s created by Mockito
  c. the `HashMap` of `allNodes` stored by `when(metadata.getNodes()).thenReturn(allNodes);`
  d. the value wanted in the `equals` check in `await().atMost(10, TimeUnit.SECONDS).until(() -> weakReference2.get() == null);`

  These are all expected, and no reference is leaked.
- If `evaluateDistance` of the nodes does not return `IGNORED`, they will be stored in `BasicLoadBalancingPolicy.liveNodes`, but nodes there can be removed later by `onDown` or `onRemoved`. We assume this is intended.
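To illustrate why the switch matters, here is a hedged, self-contained sketch (plain `Object` keys standing in for nodes, not the driver's actual data structures): a `WeakHashMap` drops an entry once its key is no longer strongly reachable, while a regular `HashMap` keeps it alive for as long as the map itself lives.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

public class WeakHashMapSketch {

  public static void main(String[] args) throws InterruptedException {
    Map<Object, String> strongMap = new HashMap<>();
    Map<Object, String> weakMap = new WeakHashMap<>();

    Object strongKey = new Object();
    Object weakKey = new Object();
    strongMap.put(strongKey, "held for as long as the map lives");
    weakMap.put(weakKey, "dropped once the key is unreachable");

    // Clear the only strong references to the keys.
    strongKey = null;
    weakKey = null;

    // GC is only a hint, so in practice this may need to be retried.
    System.gc();
    Thread.sleep(100);

    System.out.println("HashMap size:     " + strongMap.size()); // typically 1
    System.out.println("WeakHashMap size: " + weakMap.size());   // typically 0 after GC
  }
}
```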
We considered that:
According to this post, `System.gc()` is more of a request/hint that some JVMs will ignore, and there is no reliable way to force garbage collection. This means the test above may fail in other environments, and the last thing we want is a flaky test.
We think workarounds like generating a huge amount of garbage to trigger garbage collection may not be worth it, either.
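For completeness, this is the kind of workaround we mean and decided against; a sketch (the helper name and allocation sizes are ours, not anything from the driver) that applies allocation pressure until either the referent is collected or the heap runs out:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

public class ForceGcByGarbageSketch {

  // Keep allocating until the weak reference is cleared or we run out of heap.
  // This is exactly the kind of fragile workaround we decided was not worth it.
  static boolean waitForCollection(WeakReference<?> ref) {
    List<byte[]> garbage = new ArrayList<>();
    try {
      while (ref.get() != null) {
        garbage.add(new byte[1024 * 1024]); // 1 MiB of allocation pressure per iteration
      }
      return true;
    } catch (OutOfMemoryError e) {
      // By the time OutOfMemoryError is thrown, the GC has already run exhaustively,
      // so an unreachable referent should have been cleared.
      garbage = null; // release the pressure before returning
      return ref.get() == null;
    }
  }
}
```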
Therefore, we concluded that having no test may be the best choice for now, and that the checks we performed above may be sufficient.
Very much agreed @aratno that the underlying issue here appears to be with AWS Keyspaces; that's being addressed in a different ticket. The scope of this change is preventing the (potentially indefinite) caching of `Node` instances within an LBP.
We still need to figure out how `NodeStateEvent` and `DistanceEvent` refer to their `node`. Other than that, the updated PR should have resolved all the other issues.
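To illustrate the concern (with hypothetical stand-in classes, not the driver's actual `NodeStateEvent`/`DistanceEvent`): as long as an event object holding a strong reference to a node is itself retained somewhere (a queue, a cache, a listener), that node cannot be collected regardless of what the policy does.

```java
import java.lang.ref.WeakReference;

public class EventRetentionSketch {

  // Hypothetical stand-ins for the driver's node and event classes.
  static final class Node {}

  static final class StateEvent {
    final Node node; // a strong field like this keeps the node reachable

    StateEvent(Node node) {
      this.node = node;
    }
  }

  public static void main(String[] args) {
    Node node = new Node();
    WeakReference<Node> weakNode = new WeakReference<>(node);

    // While this event is retained, the node it points to stays strongly reachable.
    StateEvent retainedEvent = new StateEvent(node);

    node = null;
    System.gc();

    System.out.println("collected: " + (weakNode.get() == null)); // false while retainedEvent lives
    System.out.println(retainedEvent.node); // keeps the event (and thus the node) alive
  }
}
```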
Solved the `NodeState` and `NodeDistance` problem as discussed. All the issues should be resolved now. I would appreciate a review :)