java-driver icon indicating copy to clipboard operation
java-driver copied to clipboard

JAVA-3051: Memory leak

Open SiyaoIsHiding opened this issue 1 year ago • 4 comments

SiyaoIsHiding avatar Nov 02 '23 22:11 SiyaoIsHiding

We explored several alternatives for unit testing this memory leak fix but decided to give up testing it, as we could not find a reliable way to test the garbage collector's behavior.

The unit test we wrote and considered:

https://github.com/apache/cassandra-java-driver/blob/cb61eb8c34d1dd7417c03fec95d6e13e9185d537/core/src/test/java/com/datastax/oss/driver/internal/core/metadata/LoadBalancingPolicyWrapperMemoryLeakTest.java#L118-L142

This test:

  1. creates two DefaultNode
  2. creates two WeakReference pointing to the nodes, just to poke their existence later
  3. initializes the policy
  4. clear all the strong references
  5. requests for garbage collection
  6. verify the node is collected

We checked that

  1. In my local environment (Zulu 8.72.0.17-CA-macos-aarch64), this test will succeed, and if I revert the changes from WeakHashMap to the strong HashMap, this test will fail.
  2. Before all the strong references are cleared, poking the memory, the referring objects to the weakNode2 are: a. weakNode2 and weakReference2 b. InterceptedInvocations by Mockito c. HashMap of allNodes stored in when(metadata.getNodes()).thenReturn(allNodes); d. wanted in Equals statement in await().atMost(10, TimeUnit.SECONDS).until(() -> weakReference2.get() == null); These are all expected and no reference is leaked.
  3. If evaluateDistance of nodes does not return IGNORED, they will be stored in BasicLoadBalancingPolicy.liveNodes. But nodes there can be removed later by onDown or onRemoved. We suppose this is intended.

We considered that

According to this post, System.gc() is more like a request/hint that some JVM will ignore, and there is no reliable way to force garbage collection. This means the test above may fail in other environments, but the last thing we want is a flaky test.

We think workarounds like generating a huge amount of garbage to trigger garbage collection may not be worth it, either.

Therefore, we concluded that no test may be the best choice for now, and the checks we perform above may be sufficient.

SiyaoIsHiding avatar Nov 06 '23 20:11 SiyaoIsHiding

Very much agreed that the underlying issue here appears to be an issue with AWS Keyspaces @aratno; that's being addressed in a different ticket. The scope of this change is around preventing the (potentially indefinite) caching of Node instances within an LBP.

absurdfarce avatar Feb 05 '24 23:02 absurdfarce

We still need to figure out how NodeStateEvent and DistanceEvent refer to their node. Other than that, the update PR should have resolved all the other issues.

SiyaoIsHiding avatar Jun 20 '24 06:06 SiyaoIsHiding

Solved the NodeState and NodeDistance problem as discussed. All the issues should be resolved now. I would appreciate a review :)

SiyaoIsHiding avatar Jun 27 '24 05:06 SiyaoIsHiding