JAVA-3051: Memory leak
We explored several alternatives for unit testing this memory leak fix but ultimately decided not to test it, as we could not find a reliable way to test the garbage collector's behavior.
The unit test we wrote and considered:
https://github.com/apache/cassandra-java-driver/blob/cb61eb8c34d1dd7417c03fec95d6e13e9185d537/core/src/test/java/com/datastax/oss/driver/internal/core/metadata/LoadBalancingPolicyWrapperMemoryLeakTest.java#L118-L142
This test:
- creates two `DefaultNode` instances
- creates two `WeakReference`s pointing to the nodes, just to probe their existence later
- initializes the policy
- clears all the strong references
- requests garbage collection
- verifies that the nodes are collected
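For reference, a minimal sketch of this pattern (not the actual test linked above, which exercises `DefaultNode` and the load balancing policy; here a plain `Object` and Awaitility stand in for the real fixtures):

```java
import static org.awaitility.Awaitility.await;

import java.lang.ref.WeakReference;
import java.util.concurrent.TimeUnit;
import org.junit.Test;

public class WeakReferenceCollectionSketchTest {

  @Test
  public void should_release_object_once_strong_references_are_cleared() {
    // Stand-in for the DefaultNode instances used in the real test.
    Object node = new Object();
    WeakReference<Object> weakReference = new WeakReference<>(node);

    // Drop the only strong reference so the object becomes eligible for collection.
    node = null;

    // System.gc() is only a hint; poll until the referent is actually cleared.
    await()
        .atMost(10, TimeUnit.SECONDS)
        .until(
            () -> {
              System.gc();
              return weakReference.get() == null;
            });
  }
}
```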
We checked that:
- In my local environment (Zulu 8.72.0.17-CA-macos-aarch64), this test succeeds, and if I revert the change from `WeakHashMap` back to the strong `HashMap`, this test fails.
- Before all the strong references are cleared, inspecting the memory shows that the objects referring to `weakNode2` are:
  a. `weakNode2` and `weakReference2`
  b. `InterceptedInvocation`s created by Mockito
  c. the `HashMap` of `allNodes` stored by `when(metadata.getNodes()).thenReturn(allNodes);`
  d. the value wanted in the `equals` check in `await().atMost(10, TimeUnit.SECONDS).until(() -> weakReference2.get() == null);`

  These are all expected, and no reference is leaked.
- If `evaluateDistance` of the nodes does not return `IGNORED`, they will be stored in `BasicLoadBalancingPolicy.liveNodes`, but nodes there can be removed later by `onDown` or `onRemoved`. We assume this is intended.
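To illustrate why the switch matters, here is a hedged, self-contained sketch (plain `Object` keys standing in for nodes, not the driver's actual data structures): a `WeakHashMap` drops an entry once its key is no longer strongly reachable, while a regular `HashMap` keeps it alive for as long as the map itself lives.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

public class WeakHashMapSketch {

  public static void main(String[] args) throws InterruptedException {
    Map<Object, String> strongMap = new HashMap<>();
    Map<Object, String> weakMap = new WeakHashMap<>();

    Object strongKey = new Object();
    Object weakKey = new Object();
    strongMap.put(strongKey, "held for as long as the map lives");
    weakMap.put(weakKey, "dropped once the key is unreachable");

    // Clear the only strong references to the keys.
    strongKey = null;
    weakKey = null;

    // GC is only a hint, so in practice this may need to be retried.
    System.gc();
    Thread.sleep(100);

    System.out.println("HashMap size:     " + strongMap.size()); // typically 1
    System.out.println("WeakHashMap size: " + weakMap.size());   // typically 0 after GC
  }
}
```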
We considered that:
According to this post, `System.gc()` is more of a request/hint that some JVMs will ignore, and there is no reliable way to force garbage collection. This means the test above may fail in other environments, and the last thing we want is a flaky test.
We think workarounds like generating a huge amount of garbage to trigger garbage collection may not be worth it, either.
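For completeness, this is the kind of workaround we mean and decided against; a sketch (the helper name and allocation sizes are ours, not anything from the driver) that applies allocation pressure until either the referent is collected or the heap runs out:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

public class ForceGcByGarbageSketch {

  // Keep allocating until the weak reference is cleared or we run out of heap.
  // This is exactly the kind of fragile workaround we decided was not worth it.
  static boolean waitForCollection(WeakReference<?> ref) {
    List<byte[]> garbage = new ArrayList<>();
    try {
      while (ref.get() != null) {
        garbage.add(new byte[1024 * 1024]); // 1 MiB of allocation pressure per iteration
      }
      return true;
    } catch (OutOfMemoryError e) {
      // By the time OutOfMemoryError is thrown, the GC has already run exhaustively,
      // so an unreachable referent should have been cleared.
      garbage = null; // release the pressure before returning
      return ref.get() == null;
    }
  }
}
```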
Therefore, we concluded that having no test may be the best choice for now, and that the checks we performed above may be sufficient.
Very much agreed @aratno that the underlying issue here appears to be with AWS Keyspaces; that's being addressed in a different ticket. The scope of this change is preventing the (potentially indefinite) caching of `Node` instances within an LBP.
We still need to figure out how `NodeStateEvent` and `DistanceEvent` refer to their `node`. Other than that, the updated PR should have resolved all the other issues.
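To illustrate the concern (with hypothetical stand-in classes, not the driver's actual `NodeStateEvent`/`DistanceEvent`): as long as an event object holding a strong reference to a node is itself retained somewhere (a queue, a cache, a listener), that node cannot be collected regardless of what the policy does.

```java
import java.lang.ref.WeakReference;

public class EventRetentionSketch {

  // Hypothetical stand-ins for the driver's node and event classes.
  static final class Node {}

  static final class StateEvent {
    final Node node; // a strong field like this keeps the node reachable

    StateEvent(Node node) {
      this.node = node;
    }
  }

  public static void main(String[] args) {
    Node node = new Node();
    WeakReference<Node> weakNode = new WeakReference<>(node);

    // While this event is retained, the node it points to stays strongly reachable.
    StateEvent retainedEvent = new StateEvent(node);

    node = null;
    System.gc();

    System.out.println("collected: " + (weakNode.get() == null)); // false while retainedEvent lives
    System.out.println(retainedEvent.node); // keeps the event (and thus the node) alive
  }
}
```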
Solved the `NodeState` and `NodeDistance` problem as discussed. All the issues should be resolved now. I would appreciate a review :)