atomix
atomix copied to clipboard
Leak on server shutdown while still awaiting other nodes to join
Expected behavior
Restarting Atomix
configured to use N raft nodes, bit still not connected to any nodes, should not leak any Atomix
instance.
Actual behavior
DefaultRaftServer::shutdown()
is not closing it's RaftContext
and cleaning up RaftContext::threadContext
tasks.
Steps to reproduce
Create a single raft node (expecting N nodes in total), start it and stop it right after, while awaiting it to stop.
After running a full GC, the Atomix
instance is leaking (some screenshots below).
Minimal yet complete reproducer code (or URL to code)
public static Atomix createAtomix(String localMemberId,
File dataDirectory,
Map<String, Address> nodes) {
final Address localAddress = nodes.get(localMemberId);
if (localAddress == null) {
throw new IllegalArgumentException("the local member id should been included in the node map");
}
final AtomixBuilder atomixBuilder = Atomix.builder().withMemberId(localMemberId).withAddress(localAddress);
atomixBuilder
.withMembershipProvider(BootstrapDiscoveryProvider.builder()
.withNodes(
nodes.entrySet().stream()
.map(entry-> Node.builder()
.withId(entry.getKey())
.withAddress(entry.getValue())
.build())
.collect(Collectors.toList())).build());
// using Profile.consensus(members) is a short-cut of this but it won't left any config choice
atomixBuilder
.withManagementGroup(
RaftPartitionGroup.builder("system")
.withNumPartitions(1)
.withMembers(nodes.keySet())
.withStorageLevel(StorageLevel.DISK)
.withDataDirectory(new File(dataDirectory, "management"))
.build())
.withPartitionGroups(
RaftPartitionGroup.builder("data")
.withNumPartitions(1)
.withMembers(nodes.keySet())
.withStorageLevel(StorageLevel.DISK)
.withDataDirectory(new File(dataDirectory, "data"))
.build());
return atomixBuilder.build();
}
@Test
public void atomixLeak() {
File f = new File("./atomix");
f.deleteOnExit();
final String localId = "a";
final Address localAddress = Address.from("localhost:7070");
final Map<String, Address> nodes = new HashMap<>(3);
nodes.put(localId, localAddress);
nodes.put("b", Address.from("localhost:7071"));
nodes.put("c", Address.from("localhost:7072"));
Atomix atomix = createAtomix("a", f, nodes);
try {
// wait a bit in order to get the RaftServer::start called
atomix.start().get(2, TimeUnit.SECONDS);
Assert.fail();
} catch (TimeoutException te) {
try {
atomix.stop().join();
} catch (Throwable t) {
Assert.fail();
}
} catch (Throwable t) {
Assert.fail();
}
}
It's important to take an heap snapshot after atomix.stop().join()
is completed.
I'm searching how to implement a minimal reproducer using just RaftServer
for the PR I've sent to fix this.
- Atomix:
3.2.0-SNAPSHOT
Atomix
path to the GC root:
and this is the RaftContext::threadContext
that reference the ScheduledThreadPoolExecutor
and its DelayedWorkQueue
, that's keep alive the Atomix
ref as the previous screenshot shows: