atomix Leak on server shutdown while still awaiting other nodes to join

Leak on server shutdown while still awaiting other nodes to join

Open franz1981 opened this issue 3 years ago • 1 comments

Expected behavior

Restarting Atomix configured to use N raft nodes, bit still not connected to any nodes, should not leak any Atomix instance.

Actual behavior

DefaultRaftServer::shutdown() is not closing it's RaftContext and cleaning up RaftContext::threadContext tasks.

Steps to reproduce

Create a single raft node (expecting N nodes in total), start it and stop it right after, while awaiting it to stop. After running a full GC, the Atomix instance is leaking (some screenshots below).

Minimal yet complete reproducer code (or URL to code)

   public static Atomix createAtomix(String localMemberId,
                                     File dataDirectory,
                                     Map<String, Address> nodes) {
      final Address localAddress = nodes.get(localMemberId);
      if (localAddress == null) {
         throw new IllegalArgumentException("the local member id should been included in the node map");
      }
      final AtomixBuilder atomixBuilder = Atomix.builder().withMemberId(localMemberId).withAddress(localAddress);
      atomixBuilder
         .withMembershipProvider(BootstrapDiscoveryProvider.builder()
                                    .withNodes(
                                       nodes.entrySet().stream()
                                          .map(entry-> Node.builder()
                                             .withId(entry.getKey())
                                             .withAddress(entry.getValue())
                                             .build())
                                          .collect(Collectors.toList())).build());
      // using Profile.consensus(members) is a short-cut of this but it won't left any config choice
      atomixBuilder
         .withManagementGroup(
            RaftPartitionGroup.builder("system")
               .withNumPartitions(1)
               .withMembers(nodes.keySet())
               .withStorageLevel(StorageLevel.DISK)
               .withDataDirectory(new File(dataDirectory, "management"))
               .build())
         .withPartitionGroups(
            RaftPartitionGroup.builder("data")
               .withNumPartitions(1)
               .withMembers(nodes.keySet())
               .withStorageLevel(StorageLevel.DISK)
               .withDataDirectory(new File(dataDirectory, "data"))
               .build());
      return atomixBuilder.build();
   }

   @Test
   public void atomixLeak() {
      File f = new File("./atomix");
      f.deleteOnExit();
      final String localId = "a";
      final Address localAddress = Address.from("localhost:7070");
      final Map<String, Address> nodes = new HashMap<>(3);
      nodes.put(localId, localAddress);
      nodes.put("b", Address.from("localhost:7071"));
      nodes.put("c", Address.from("localhost:7072"));
      Atomix atomix = createAtomix("a", f, nodes);
      try {
         // wait a bit in order to get the RaftServer::start called
         atomix.start().get(2, TimeUnit.SECONDS);
         Assert.fail();
      } catch (TimeoutException te) {
         try {
            atomix.stop().join();
         } catch (Throwable t) {
            Assert.fail();
         }
      } catch (Throwable t) {
         Assert.fail();
      }
   }

It's important to take an heap snapshot after atomix.stop().join() is completed. I'm searching how to implement a minimal reproducer using just RaftServer for the PR I've sent to fix this.

Atomix: 3.2.0-SNAPSHOT

Apr 15 '21 06:04 franz1981

Atomix path to the GC root: Screenshot from 2021-04-14 21-54-22

and this is the RaftContext::threadContext that reference the ScheduledThreadPoolExecutor and its DelayedWorkQueue, that's keep alive the Atomix ref as the previous screenshot shows:

Apr 15 '21 06:04 franz1981

atomix atomix copied to clipboard

Leak on server shutdown while still awaiting other nodes to join

atomix
atomix copied to clipboard