Migrate Cluster system to Swift structured concurrency

Open akbashev opened this issue 6 months ago • 5 comments

The cluster system was written before Swift structured concurrency was introduced, and it uses its own concurrency runtime. That works fine, but it adds complexity and creates problems for further development: one must keep two concurrency models in mind and provide ways to glue them together. Maintaining a dedicated concurrency runtime while dealing with both models at once also scales poorly.
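
To illustrate the kind of glue this requires, here is a minimal sketch. It assumes a hypothetical callback-based function, `legacyAsk`, standing in for an API of the old runtime; neither `legacyAsk` nor `ask` is part of this library.

```swift
import Dispatch

// Hypothetical stand-in for a callback-based API of the old runtime.
// It is NOT part of swift-distributed-actors.
func legacyAsk(
    _ message: String,
    completion: @escaping @Sendable (Result<String, Error>) -> Void
) {
    // Complete on a background queue, as a callback-driven runtime might.
    DispatchQueue.global().async {
        completion(.success("echo: \(message)"))
    }
}

// Glue: expose the callback API to async/await callers.
// The continuation must be resumed exactly once.
func ask(_ message: String) async throws -> String {
    try await withCheckedThrowingContinuation { continuation in
        legacyAsk(message) { result in
            continuation.resume(with: result)
        }
    }
}
```

Every boundary between the two models needs a wrapper of this kind, which is the maintenance cost described above.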

The natural solution is to move the library to structured concurrency. There are two possible ways to start this migration:

  • Remove the old concurrency runtime step by step, class by class, and replace it with structured concurrency.
  • Rewrite the system from scratch and use structured concurrency directly.

The first way seems safe, but it is not straightforward: even though the old actor references and behavior-related files are marked as deprecated, the system is built on top of those abstractions, so touching any of them leads to a massive chain of changes.

The second seems massive and breaking, but it is actually far more straightforward: we already have a clear picture of the system and just need to rewrite or refactor it to the desired state. It also gives us an opportunity to refactor related repositories (e.g. swift-cluster-membership) with cleaner APIs.

I talked with @ktoso, and he proposed going the full-rewrite way with the following steps:

  • [x] Start a new system implementation that only does the remote call: a static list of nodes, no actor references or behaviors.
  • [ ] Make SWIM (swift-cluster-membership) purely async and remove the protocols from it.
  • [ ] Adopt the new cluster membership in the system for failure detection.
  • [ ] Use some consensus (e.g., Raft) for the membership.
  • [ ] Add plugin infrastructure, so we can have the singleton and the ability to extend the system (event sourcing, virtual actors).
  • [ ] In the network layer, take care of nodes joining each other; there is a race there.
  • [ ] Modernize the NIO networking layer with Swift concurrency features.
  • [ ] (optional) Add tracing for easier debugging and observability.
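
As a rough illustration of the first step, the sketch below shows the call-site shape of a remote call with Swift's `distributed actor` feature. It uses `LocalTestingDistributedActorSystem` from the standard `Distributed` module (Swift 5.7+, macOS 13+ on Apple platforms) purely as a stand-in for the new cluster system. The `Greeter` actor and its method are hypothetical, and a real system would add networking and the static node list.

```swift
import Distributed

// Hypothetical actor used only for illustration.
distributed actor Greeter {
    // Stand-in system from the standard library; the new cluster
    // system would take its place here.
    typealias ActorSystem = LocalTestingDistributedActorSystem

    // Distributed methods are the only ones callable across nodes;
    // parameters and results must be Codable for this system.
    distributed func greet(name: String) -> String {
        "Hello, \(name)!"
    }
}

let system = LocalTestingDistributedActorSystem()
let greeter = Greeter(actorSystem: system)

// Calls from outside the actor are implicitly async and throwing,
// because in general they may cross a network boundary.
let reply = try await greeter.greet(name: "cluster")
print(reply)
```

No actor references or behaviors appear at the call site: the actor system sits behind an ordinary `try await` method call.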

Note that this is all high level, so each step can be clarified if needed. Even so, the steps are all part of one big refactor, so a careful approach should be considered for refactoring and submitting PRs.

akbashev avatar May 31 '25 10:05 akbashev

Thanks for the issue and the steps here. Yep, I think this approach is the most likely to succeed. Ping me anytime if you'd like to chat.

ktoso avatar May 31 '25 10:05 ktoso

> so careful approach should be considered for refactoring and submitting PRs.

@ktoso can we create a separate refactoring branch in the repo for that? My idea is to create a NewClusterSystem package, gradually move or delete files from the old system, and open a PR for each step.

akbashev avatar May 31 '25 10:05 akbashev

@ktoso I wish, but this is not complete yet 😁

akbashev avatar Oct 17 '25 06:10 akbashev