Akka.Cluster.Sharding: `Shard` can fail to `HandOff` indefinitely

Open • Aaronontheweb opened this issue 9 months ago • 1 comment

Version Information

Version of Akka.NET? v1.5.37
Which Akka.NET Modules? Akka.Cluster.Sharding

Describe the bug

This is a pretty rare bug as far as I can tell - today was the first time I've seen this log message in 12 years of working with Akka.NET:

https://github.com/akkadotnet/akka.net/blob/1f7ffa7479152b13f201434a6791156f7f18d213/src/contrib/cluster/Akka.Cluster.Sharding/ShardCoordinator.cs#L1850-L1853

Looking more closely at the issue, we see A LOT of unhandled HandOff messages over the course of 10-30 minutes:

2025-02-11 12:49:24.376 [INFO][02/11/2025 18:49:24.376Z][Thread 0003][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [86] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:48:14.376 [INFO][02/11/2025 18:48:14.376Z][Thread 0007][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [44] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:47:54.367 [INFO][02/11/2025 18:47:54.367Z][Thread 0033][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [74] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:46:44.371 [INFO][02/11/2025 18:46:44.371Z][Thread 0033][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [46] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:45:34.369 [INFO][02/11/2025 18:45:34.369Z][Thread 0033][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [7] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:44:34.369 [INFO][02/11/2025 18:44:34.368Z][Thread 0016][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [48] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:43:24.370 [INFO][02/11/2025 18:43:24.370Z][Thread 0025][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [17] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:42:24.367 [INFO][02/11/2025 18:42:24.367Z][Thread 0003][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [62] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:39:44.364 [INFO][02/11/2025 18:39:44.364Z][Thread 0010][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [83] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:38:34.373 [INFO][02/11/2025 18:38:34.373Z][Thread 0023][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [47] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:37:34.361 [INFO][02/11/2025 18:37:34.361Z][Thread 0024][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to 

This continues indefinitely.

To Reproduce

Not sure how to reproduce it yet.

Expected behavior

Shards should terminate their entities during a handoff and deallocate all entity actors.

Actual behavior

Not only did the shard not deallocate, but it looks like it didn't attempt to kill off any of its entity actors - otherwise the fail-safe from the HandOffStopper should have kicked in:

https://github.com/akkadotnet/akka.net/blob/6ffd304224925f376affb0de993eeb3e31d3fa11/src/contrib/cluster/Akka.Cluster.Sharding/ShardRegion.cs#L315-L324

This didn't happen, so it makes me think that the Shard got behavior-switched to a state where it couldn't receive HandOff messages long before actually attempting to hand off.
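
To make the hypothesis above concrete, here is a toy model in plain Python. It is not Akka.NET code and does not reflect the real `Shard` implementation: the state names `idle` and `stuck`, the `SwitchToStuck` trigger, and the `ToyShard` class are all invented for illustration. It only shows how an actor that has switched to a behavior with no `HandOff` handler would leave every `HandOff` unhandled and never start the step that stops its entities.

```python
from dataclasses import dataclass, field


@dataclass
class ToyShard:
    """Toy model of an actor with switchable behaviors (not Akka.NET code)."""

    entities: set = field(default_factory=lambda: {"entity-1", "entity-2"})
    behavior: str = "idle"  # hypothetical state name
    stopper_started: bool = False
    unhandled: list = field(default_factory=list)

    def receive(self, message: str) -> None:
        # Dispatch to the handler for the current behavior; record misses.
        handler = {"idle": self._idle, "stuck": self._stuck}[self.behavior]
        if not handler(message):
            self.unhandled.append(message)

    def _idle(self, message: str) -> bool:
        if message == "HandOff":
            # Healthy path: a stopper takes over and entities get stopped.
            self.stopper_started = True
            self.entities.clear()
            return True
        if message == "SwitchToStuck":  # hypothetical trigger
            self.behavior = "stuck"
            return True
        return False

    def _stuck(self, message: str) -> bool:
        # This behavior has no HandOff handler, so nothing is ever handled.
        return False


def simulate(stuck: bool, handoff_attempts: int) -> ToyShard:
    """Send repeated HandOff messages to a healthy or a stuck toy shard."""
    shard = ToyShard()
    if stuck:
        shard.receive("SwitchToStuck")
    for _ in range(handoff_attempts):
        shard.receive("HandOff")
    return shard
```

Calling `simulate(stuck=True, handoff_attempts=3)` leaves all three `HandOff` messages in `unhandled`, keeps both entities alive, and never sets `stopper_started`, which is the same shape as the symptoms described here: repeated unhandled `HandOff` logs and no sign of the fail-safe.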

Additional context

  • Happened when scaling the sharding system up to double its original node count
  • Custom entity handoff message was used

Aaronontheweb, Feb 11 '25 19:02

In my call notes with the affected user, I pointed out that this might be the "poisoned" behavior:

https://github.com/akkadotnet/akka.net/blob/322c494d19137d59e0d5bb5defccf5a822374c2c/src/contrib/cluster/Akka.Cluster.Sharding/Shard.cs#L1631-L1641

But again, the fail-safe from the HandOffStopper should have kicked in if that were the case.

Aaronontheweb, Feb 11 '25 19:02