swindon
Fix inactivity callback in clustered setup
Well, I don't understand how 2dfe91a fixes the issue.
The proposed strategy is:
1. Sync all inactivity timers across all replicating nodes, probably by batching updates with 100 ms to 1 s of latency.
2. Split the session namespace into buckets using consistent hashing, and assign a 1/n share of the sessions to every node.
3. Notify the other nodes about inactivity callbacks that were already sent, using a technique similar to (1).
4. Assign buckets to the next servers with a delay, i.e.:
   - buckets of server2 to server1 with a delay of 10 seconds
   - buckets of server3 to server2 with a delay of 10 seconds
   - buckets of server3 to server1 with a delay of 20 seconds, and so on
5. Cancel calling the handler if another server reports that it has already sent the callback.
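
To make the delay schedule concrete, here is a rough Python sketch of the bucket assignment and takeover delays. It is only an illustration, not swindon code: `NUM_BUCKETS`, `bucket_of`, `owner_of` and `takeover_delay` are hypothetical names, servers are numbered from 0, a plain modulo stands in for real consistent hashing, and the wrap-around (the first server's buckets falling to the last server) is my assumption about what "and so on" means.

```python
import hashlib

TAKEOVER_STEP = 10.0  # seconds; matches the 10 s / 20 s delays in the list above
NUM_BUCKETS = 64      # hypothetical; the proposal does not fix a bucket count


def bucket_of(session_id: str) -> int:
    """Map a session to a bucket with a hash that is stable across nodes."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def owner_of(bucket: int, num_servers: int) -> int:
    """Primary server (0-based) for a bucket.

    Simplification: plain modulo, not a real consistent-hashing ring.
    """
    return bucket % num_servers


def takeover_delay(owner: int, server: int, num_servers: int) -> float:
    """Seconds `server` waits before firing callbacks for `owner`'s buckets.

    The owner itself waits 0 s, the neighbour below it 10 s, the one
    below that 20 s, wrapping around the server list (assumption).
    """
    return ((owner - server) % num_servers) * TAKEOVER_STEP
```

With three servers, `takeover_delay(1, 0, 3)` is the "server2 to server1" case and `takeover_delay(2, 0, 3)` is the "server3 to server1" case.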
This means: if one of the servers fails or lags too much, its messages are delayed by just 10 seconds, but all inactivity callbacks are still sent (in complex failure scenarios some may be duplicated, which is fine). It also avoids introducing any complex failure detection or leader election algorithms.
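
A minimal Python sketch of how a single node could behave under this scheme, assuming every node sees every session's inactivity deadline through the timer sync. The class and method names (`InactivityScheduler`, `on_inactive`, `on_peer_sent`, `due`) are hypothetical, servers are numbered from 0, and the wrap-around in the delay is an assumption.

```python
class InactivityScheduler:
    """Per-node view of pending inactivity callbacks (illustrative only)."""

    def __init__(self, server: int, num_servers: int, step: float = 10.0):
        self.server = server
        self.num_servers = num_servers
        self.step = step      # seconds of extra delay per backup position
        self.pending = {}     # session_id -> time at which this node may fire

    def on_inactive(self, session_id: str, owner: int, deadline: float) -> None:
        """Schedule a callback: the owner fires at the deadline, backups later."""
        delay = ((owner - self.server) % self.num_servers) * self.step
        self.pending[session_id] = deadline + delay

    def on_peer_sent(self, session_id: str) -> None:
        """Another server reported that it sent the callback: cancel ours."""
        self.pending.pop(session_id, None)

    def due(self, now: float) -> list:
        """Return and remove the sessions whose callback this node must send."""
        ready = sorted(s for s, t in self.pending.items() if t <= now)
        for s in ready:
            del self.pending[s]
        return ready
```

If the owner is healthy, its report reaches the backups before their delayed timers expire and `on_peer_sent` cancels them. If the owner is dead or lagging, the first backup's `due` returns the session 10 seconds after the deadline. A duplicate can only occur when a report arrives after a backup has already fired.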
@popravich ?