Cannot create a connection between two instances after creating and closing several thousand connections
Since updating our system tests to v1.28, we're seeing some migrations get stuck in a CONNECTING state for 15m+, even though both the source and target nodes are healthy. Roughly 50% of our test runs hit this issue.
On the source node we see "Migration initiating" and "Connecting to target node" logged in a busy loop for a few seconds (~30k times in 7 seconds), then no further output, yet SLOT-MIGRATION-STATUS still reports the state as CONNECTING.
There could be a regression on the control plane, though I don't see any related changes that could have caused this, and as far as I can tell the cluster configuration looks valid.
Will keep looking and trying to reproduce, and will add more info as I find it. Artifacts for the affected datastore can be fetched with:
da-staging datastore artifacts dst_4or9o54g2 --download ./logs
migrations:
- migration_id: migration_oktfku418
  started_at: 2025-03-18 09:32:00
  finished_at: (unset)
  status:
    state: in-progress
    error: ""
  config:
    datastore_id: dst_4or9o54g2
    source:
      shard_id: shard_ivuduo6oh
      node_id: node_j53hhxntj
    target:
      shard_id: shard_5i8o7vbr5
      node_id: node_h6ta41rse
    slot_ranges:
    - start: 10921
      end: 10921    # i.e. node_j53hhxntj -> node_h6ta41rse
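For context, the state above is what the source keeps reporting via SLOT-MIGRATION-STATUS. A minimal sketch of how this can be polled from a client (redis-py, with a placeholder host/port; treating the DFLYCLUSTER SLOT-MIGRATION-STATUS reply as opaque text is an assumption here, not the documented format):

import time

import redis  # redis-py

def wait_for_migration_to_finish(host, port, timeout_s=900.0):
    """Poll DFLYCLUSTER SLOT-MIGRATION-STATUS until the reply stops mentioning
    CONNECTING/SYNC or the timeout expires. Substring checks are used because
    the exact reply format is assumed, not taken from the docs."""
    client = redis.Redis(host=host, port=port)
    deadline = time.monotonic() + timeout_s
    text = ""
    while time.monotonic() < deadline:
        reply = client.execute_command("DFLYCLUSTER", "SLOT-MIGRATION-STATUS")
        text = str(reply)
        if "FINISHED" in text or "ERROR" in text:
            return text
        time.sleep(1)
    raise TimeoutError(f"migration not finished after {timeout_s}s, last status: {text}")

# In the failing runs the source keeps answering CONNECTING, so a call like
# wait_for_migration_to_finish("<source node host>", 6379) never returns.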
Seeing another issue where the target says the migration is in state FINISHED, but the source says it is in state SYNC. Again the migration is just stuck forever (dst_j8a9dr440/migration_31hzqm2op). Maybe related?
I20250319 09:45:21.473965 1720 scheduler.cc:480] ------------ Fiber outgoing_migration (suspended:1056085ms) ------------
    0x555555f7e29c util::fb2::detail::FiberInterface::SwitchTo()
    0x555555f7aa93 util::fb2::detail::Scheduler::Preempt()
    0x555555fbb208 util::fb2::FiberCall::Get()
    0x555555fc3984 util::fb2::UringSocket::Recv()
    0x5555559f0349 dfly::ProtocolClient::ReadRespReply()
    0x5555559f0755 dfly::ProtocolClient::SendCommandAndReadResponse()
    0x55555590bd44 dfly::cluster::OutgoingMigration::SyncFb()
It looks like we can't read from the socket at all: the outgoing_migration fiber has been suspended inside UringSocket::Recv() for ~17 minutes (1056085ms), waiting for a reply from the target.
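To illustrate the failure mode in isolation (this is not Dragonfly code, just a stand-alone sketch of the same pattern): a blocking read on a connection whose peer never answers stays suspended indefinitely unless a read timeout is set, which is exactly where the fiber above is parked.

import socket
from typing import Optional

def read_reply(sock: socket.socket, timeout_s: Optional[float] = None) -> bytes:
    """Blocking read of a reply. With timeout_s=None this can hang forever if
    the peer stops responding, mirroring the fiber suspended in
    UringSocket::Recv(); a finite timeout turns the hang into a visible error."""
    sock.settimeout(timeout_s)   # None means block indefinitely
    data = sock.recv(4096)       # suspends until data arrives, EOF, or timeout
    if not data:
        raise ConnectionError("peer closed the connection")
    return data

# SendCommandAndReadResponse() in the trace is in the same position: the
# command was sent, but no reply bytes ever arrive from the target.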
migrations:
- migration_id: migration_7dxbpv7c3
  started_at: 2025-03-19 13:34:59
  finished_at: (unset)
  status:
    state: in-progress
    error: ""
  config:
    datastore_id: dst_lj0vl2vi3
    source:
      shard_id: shard_htfg7xztu
      node_id: node_1dkylhwoc
    target:
      shard_id: shard_tg7e7h1mu
      node_id: node_fifk3846c
    slot_ranges:
    - start: 10922
      end: 13651
I've tried to reproduce it locally in the following ways:
- Generate a random delay while sending the config to some nodes: no results.
- Send the config only to the 2 source nodes and not to the target node: no results.
- Simulate network issues using a proxy (see the sketch after this list): this produces a "Connection refused" error and the migration cannot be finished.
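For reference, here is a rough sketch of the proxy from the last attempt (a simplified stand-in written for this report, not the actual tooling; the listen port, cut-off policy, and 127.0.0.1 binding are arbitrary choices): it forwards a handful of connections to the target node and then stops accepting, so later connection attempts from the source fail.

import socket
import threading

def run_flaky_proxy(listen_port, target_host, target_port, max_connections=5):
    """Forward TCP connections to the target node, then stop accepting after
    max_connections to simulate the target becoming unreachable."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", listen_port))
    server.listen()

    def pipe(src, dst):
        # Copy bytes in one direction until EOF or error, then close the peer.
        try:
            while chunk := src.recv(4096):
                dst.sendall(chunk)
        except OSError:
            pass
        finally:
            dst.close()

    for _ in range(max_connections):
        client, _ = server.accept()
        upstream = socket.create_connection((target_host, target_port))
        threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

    # Stop accepting: new connection attempts to listen_port now fail, which is
    # where the "Connection refused" error in that experiment comes from.
    server.close()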
Reducing priority as we have a short-term fix for now.
The bug can be reproduced on the stuck_migration branch. In most cases the following command reproduces it with ~50% probability:

pytest --count=100 -x dragonfly/cluster_test.py -k test_cluster_reconnect
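(The --count option here presumably comes from the pytest-repeat plugin.) As a rough outline of what such a test amounts to, everything below is hypothetical for illustration only; the cluster fixture and its methods are made up and do not match the real helpers in cluster_test.py:

def test_cluster_reconnect_outline(cluster):
    # Churn connections, as in the title ("several thousand connections").
    for _ in range(5000):
        cluster.source.connect().close()
    # Start a slot migration between the two nodes.
    migration = cluster.start_slot_migration(slot_start=10921, slot_end=10921)
    # The bug: the source never leaves CONNECTING, so this wait times out.
    migration.wait_for_state("FINISHED", timeout=60)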