
We cannot create a connection between two instances after creating and closing several thousand connections

andydunstall opened this issue 10 months ago • 8 comments

Since updating our system tests to v1.28 we're seeing some migrations get stuck in a CONNECTING state for 15m+, even though both the source and target nodes are healthy. ~50% of our test runs are hitting this issue.

We see Migration initiating and Connecting to target node logged in a busy loop for a few seconds on the source node (~30k times in 7 seconds), then no further output. However, SLOT-MIGRATION-STATUS still reports the state as CONNECTING.
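
For context, a minimal sketch of how this state can be polled from a test. It assumes the status is queried with a DFLYCLUSTER SLOT-MIGRATION-STATUS admin command via redis-py and that the reply is a list of per-migration status strings; both the command spelling and the reply shape are assumptions, not taken from this thread:

```python
import time
import redis

def wait_for_migration_finished(host, port, timeout_s=60.0):
    # Polls a node until every migration reports FINISHED.
    # Command spelling and reply shape are assumptions; adjust for your build.
    client = redis.Redis(host=host, port=port)
    deadline = time.monotonic() + timeout_s
    states = []
    while time.monotonic() < deadline:
        reply = client.execute_command("DFLYCLUSTER", "SLOT-MIGRATION-STATUS")
        states = [s.decode() if isinstance(s, bytes) else str(s) for s in reply]
        if states and all("FINISHED" in s for s in states):
            return states
        time.sleep(1.0)
    # In the failing runs the loop ends here: the state never leaves CONNECTING.
    raise TimeoutError(f"migration did not finish; last status: {states}")
```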

There could be a regression in the control plane, though I don't see any related changes that could have caused this. As far as I can see the cluster configuration looks valid.

Will keep looking and trying to reproduce, so will add more info...

andydunstall avatar Mar 18 '25 14:03 andydunstall

da-staging datastore artifacts dst_4or9o54g2 --download ./logs

migrations:

  • migration_id: migration_oktfku418
    started_at: 2025-03-18 09:32:00   finished_at: (unset)
    status: state: in-progress  error: ""
    config:
      datastore_id: dst_4or9o54g2
      source: shard_id: shard_ivuduo6oh  node_id: node_j53hhxntj
      target: shard_id: shard_5i8o7vbr5  node_id: node_h6ta41rse
      slot_ranges:
        • start: 10921 end: 10921
    i.e. node_j53hhxntj -> node_h6ta41rse

BorysTheDev avatar Mar 18 '25 14:03 BorysTheDev

Seeing another issue where the target says the migration has state FINISHED, but the source says it has state SYNC. Again it means the migration is just stuck forever (dst_j8a9dr440/migration_31hzqm2op). Maybe related?

andydunstall avatar Mar 18 '25 17:03 andydunstall

I20250319 09:45:21.473965 1720 scheduler.cc:480] ------------ Fiber outgoing_migration (suspended:1056085ms) ------------
0x555555f7e29c util::fb2::detail::FiberInterface::SwitchTo()
0x555555f7aa93 util::fb2::detail::Scheduler::Preempt()
0x555555fbb208 util::fb2::FiberCall::Get()
0x555555fc3984 util::fb2::UringSocket::Recv()
0x5555559f0349 dfly::ProtocolClient::ReadRespReply()
0x5555559f0755 dfly::ProtocolClient::SendCommandAndReadResponse()
0x55555590bd44 dfly::cluster::OutgoingMigration::SyncFb()

BorysTheDev avatar Mar 19 '25 09:03 BorysTheDev

It looks like we can't read from the socket at all.
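
The stack above shows the outgoing-migration fiber parked in UringSocket::Recv for ~17 minutes: a blocking read on a connection that never delivers data and is never closed. As a purely illustrative sketch (plain Python sockets, not Dragonfly code), the difference between hanging forever and surfacing the dead connection is whether the read has a deadline:

```python
import socket

def read_reply(sock, timeout_s=None):
    """Read one chunk of a reply from the peer.

    With timeout_s=None this mirrors the stuck fiber: recv() blocks forever if
    the peer neither sends data nor closes the connection. With a finite
    timeout the dead connection is detected and the caller can reconnect/retry.
    """
    sock.settimeout(timeout_s)
    try:
        data = sock.recv(4096)
    except socket.timeout:
        raise ConnectionError("peer unresponsive; reconnect and retry")
    if not data:
        raise ConnectionError("peer closed the connection")
    return data
```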

BorysTheDev avatar Mar 19 '25 14:03 BorysTheDev

migrations:

  • migration_id: migration_7dxbpv7c3
    started_at: 2025-03-19 13:34:59   finished_at: (unset)
    status: state: in-progress  error: ""
    config:
      datastore_id: dst_lj0vl2vi3
      source: shard_id: shard_htfg7xztu  node_id: node_1dkylhwoc
      target: shard_id: shard_tg7e7h1mu  node_id: node_fifk3846c
      slot_ranges:
        • start: 10922 end: 13651

dst_lj0vl2vi3.zip

BorysTheDev avatar Mar 19 '25 14:03 BorysTheDev

I've tried to reproduce it locally in the following ways:

  1. Generating a random delay while sending the config to some nodes — no result.
  2. Sending the config only to the 2 source nodes and not to the target node — no result.
  3. The second approach, but simulating network issues using a proxy (see the sketch after this list) — this gives a "Connection refused" error and the migration cannot be finished.
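
A minimal sketch of the kind of TCP proxy used for the third item. This is illustrative only; the asyncio implementation, the refuse switch, and the addresses are assumptions, not the actual test harness:

```python
import asyncio

async def run_flaky_proxy(listen_port, target_host, target_port, refuse=False):
    # Forwards listen_port -> target. With refuse=True it drops incoming
    # connections immediately, approximating an unreachable target node
    # during the migration handshake.

    async def pipe(reader, writer):
        try:
            while data := await reader.read(4096):
                writer.write(data)
                await writer.drain()
        finally:
            writer.close()

    async def handle(client_reader, client_writer):
        if refuse:
            client_writer.close()
            return
        server_reader, server_writer = await asyncio.open_connection(target_host, target_port)
        await asyncio.gather(pipe(client_reader, server_writer),
                             pipe(server_reader, client_writer))

    server = await asyncio.start_server(handle, "127.0.0.1", listen_port)
    async with server:
        await server.serve_forever()
```

Pointing the target node's address in the cluster config at such a proxy, then flipping refuse or killing the proxy mid-migration, is one way to exercise the reconnect path.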

BorysTheDev avatar Mar 20 '25 14:03 BorysTheDev

Reducing priority as we have a short-term fix for now.

romange avatar Mar 25 '25 14:03 romange

The bug can be reproduced with the stuck_migration branch.

In most cases the following command reproduces it with ~50% probability:

pytest --count=100 -x dragonfly/cluster_test.py -k test_cluster_reconnect
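
For anyone re-running this: the --count flag comes from the pytest-repeat plugin and -x stops at the first failure, so the run aborts as soon as the flaky case is hit. The same invocation from Python (paths and test name taken from the command above):

```python
import sys
import pytest

# Requires the pytest-repeat plugin for --count; -x stops on the first failure.
sys.exit(pytest.main([
    "--count=100", "-x",
    "dragonfly/cluster_test.py", "-k", "test_cluster_reconnect",
]))
```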

BorysTheDev avatar Mar 27 '25 08:03 BorysTheDev