
High CPU during slot migrations on the destination shard

Open romange opened this issue 6 months ago • 10 comments

When executing a slot migration we see elevated CPU usage on the destination shard, much higher than on the source shard. To see the profile map, use:

pprof -http 0.0.0.0:8080 profile003.pb.gz

profile003.pb.gz

romange avatar Jul 13 '25 10:07 romange

  1. JournalReader::ReadString is inefficient: it does a double memory copy and possibly enlarges io_buf_ due to
if (auto ec = EnsureRead(size); ec)
   return make_unexpected(ec);

instead of reading what is already buffered into the destination directly and then filling the rest straight from source_. This would keep io_buf_ mostly for metadata reads. This is especially relevant for slot migrations, which use the RESTORE command with relatively large string blobs (see the first sketch after this list).

  2. The current loop is very simplistic: we parse and run commands one by one, but ExecuteTx has an inherent hop latency. Instead we could keep an array of TransactionData and read several of them at once, as long as the sum of their command_buf allocations stays below some limit. Yes, we could block on the socket while some pending transactions are not yet applied, but that is not important because slot migrations are eventually consistent. Then, instead of calling service_->DispatchCommand every time, we could use Service::DispatchManyCommands, which runs multiple commands in a single hop, improving both latency and CPU (see the second sketch below).
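A minimal sketch of the direct-read idea from item 1. The names are modeled on helio's io::IoBuf and io::Source, but they and ReadStringDirect itself are assumptions for illustration, not the actual JournalReader code:

// Hypothetical: read a length-prefixed string without staging the whole
// payload in io_buf_, so io_buf_ never grows for large blobs.
std::error_code ReadStringDirect(size_t size, std::string* dest) {
  dest->resize(size);
  char* out = dest->data();

  // 1) Drain whatever is already buffered in io_buf_.
  size_t from_buf = std::min(size, io_buf_.InputLen());
  memcpy(out, io_buf_.InputBuffer().data(), from_buf);
  io_buf_.ConsumeInput(from_buf);

  // 2) Read the remainder straight from source_ into the destination,
  //    bypassing io_buf_ entirely.
  uint8_t* next = reinterpret_cast<uint8_t*>(out) + from_buf;
  size_t remaining = size - from_buf;
  while (remaining > 0) {
    auto res = source_->Read(io::MutableBytes{next, remaining});
    if (!res)
      return res.error();
    if (*res == 0)  // unexpected EOF mid-string
      return std::make_error_code(std::errc::connection_aborted);
    next += *res;
    remaining -= *res;
  }
  return {};
}

And a rough sketch of the batched-dispatch idea from item 2. TransactionData, DispatchCommand and DispatchManyCommands come from the comment above; the helper names and the loop shape are assumptions:

// Hypothetical batching loop: accumulate parsed transactions until their
// command_buf allocations cross a limit, then run them in one hop via
// Service::DispatchManyCommands instead of one DispatchCommand per tx.
constexpr size_t kMaxBatchBytes = 1 << 20;  // assumed limit, tune as needed

std::vector<TransactionData> batch;
size_t batch_bytes = 0;

while (!cntx->IsCancelled()) {
  auto tx_data = tx_reader.NextTxData(&reader, cntx);
  if (!tx_data)
    break;

  batch_bytes += CommandBufBytes(*tx_data);  // hypothetical size accessor
  batch.push_back(std::move(*tx_data));

  // Flush when the batch is large enough or nothing more is buffered.
  // Blocking on the socket with pending transactions unapplied is fine:
  // slot migrations are eventually consistent.
  if (batch_bytes >= kMaxBatchBytes || !reader.HasBufferedInput()) {
    DispatchBatch(&batch);  // would call Service::DispatchManyCommands once
    batch.clear();
    batch_bytes = 0;
  }
}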

romange avatar Jul 14 '25 04:07 romange

More comments: do not bundle a class definition together with its long implementation (ClusterShardMigration); it's not very readable.

romange avatar Jul 14 '25 04:07 romange

Another bottleneck is that we send strings in ASCII-unpacked form and then pack them again on the destination (7% of CPU is spent on packing).
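For context, "packing" here refers to densely encoding ASCII strings in memory. A minimal sketch of the general idea, squeezing 8 seven-bit ASCII characters into 7 bytes; this is an editor's illustration, not Dragonfly's actual packing code:

#include <cstdint>
#include <cstring>

// Pack 8 ASCII chars (each <= 0x7F) into 7 bytes by dropping the unused
// high bit of every char: 8 chars * 7 bits = 56 bits = 7 bytes.
inline void Pack8Ascii(const char src[8], uint8_t dest[7]) {
  uint64_t bits = 0;
  for (int i = 0; i < 8; ++i)
    bits |= (uint64_t(uint8_t(src[i])) & 0x7F) << (7 * i);
  std::memcpy(dest, &bits, 7);  // little-endian: low 56 bits carry the data
}

The 7% figure suggests the destination redoes this work for every migrated string, even though the source node presumably held the packed form before serializing.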

dranikpg avatar Jul 22 '25 09:07 dranikpg

We do that in full sync and it works super fast, so I believe that's not the reason. One of the things I would check is the thread locality of slot migrations. For example, the replicating connection on the master is migrated to the thread it reads from. With slots the direction is reversed. Do we have thread affinity on the master? Do we have it on the slave?

romange avatar Jul 22 '25 11:07 romange

> Do we have thread affinity on the master?

We do on the master; we don't on the slave.

dranikpg avatar Jul 22 '25 11:07 dranikpg

OK, we don't have it on the replica for full sync either, so maybe it's not that.

romange avatar Jul 22 '25 12:07 romange

We throttle on the target node to reduce the CPU usage of the migration process: #5715

BorysTheDev avatar Sep 02 '25 12:09 BorysTheDev

@andydunstall set up the environment and I've run several tests:

  1. Full slots migration: the migration takes < 20% CPU on the target node.
  2. Half slots migration: the migration takes < 20% CPU on the target node.
  3. With SET traffic, full slots migration: the migration takes < 80% CPU; average CPU usage on the target node is 10-20% higher than on the source node.
  4. With SET traffic, half slots migration: the migration takes < 40% CPU; average CPU usage on the target node is 10-20% higher than on the source node.
  5. With slot_migration_throttle_us = 20: CPU usage increased by 3-5% compared to slot_migration_throttle_us = 0.
  6. With slot_migration_throttle_us = 100 and SET traffic: CPU usage drops significantly, by 45%, but migration speed also drops (a sketch of how such a throttle works follows below).
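A rough sketch of what a microsecond-level migration throttle can look like on the target node. ApplyTx, TxStream and the loop shape are hypothetical illustrations, not the actual change from #5715:

#include <chrono>

// Hypothetical throttled apply loop. Sleeping a few microseconds after
// each applied transaction yields the CPU back to regular traffic,
// trading migration speed for lower CPU usage, consistent with the
// measurements above (throttle 0 = full speed, 100us = ~45% less CPU).
void ApplyLoop(TxStream& stream, uint64_t throttle_us) {
  while (auto tx = stream.Next()) {
    ApplyTx(*tx);
    if (throttle_us > 0) {
      // A fiber sleep (not a thread sleep), so the proactor thread keeps
      // serving other connections while the migration backs off.
      ThisFiber::SleepFor(std::chrono::microseconds(throttle_us));
    }
  }
}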

BorysTheDev avatar Sep 26 '25 14:09 BorysTheDev

What's a full slots/half slots migration?

romange avatar Sep 26 '25 14:09 romange

I mean all slots (16384 slots) for full, and half the slots (8190 slots) for half.

BorysTheDev avatar Sep 26 '25 17:09 BorysTheDev