High CPU during slot migrations on the destination shard
When executing a slot migration, we see elevated CPU on the destination shard - much higher than on the source shard. To view the profile, use:

```
pprof -http 0.0.0.0:8080 profile003.pb.gz
```
- `JournalReader::ReadString` is inefficient: it does a double memory copy and may enlarge `io_buf_` because of

  ```cpp
  if (auto ec = EnsureRead(size); ec)
    return make_unexpected(ec);
  ```

  instead of reading what is already available from the buffer and then filling the rest directly from `source_`. That would keep `io_buf_` mostly for metadata reads. This is especially relevant for slot migrations, which use the RESTORE command with relatively large string blobs (see the `ReadString` sketch after this list).
- The current loop is very simplistic: we parse and run commands one by one, but `ExecuteTx` has inherent hop latency. Instead we could keep an array of `TransactionData` and read several entries as long as the sum of their `command_buf` allocations stays below some limit. Yes, we could block on the socket with some pending transactions not yet applied, but that is not important because slot migrations are eventually consistent. Then, instead of calling `service_->DispatchCommand` for every command, we could use `Service::DispatchManyCommands`, which runs multiple commands in a single hop, improving both latency and CPU (see the batching sketch after this list).
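A minimal sketch of the proposed `ReadString` change. The `Source` and `Buffer` types below are simplified stand-ins for `io::Source` and the reader's io buffer, not Dragonfly's actual classes: drain whatever is already buffered, then read the remainder of a large blob straight into the destination string so `io_buf_` never has to grow.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <system_error>

// Stand-in for io::Source: reads exactly `len` bytes into `dest` (assumption).
struct Source {
  virtual ~Source() = default;
  virtual std::error_code ReadExact(void* dest, size_t len) = 0;
};

// Stand-in for the reader's io buffer: bytes already pulled from the socket.
struct Buffer {
  const uint8_t* data = nullptr;
  size_t size = 0;
  void Consume(size_t n) { data += n; size -= n; }
};

// Today EnsureRead(size) grows the io buffer until the whole blob is staged
// there and only then copies it out (double copy). The alternative below takes
// whatever is already buffered and reads the remainder of the payload straight
// into the destination string.
std::error_code ReadStringNoDoubleCopy(Source& source, Buffer& io_buf,
                                       size_t size, std::string* out) {
  out->resize(size);
  char* dest = out->data();

  // 1. Drain the bytes that happen to be buffered already (usually just the
  //    tail of the metadata that preceded the blob).
  size_t from_buf = std::min(size, io_buf.size);
  if (from_buf > 0) {
    std::memcpy(dest, io_buf.data, from_buf);
    io_buf.Consume(from_buf);
  }

  // 2. Read the rest of the blob directly into the string, bypassing the io
  //    buffer so it never has to grow to fit a large RESTORE payload.
  size_t remaining = size - from_buf;
  if (remaining > 0)
    return source.ReadExact(dest + from_buf, remaining);
  return {};
}
```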
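And a sketch of the batching idea, again with hypothetical stand-ins: `TransactionData::MemoryUsage`, `kMaxBatchBytes`, and the `read_next`/`apply_batch` callbacks are assumptions, where `apply_batch` models a single `Service::DispatchManyCommands`-style hop.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct TransactionData {
  std::string command_buf;  // serialized command payload read from the journal stream
  size_t MemoryUsage() const { return command_buf.size(); }
};

// Cap on buffered command_buf allocations before flushing (value is an assumption).
constexpr size_t kMaxBatchBytes = 1 << 20;

// read_next: fills `tx` from the socket, returns false at end of stream.
// apply_batch: applies all buffered commands in one hop.
template <typename ReadFn, typename ApplyFn>
void ApplyLoop(ReadFn read_next, ApplyFn apply_batch) {
  std::vector<TransactionData> batch;
  size_t batch_bytes = 0;

  TransactionData tx;
  while (read_next(tx)) {
    batch_bytes += tx.MemoryUsage();
    batch.push_back(std::move(tx));

    // Keep accumulating while the buffered allocations stay under the limit.
    // Leaving a few transactions unapplied while we block on the socket is
    // acceptable because slot migration is eventually consistent anyway.
    if (batch_bytes < kMaxBatchBytes)
      continue;

    apply_batch(batch);  // one hop for the whole batch instead of one per command
    batch.clear();
    batch_bytes = 0;
  }

  if (!batch.empty())
    apply_batch(batch);  // flush the tail when the stream ends
}
```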
More comments:
- Do not define classes bundled together with their long implementations (`ClusterShardMigration`); it's not very readable.
- Another bottleneck is that we send strings in ASCII-unpacked form and then pack them again on the destination (7% of CPU is spent on packing; see the packing sketch below).
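For context on what that packing step involves, here is a generic illustration of 7-bit ASCII packing (8 chars into 7 bytes). This is a sketch of the general technique, not Dragonfly's actual packing code, but it shows the per-byte bit shuffling the destination currently repeats for every migrated string.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

std::vector<uint8_t> AsciiPack(std::string_view src) {
  std::vector<uint8_t> out;
  out.reserve(src.size() * 7 / 8 + 1);

  uint32_t bit_buf = 0;  // accumulates 7-bit codes
  int bit_count = 0;

  for (unsigned char c : src) {
    bit_buf |= uint32_t(c & 0x7F) << bit_count;  // append the 7 significant bits
    bit_count += 7;
    while (bit_count >= 8) {                     // emit full bytes as they fill up
      out.push_back(uint8_t(bit_buf & 0xFF));
      bit_buf >>= 8;
      bit_count -= 8;
    }
  }
  if (bit_count > 0)
    out.push_back(uint8_t(bit_buf & 0xFF));      // trailing partial byte
  return out;
}
```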
We do that in full sync and it works super fast, so I believe it's not the reason. One of the things I would check is the thread locality of slot migrations. For example, the replicating connection on the master is migrated to the thread it reads from; with slots the direction is reversed. Do we have thread affinity on the master? Do we have it on the slave?
> Do we have thread affinity on the master?
We do on the master; we don't have it on the slave.
OK, we don't have it on the replica for full sync either, so maybe it's not that.
We throttle on the target node to reduce CPU usage during the migration process: #5715
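A minimal sketch of what per-command throttling on the target could look like. It assumes `slot_migration_throttle_us` simply adds a delay between applied migration commands; the actual change in #5715 may implement the throttling differently, and in a fiber-based server the sleep would yield the fiber rather than block the thread.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical flag value in microseconds; 0 disables throttling.
uint64_t slot_migration_throttle_us = 0;

// apply_next_command: applies one incoming migration command and returns false
// when the migration stream ends (a stand-in for the real apply loop).
template <typename ApplyFn>
void ApplyWithThrottle(ApplyFn apply_next_command) {
  while (apply_next_command()) {
    if (slot_migration_throttle_us > 0) {
      // std::this_thread is used only to keep the sketch self-contained;
      // blocking a shard thread like this would not be acceptable in practice.
      std::this_thread::sleep_for(
          std::chrono::microseconds(slot_migration_throttle_us));
    }
  }
}
```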
@andydunstall set up the environment and I ran several tests:
- Full slots migration: migration takes < 20% CPU on the target node.
- Half slots migration: migration takes < 20% CPU on the target node.
- With SET traffic, full slots migration: migration takes < 80% CPU; average CPU usage on the target node is 10-20% higher than on the source node.
- With SET traffic, half slots migration: migration takes < 40% CPU; average CPU usage on the target node is 10-20% higher than on the source node.
- With `slot_migration_throttle_us` = 20: CPU usage increased by 3-5% compared with `slot_migration_throttle_us` = 0.
- With `slot_migration_throttle_us` = 100 and SET traffic: CPU usage drops significantly, by ~45%, but migration speed also drops.
What do you mean by a full/half slots migration?
I mean all slots (16384 slots) for full, and half of the slots (8190 slots) for half.