Add an API to support splitting a given range into shards individually
Writing a contiguous key range under heavy load can create hot spots, driving up storage_server_write_queue_size or even triggering process_behind. Such hot spots may be known to the upper-layer services in advance, yet they cannot be avoided today. I wish there were an API to pre-split a given range into shards so that the write load is distributed evenly across multiple storage servers. That way, the application layer could proactively partition the key space to reduce hot spots.
Thanks for the information. @DuanChangfeng0708 Is the skewed traffic primarily from update operations (or insertions following deletions), or from insert operations? Which FDB version are you using?
I'm using version 7.1.27. Sequential writes under heavy load can cause this problem, especially after a key range has been cleared following a long period of accumulation. We eventually found that all the key ranges written with the same prefix were concentrated on a few StorageServers, so those StorageServers fell far behind in version. A process_behind error occurred while Ratekeeper was fetching the ServerList, which drove the TPS limit down to 0 and caused a service interruption. The root cause is that keys sharing the same prefix all land in the same shard, creating a hot spot. In our upper layer, for better concurrency and to avoid hot spots, we had already introduced our own notion of shards by appending a hash string after the common prefix, but that does not help in FDB because the keys still fall within a single FDB shard. So I would like an API that lets the business layer control the granularity of FDB shards, so the write load is spread more evenly and hot spots are avoided.
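For illustration, here is a minimal Python sketch of the application-level hashing described above, using the FDB Python bindings; the key layout and the NUM_HASH_SHARDS constant are assumptions for the example. As noted, this alone does not change FDB's shard boundaries, so the hashed keys may still land inside a single FDB shard.

```python
import hashlib

import fdb

fdb.api_version(710)
db = fdb.open()

NUM_HASH_SHARDS = 16  # assumed bucket count, for illustration only


def sharded_key(prefix: bytes, task_id: bytes) -> bytes:
    # Derive a stable bucket from the task id and place it right after
    # the common prefix so that writes fan out across the key space.
    bucket = int(hashlib.md5(task_id).hexdigest(), 16) % NUM_HASH_SHARDS
    return prefix + b"/%02d/" % bucket + task_id


@fdb.transactional
def write_task(tr, prefix, task_id, value):
    tr[sharded_key(prefix, task_id)] = value
```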
Thanks! Currently, we have an experimental feature for manual shard splitting in release-7.1.
The fdbcli command is redistribute <BeginKey> <EndKey>.
The input range is passed to the data distributor (DD), and DD issues data moves covering all data within the range. Suppose the current shard boundaries are [a, c), [c, d), [d, f) and the input range is [b, d): DD splits the shard [a, c) into [a, b) and [b, c), and triggers data moves for [a, b), [b, c), and [c, d).
On the main branch, we are developing a better solution for this requirement.
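To make the boundary arithmetic concrete, here is a small Python sketch (not the data distributor's actual implementation) that reproduces the example above: it computes the new shard boundaries for an input range and the shards whose data moves would be triggered.

```python
def split_for_redistribute(boundaries, begin, end):
    # boundaries: sorted shard edges, e.g. ["a", "c", "d", "f"] for the
    # shards [a, c), [c, d), [d, f).
    old_shards = list(zip(boundaries, boundaries[1:]))
    new_edges = sorted(set(boundaries) | {begin, end})
    new_shards = list(zip(new_edges, new_edges[1:]))
    # Original shards that intersect the input range [begin, end).
    overlapping = [(b, e) for (b, e) in old_shards if e > begin and b < end]
    # Every piece of an overlapping shard is moved, including the piece
    # that falls outside the input range (here [a, b)).
    moved = [(b, e) for (b, e) in new_shards
             if any(b >= ob and e <= oe for (ob, oe) in overlapping)]
    return new_shards, moved


new_shards, moved = split_for_redistribute(["a", "c", "d", "f"], "b", "d")
# new_shards -> [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'f')]
# moved      -> [('a', 'b'), ('b', 'c'), ('c', 'd')]
```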
I have read the relevant implementation code, and I believe custom shard boundaries should be persisted and should never be merged away.
Consider a scenario (such as backend jobs) where a batch of tasks sharing the same prefix is created at 1am every day, and all of the tasks are completed and cleared at 2am. The keys written at 1am will hit a hot spot because they share the prefix. After the deletion at 2am, the related shards merge back together, and the same situation recurs the next day, so the goal of customizing shards to distribute the load evenly is never achieved.
Yes. On the main branch / release-7.3, we have introduced another experimental fdbcli command, rangeconfig. It was initially developed to create a larger team with more storage servers to handle high read traffic on a specific range. The command also allows customizing shard boundaries, and these custom boundaries are persisted.
If this requirement is implemented, one detail needs attention: I want keys under the same prefix to be dispersed across as many different teams as possible, so I need a way to calculate what fraction of the range each team holds.
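As a reference point, here is a rough client-side sketch, assuming the FDB Python bindings, of how one might estimate how a prefix range is currently spread. It counts shards per hosting server set via the locality API, using each shard's server set as a stand-in for its team, since team membership itself is not exposed to clients; the function name is illustrative.

```python
from collections import Counter

import fdb

fdb.api_version(710)
db = fdb.open()


def range_spread(db, begin, end):
    # Shard boundary keys within [begin, end), via the locality API.
    edges = list(fdb.locality.get_boundary_keys(db, begin, end))
    tally = Counter()
    tr = db.create_transaction()
    for shard_begin in edges:
        # Storage servers hosting the first key of the shard, used as a
        # proxy for the team holding that shard.
        addrs = tr.get_addresses_for_key(shard_begin).wait()
        tally[tuple(sorted(addrs))] += 1
    total = sum(tally.values()) or 1
    # Fraction of the range's shards held by each server set.
    return {servers: count / total for servers, count in tally.items()}
```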