Lock key ranges instead of the entire DB
Can we lock a key range in a DB while still allowing clients to operate on the other key ranges?
Two scenarios can benefit from this:
- If data corruption happens in a small key range, say [b, c), the cluster can still serve traffic in the rest of the key space while we restore the data in [b, c).
- If someone wants to move data in a small key range, say [b, c), from a live cluster A to another live cluster B, we can lock only that key range while still allowing both clusters to serve traffic to the rest of the key space.
The implementation is similar to Issue https://github.com/apple/foundationdb/issues/1044, except that we don't have to distinguish between read and write locks.
We are debating two designs:
- Client + Proxy locking. All proxies keep the locked ranges in memory and in sync, so that for every transaction, if any mutation reads or modifies keys in these locked ranges, the transaction is marked as failed. To reduce the work on the Proxies, each client caches the set of locked ranges so that each client transaction can be checked locally without sending transaction data over. The cached ranges can be invalidated by a version (the mechanism is similar to the metadata version key) piggybacked on the GRV response; if the version is obsolete, the client needs to retrieve the locked ranges from the Proxies (see the sketch after this list).
  - The advantage of this design is its simplicity and low overhead on the cluster, since the checking is done on the client side.
  - The disadvantage is that all clients need to fetch the locked ranges back in the GRV response. We’ll need to put a limit on the total allowed locked ranges, as well as on the frequency of changing them, to reduce the overhead of sending many large GRV responses back. Another drawback is that checking on the client side is potentially unsafe due to memory corruption, and a malicious client can always bypass the check, but the risk is low (since we write the client library).
  - In the case that a client caches a read version and reuses it for a later transaction: if that read version is less than the commit version of a later lock transaction, the client's cache may not reflect the lock, so it proceeds to commit. When a Proxy receives the transaction, the Proxy needs to verify that the client has the latest cache. In this case the client has a stale cache, so the Proxy needs to tell the client to retry with the latest locked ranges.
- Resolver + StorageServer locking. The locked ranges are kept on all resolvers, so that for every transaction, if any mutation changes keys in these locked ranges, the transaction is marked as failed. For read-only transactions, locking at the resolvers alone is not enough: the locked ranges must also be present on all storage servers so that reads for keys within these ranges can be failed.
  - The advantage is that the checking is done on the cluster side, which is safe and requires no client-side changes.
  - The disadvantage is a more complex design. Since ranges can be moved among storage servers, the information on the locked ranges needs to be kept on all storage servers. We need to either keep a whole copy of the locked ranges on every storage server, or update each storage server’s own locked ranges. Keeping a whole copy limits the total number of locked ranges (since the information is kept in storage servers’ memory and is broadcast), and a broadcast mechanism is needed to synchronize the locked ranges from Proxies to all resolvers and storage servers (maybe by adding all tags to the mutation). Updating each storage server’s own locked ranges is also very tricky, because the DataDistributor can split or merge shards across different storage teams, thus requiring the locks to be moved along with the moved ranges. Since the locks are probably stored in the system key space and the moved ranges are in the normal key space, MoveKeys must update these two places atomically.
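For illustration, here is a minimal sketch of the client-side check in the first design. All names (`LockedRangeCache`, the lock version on the GRV reply, etc.) are hypothetical, not existing FoundationDB APIs; the sketch only shows the idea of invalidating cached locked ranges via a version piggybacked on the GRV reply and checking key ranges locally before commit.

```cpp
// Hypothetical sketch of the client-side cache in design 1 (illustrative names).
// The GRV reply is assumed to piggyback a "lock version" that changes whenever
// the set of locked ranges changes, similar to the metadata version key.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct KeyRange {
    std::string begin, end; // half-open range [begin, end)
};

class LockedRangeCache {
public:
    // Called on every GRV reply. If the piggybacked lock version differs from
    // the cached one, the cache is stale and the locked ranges must be
    // re-fetched from a proxy before the transaction can be checked locally.
    void onGetReadVersionReply(int64_t lockVersion) {
        if (lockVersion != cachedLockVersion) stale = true;
    }

    // Replaces the cache after fetching the locked ranges from a proxy.
    void refresh(int64_t lockVersion, const std::vector<KeyRange>& ranges) {
        rangeEndToBegin.clear();
        for (const auto& r : ranges) rangeEndToBegin[r.end] = r.begin;
        cachedLockVersion = lockVersion;
        stale = false;
    }

    bool isStale() const { return stale; }

    // True if [begin, end) intersects any locked range; such a transaction is
    // failed locally instead of being sent to the proxies.
    bool intersectsLockedRange(const std::string& begin, const std::string& end) const {
        auto it = rangeEndToBegin.upper_bound(begin); // first locked range ending after `begin`
        return it != rangeEndToBegin.end() && it->second < end;
    }

private:
    // Locked ranges are assumed non-overlapping: map of range end -> range begin.
    std::map<std::string, std::string> rangeEndToBegin;
    int64_t cachedLockVersion = -1;
    bool stale = true;
};
```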
The contract we want to maintain is: for a lock transaction T_L with commit version C,
- Any transaction T committed before C won’t see the effect of T_L, and thus can proceed to commit.
- Any transaction T started after C will see the effect of T_L, and thus may fail due to locking.
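For concreteness, a minimal sketch of this version check, with illustrative names (not FoundationDB code):

```cpp
// Sketch of the version contract. readVersion is the version at which a
// transaction T started; lockCommitVersion is C, the commit version of the
// lock transaction T_L.
#include <cstdint>

using Version = int64_t;

// If T started at or after C, it must observe the lock and may be failed when
// it touches a locked range. If T committed before C, its commit version (and
// therefore its read version) is below C, so the lock does not apply to it.
bool lockApplies(Version readVersion, Version lockCommitVersion) {
    return readVersion >= lockCommitVersion;
}
```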
TODO lists after #3856 is merged:
- [ ] Verify that when the database is locked, ranges can't be locked/unlocked.
- [ ] Mixed lock/unlock requests
- [ ] Multiple lock/unlock ops in one request
- [ ] Limit on total locked ranges, or on the size of the cache.
- [ ] Add test workload, simulating leader election via locking.
- [ ] Add API with timeout for locking.
This PR introduces a range-based lock for write traffic: https://github.com/apple/foundationdb/pull/11693
The design uses the commit proxy to do the rejection. This keeps the design simple and does not increase the complexity of the write path. The lock is persisted in the system key space to avoid leaking the lock information to clients.
The downside is that if the locked ranges are fragmented, the system metadata for the range lock can be large, which can slow down recovery, since the metadata is fetched from the TLogs to the new commit proxies during recovery. We may want to limit the maximum lock count.
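For illustration only, here is a hedged sketch of the proxy-side rejection described above (names are hypothetical; this is not the code in PR 11693). It assumes the proxy keeps the locked ranges in memory, reloaded from the system key space after a recovery, and fails any commit whose mutations intersect a locked range.

```cpp
// Hypothetical sketch of commit-proxy rejection against a set of locked ranges.
#include <map>
#include <string>
#include <vector>

struct Mutation {
    std::string begin, end; // keys touched, as a half-open range [begin, end)
};

class RangeLockSet {
public:
    // Locked ranges are assumed non-overlapping; unlock must match a
    // previously locked range exactly.
    void lock(const std::string& begin, const std::string& end) { ends[end] = begin; }
    void unlock(const std::string& /*begin*/, const std::string& end) { ends.erase(end); }

    // True if [begin, end) intersects any locked range.
    bool intersects(const std::string& begin, const std::string& end) const {
        auto it = ends.upper_bound(begin); // first locked range ending after `begin`
        return it != ends.end() && it->second < end;
    }

private:
    std::map<std::string, std::string> ends; // range end -> range begin
};

// Reject the whole transaction if any of its mutations touches a locked range.
bool shouldRejectCommit(const RangeLockSet& locks, const std::vector<Mutation>& mutations) {
    for (const auto& m : mutations)
        if (locks.intersects(m.begin, m.end)) return true;
    return false;
}
```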
We will work on a range-based lock for read traffic.
Currently, BulkLoad is the first feature that relies on this range lock: https://github.com/apple/foundationdb/pull/11741. When loading a range, the first step is to shut down write traffic on that range.
More thoughts: read lock vs. write lock; exclusive lock vs. non-exclusive lock.