# TXN optimization
The following benchmark data shows how Xline and etcd process txn-put requests.

Benchmark command:

```
benchmark --clients=${clients} txn-put --key-size=8 --val-size=256 --total=${total} --key-space-size=${key-space-size}
```
## Xline
QPS | Latency(ms) | Time(s) | clients | total | key-space-size |
---|---|---|---|---|---|
324.3 | 3.1 | 6.2 | 1 | 2000 | 1 |
2768.4 | 3.6 | 3.6 | 10 | 10000 | 1 |
5429.5 | 9.2 | 3.7 | 50 | 20000 | 1 |
5498.6 | 18.1 | 3.6 | 100 | 20000 | 1 |
5485.5 | 36.3 | 5.5 | 200 | 30000 | 1 |
331.5 | 3.0 | 6.0 | 1 | 2000 | 100000 |
2700.4 | 3.7 | 3.7 | 10 | 10000 | 100000 |
5470.5 | 9.1 | 3.7 | 50 | 20000 | 100000 |
5563.0 | 17.9 | 3.6 | 100 | 20000 | 100000 |
5530.4 | 36.1 | 5.4 | 200 | 30000 | 100000 |
## etcd
QPS | Latency(ms) | Time(s) | clients | total | key-space-size |
---|---|---|---|---|---|
935.9 | 1.1 | 2.1 | 1 | 2000 | 1 |
3341.5 | 3.0 | 3.0 | 10 | 10000 | 1 |
12219.3 | 4.1 | 1.6 | 50 | 20000 | 1 |
25641.9 | 3.9 | 0.8 | 100 | 20000 | 1 |
16737.8 | 11.9 | 1.8 | 200 | 30000 | 1 |
812.7 | 1.2 | 2.5 | 1 | 2000 | 100000 |
4034.8 | 2.5 | 2.5 | 10 | 10000 | 100000 |
13525.4 | 3.7 | 1.5 | 50 | 20000 | 100000 |
24238.7 | 4.1 | 0.8 | 100 | 20000 | 100000 |
30863.9 | 6.5 | 1.0 | 200 | 30000 | 100000 |
In our current implementation of Txn, there are some parts that can be optimized, as follows:

- [ ] Read-only Txn requests do not need to go through the consensus protocol; we can optimize read-only requests through the `ReadState` mechanism
- [ ] Provide concurrent read and write capabilities at the storage layer
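The `ReadState` idea can be illustrated with a minimal sketch (the types and names here are hypothetical, not Xline's actual API): a read-only request snapshots the commit index when it arrives and is served locally once the state machine has applied up to that point, so it never has to enter the consensus log.

```rust
// Hypothetical, simplified model of the ReadState mechanism.
struct ReadState {
    commit_index: u64,  // highest index known to be committed by consensus
    applied_index: u64, // highest index the local state machine has applied
}

impl ReadState {
    /// A read observed `observed_commit` on arrival; it can be served
    /// locally once the apply loop has caught up to that point.
    fn can_serve_read(&self, observed_commit: u64) -> bool {
        self.applied_index >= observed_commit
    }
}

fn main() {
    let mut rs = ReadState { commit_index: 10, applied_index: 8 };
    let observed = rs.commit_index; // snapshot at read arrival
    assert!(!rs.can_serve_read(observed)); // must wait: 8 < 10
    rs.applied_index = 10; // apply loop catches up
    assert!(rs.can_serve_read(observed)); // now safe to read locally
}
```

The read thus pays only the cost of waiting for the apply loop, not a full round of consensus.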
We have identified the bottleneck behind the performance degradation of Xline TXN by profiling with perf. The profile shows that `Index::mark_available` is the bottleneck. We tried removing this `mark_available` method and ran the benchmark again. All the tests listed below use the same benchmark command, `./benchmark put 10 100000`, which means using 10 clients to put keys with a key space size of 100K.
engine | remove mark_available | total | slowest | fastest | avg | rps | 99% |
---|---|---|---|---|---|---|---|
memory | no | 73.6666 s | 71.5 ms | 2.1 ms | 7.4 ms | 1357.4677 | 13.4 ms |
memory | yes | 56.5029 s | 77.8 ms | 2.3 ms | 5.6 ms | 1769.8196 | 8.7 ms |
rocksdb | no | 295.0114 s | 83.1 ms | 6.7 ms | 29.5 ms | 338.9699 | 49.7 ms |
rocksdb | yes | 59.3128 s | 98.2 ms | 2.3 ms | 5.9 ms | 1685.9769 | 9.2 ms |
## Current Problem
Currently, Xline uses a version-based storage schema to achieve the multi-version storage feature for key-value pairs. During the actual storage of a key-value pair (the process of `sync_put_request`), the following two steps are involved:

1. Execute `Index::insert_or_update_revision` to generate a `KeyRevision` for the key and insert it into the Index.
2. Convert the `KeyRevision` from step 1 into a `Revision` and store the value with the `Revision` as the key.

When getting a key-value pair (the process of `handle_range_request`), the corresponding `KeyRevision` needs to be read from the `Index` before accessing the database.
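The two-step write path and the index-first read path can be sketched with a toy model (the `Revision`/`KeyRevision` structs here are simplified stand-ins, not Xline's real definitions):

```rust
use std::collections::{BTreeMap, HashMap};

// Simplified stand-ins for Xline's revision types.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Revision { main: i64, sub: i64 }

#[derive(Clone, Copy, Debug)]
struct KeyRevision { mod_revision: i64, sub_revision: i64 }

fn main() {
    // Index: key -> revision history; DB: revision -> value.
    let mut index: BTreeMap<String, Vec<KeyRevision>> = BTreeMap::new();
    let mut db: HashMap<Revision, String> = HashMap::new();

    // Step 1: generate a KeyRevision and insert it into the Index.
    let kr = KeyRevision { mod_revision: 5, sub_revision: 0 };
    index.entry("foo".to_string()).or_default().push(kr);

    // Step 2: convert it into a Revision and store the value under it.
    let rev = Revision { main: kr.mod_revision, sub: kr.sub_revision };
    db.insert(rev, "bar".to_string());

    // Read path (handle_range_request): Index lookup first, then DB.
    let latest = *index["foo"].last().unwrap();
    let rev = Revision { main: latest.mod_revision, sub: latest.sub_revision };
    assert_eq!(db.get(&rev).map(String::as_str), Some("bar"));
}
```

A reader that runs between step 1 and step 2 sees the `KeyRevision` in the index while `db` still lacks the value, which is exactly the race described below.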
Since Xline does not implement real MVCC transactions, there is no atomicity guarantee between steps 1 and 2. This is not an issue for write requests such as `PutRequest` or `DeleteRangeRequest`, because CURP's conflict detection mechanism can replace transaction functionality, ensuring that different write requests do not cause concurrency issues.

However, since `RangeRequest`s do not go through the CURP protocol, the following problem arises: when `handle_range_request` starts executing before step 2 is completed in `sync_put_request`, it will find the corresponding `KeyRevision` in the index, but the corresponding `KeyValue` cannot be found in the database. (Note: this problem only occurs in serial reads; there is no such issue in linear reads.)
To avoid this situation, the original approach was to add an "available" field in `KeyRevision`, which is only marked as available after the corresponding `KeyValue` is successfully written to disk. However, this approach has performance issues: the implementation of `mark_available` involves complex traversal and lookup operations, leading to significant lock contention.
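A minimal illustration of that workaround, with a hypothetical `available` flag and a `latest_available` lookup helper (not the actual Xline code):

```rust
// Hypothetical: each KeyRevision carries an `available` flag, and
// readers skip revisions whose value has not yet been flushed to disk.
struct KeyRevision {
    mod_revision: i64,
    sub_revision: i64,
    available: bool,
}

/// Return the newest revision whose value is already persisted.
fn latest_available(revs: &[KeyRevision]) -> Option<&KeyRevision> {
    revs.iter().rev().find(|r| r.available)
}

fn main() {
    let revs = vec![
        KeyRevision { mod_revision: 4, sub_revision: 0, available: true },
        // Step 2 for revision 5 has not completed yet:
        KeyRevision { mod_revision: 5, sub_revision: 0, available: false },
    ];
    // The reader falls back to the last revision whose value is on disk.
    assert_eq!(latest_available(&revs).unwrap().mod_revision, 4);
}
```

The costly part in practice is not this read path but flipping the flag after each flush, which requires locating the revision inside the index under a lock.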
## Possible Solutions

There are two possible solutions to this problem.
### Solution One: Remove "available" and implement compensation logic in the `RangeRequest` handling process
Specific steps:

- Remove all implementations related to "available" in the Index, including `mark_available`, to eliminate expensive contention operations.
- Implement a check in the Range request: if a `Revision` is read from the index but the corresponding key-value pair cannot be found in the database, return the old value.
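The compensation check could look roughly like this (a hypothetical `range_with_compensation` helper over a simplified revision-to-value map, not a proposed concrete implementation):

```rust
use std::collections::HashMap;

/// Hypothetical compensation logic for Solution One: walk the key's
/// revisions from newest to oldest and return the first value that is
/// actually present in the database, falling back past any revision
/// whose value has not been flushed yet.
fn range_with_compensation<'a>(
    revs: &[i64],                 // revisions from the Index, ascending
    db: &'a HashMap<i64, String>, // revision -> value
) -> Option<&'a String> {
    revs.iter().rev().find_map(|rev| db.get(rev))
}

fn main() {
    let mut db = HashMap::new();
    db.insert(4, "old".to_string());
    // Revision 5 is in the Index, but its value is not flushed yet.
    let revs = vec![4, 5];
    assert_eq!(range_with_compensation(&revs, &db).unwrap(), "old");
}
```

Note how the helper silently tolerates an index/database mismatch, which is precisely the invariant violation listed under Cons below.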
#### Pros

The modification is simple and yields a significant performance improvement without abandoning the lock-free implementation.
#### Cons
- Poor maintainability:
  - The Range request has various constraints such as OneKey, AllKeys, and Range, making the judgment logic complex.
  - One of the invariants of the underlying database is that the number of `Revision`s read from the index is always equal to the number of `KeyValue`s read from the database. This solution breaks that invariant. In the future, if an error causes the numbers of `Revision`s and `KeyValue`s to be inconsistent, it will be challenging to distinguish between errors and compensation logic.
  - It adds cognitive burden, as the problem arises from `sync_put_request` but the resolution logic lives in `handle_range_request`. Future modifications in these areas need to be handled with great care to avoid inadvertently introducing obscure bugs.
### Solution Two: Change the operation logic
Change the storage operation logic from updating the Index first and then flushing to disk, to first flushing to disk successfully and then updating the Index.
Specific steps:

1. Change the Index lock to `RwLock` and split `insert_or_update_revision` in the Index into `register_revision` (for calculating the key to be flushed, a read-only operation) and `insert_revision` (for updating the index, a write-only operation).
2. Modify the logic of `sync_put_request`: first use `register_revision` to register the corresponding key, then generate the corresponding ops to be executed by the `CommandExecutor` (CE).
3. The CE executes `flush_ops`. If `flush_ops` contains a `PutKeyValue(Revision, KeyValue)` operation, it calculates the corresponding `KeyRevision` based on the `Revision` and `KeyValue` and inserts it into the Index.
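A rough sketch of this split, with simplified `register_revision`/`insert_revision` signatures that are assumptions for illustration rather than Xline's real API:

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

// Hypothetical sketch of Solution Two: the Index sits behind an RwLock,
// `register_revision` only reads it, and `insert_revision` updates it
// after the value is safely on disk.
struct Index {
    inner: RwLock<BTreeMap<String, Vec<i64>>>,
}

impl Index {
    /// Read-only: compute the revision the new value will be stored
    /// under, without mutating the index.
    fn register_revision(&self, _key: &str, next_rev: i64) -> i64 {
        let _guard = self.inner.read().unwrap(); // readers don't block each other
        next_rev
    }

    /// Write: called by the CE only after flush_ops has persisted the
    /// value, so the index never points at a missing KeyValue.
    fn insert_revision(&self, key: &str, rev: i64) {
        self.inner.write().unwrap().entry(key.to_string()).or_default().push(rev);
    }
}

fn main() {
    let index = Index { inner: RwLock::new(BTreeMap::new()) };
    let rev = index.register_revision("foo", 5);
    // ... flush PutKeyValue(rev, value) to disk here ...
    index.insert_revision("foo", rev);
    assert_eq!(index.inner.read().unwrap()["foo"], vec![5]);
}
```

Because the index is updated only after the flush succeeds, a concurrent range read can no longer observe a revision whose value is absent from the database.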
#### Pros

- After changing the Index to `RwLock`, the concurrency between multiple `RangeRequest`s is improved.
- Easy to maintain: it is transparent to upper-level business logic and does not invade unrelated request processing flows, such as `handle_range_request`. It also does not need to break the invariant of the underlying database.
- Resolves the transaction problem between `Index` updates and flushing to disk at the root cause, and can be further optimized with a lock-free solution.
#### Cons
- Extensive modifications: it involves changes to some underlying processes and data structures, so the changes cascade through the codebase.
- The performance improvement may not be as significant as Solution One's, because there is still lock contention. However, compared to the original solution, there will be improvements in two aspects: