# TXN optimization
The following benchmark data shows how Xline and etcd process txn-put requests.

Benchmark command:

```
benchmark --clients=${clients} txn-put --key-size=8 --val-size=256 --total=${total} --key-space-size=${key-space-size}
```
## Xline
QPS | Latency(ms) | Time(s) | clients | total | key-space-size |
---|---|---|---|---|---|
324.3 | 3.1 | 6.2 | 1 | 2000 | 1 |
2768.4 | 3.6 | 3.6 | 10 | 10000 | 1 |
5429.5 | 9.2 | 3.7 | 50 | 20000 | 1 |
5498.6 | 18.1 | 3.6 | 100 | 20000 | 1 |
5485.5 | 36.3 | 5.5 | 200 | 30000 | 1 |
331.5 | 3.0 | 6.0 | 1 | 2000 | 100000 |
2700.4 | 3.7 | 3.7 | 10 | 10000 | 100000 |
5470.5 | 9.1 | 3.7 | 50 | 20000 | 100000 |
5563.0 | 17.9 | 3.6 | 100 | 20000 | 100000 |
5530.4 | 36.1 | 5.4 | 200 | 30000 | 100000 |
## etcd
QPS | Latency(ms) | Time(s) | clients | total | key-space-size |
---|---|---|---|---|---|
935.9 | 1.1 | 2.1 | 1 | 2000 | 1 |
3341.5 | 3.0 | 3.0 | 10 | 10000 | 1 |
12219.3 | 4.1 | 1.6 | 50 | 20000 | 1 |
25641.9 | 3.9 | 0.8 | 100 | 20000 | 1 |
16737.8 | 11.9 | 1.8 | 200 | 30000 | 1 |
812.7 | 1.2 | 2.5 | 1 | 2000 | 100000 |
4034.8 | 2.5 | 2.5 | 10 | 10000 | 100000 |
13525.4 | 3.7 | 1.5 | 50 | 20000 | 100000 |
24238.7 | 4.1 | 0.8 | 100 | 20000 | 100000 |
30863.9 | 6.5 | 1.0 | 200 | 30000 | 100000 |
In our current implementation of Txn, there are some parts that can be optimized, as follows:

- [ ] Read-only Txn requests do not need to go through the consensus protocol; we can optimize read-only requests through the `ReadState` mechanism
- [ ] Provide concurrent read and write capabilities at the storage layer
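The `ReadState` idea can be illustrated with a minimal sketch (the types and names here are hypothetical, not Xline's actual API): a read-only request snapshots the commit index when it arrives and is served locally once the state machine has applied up to that point, so it never has to enter the consensus log.

```rust
// Hypothetical, simplified model of the ReadState mechanism.
struct ReadState {
    commit_index: u64,  // highest index known to be committed by consensus
    applied_index: u64, // highest index the local state machine has applied
}

impl ReadState {
    /// A read observed `observed_commit` on arrival; it can be served
    /// locally once the apply loop has caught up to that point.
    fn can_serve_read(&self, observed_commit: u64) -> bool {
        self.applied_index >= observed_commit
    }
}

fn main() {
    let mut rs = ReadState { commit_index: 10, applied_index: 8 };
    let observed = rs.commit_index; // snapshot at read arrival
    assert!(!rs.can_serve_read(observed)); // must wait: 8 < 10
    rs.applied_index = 10; // apply loop catches up
    assert!(rs.can_serve_read(observed)); // now safe to read locally
}
```

The read thus pays only the cost of waiting for the apply loop, not a full round of consensus.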
We have identified the bottleneck behind the performance degradation of Xline TXN by profiling with perf. The profile shows that `Index::mark_available` is the bottleneck. We tried removing this `mark_available` method and ran the benchmark again. All the tests listed below use the same benchmark command, `./benchmark put 10 100000`, which means using 10 clients to put keys with a key space size of 100K.
engine | remove mark_available | total | slowest | fastest | avg | rps | 99% |
---|---|---|---|---|---|---|---|
memory | no | 73.6666 s | 71.5 ms | 2.1 ms | 7.4 ms | 1357.4677 | 13.4 ms |
memory | yes | 56.5029 s | 77.8 ms | 2.3 ms | 5.6 ms | 1769.8196 | 8.7 ms |
rocksdb | no | 295.0114 s | 83.1 ms | 6.7 ms | 29.5 ms | 338.9699 | 49.7 ms |
rocksdb | yes | 59.3128 s | 98.2 ms | 2.3 ms | 5.9 ms | 1685.9769 | 9.2 ms |
## Current Problem
Currently, Xline uses a version-based storage schema to achieve the multi-version storage feature for key-value pairs. During the actual storage of a key-value pair (the process of `sync_put_request`), the following two steps are involved:

1. Execute `Index::insert_or_update_revision` to generate a `KeyRevision` for the key and insert it into the Index.
2. Convert the `KeyRevision` from step 1 into a `Revision` and store the value with the `Revision` as the key.

When getting a key-value pair (the process of `handle_range_request`), the corresponding `KeyRevision` needs to be read from the `Index` before accessing the database.
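The two-step write path and the index-first read path can be sketched with a toy model (the `Revision`/`KeyRevision` structs here are simplified stand-ins, not Xline's real definitions):

```rust
use std::collections::{BTreeMap, HashMap};

// Simplified stand-ins for Xline's revision types.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Revision { main: i64, sub: i64 }

#[derive(Clone, Copy, Debug)]
struct KeyRevision { mod_revision: i64, sub_revision: i64 }

fn main() {
    // Index: key -> revision history; DB: revision -> value.
    let mut index: BTreeMap<String, Vec<KeyRevision>> = BTreeMap::new();
    let mut db: HashMap<Revision, String> = HashMap::new();

    // Step 1: generate a KeyRevision and insert it into the Index.
    let kr = KeyRevision { mod_revision: 5, sub_revision: 0 };
    index.entry("foo".to_string()).or_default().push(kr);

    // Step 2: convert it into a Revision and store the value under it.
    let rev = Revision { main: kr.mod_revision, sub: kr.sub_revision };
    db.insert(rev, "bar".to_string());

    // Read path (handle_range_request): Index lookup first, then DB.
    let latest = *index["foo"].last().unwrap();
    let rev = Revision { main: latest.mod_revision, sub: latest.sub_revision };
    assert_eq!(db.get(&rev).map(String::as_str), Some("bar"));
}
```

A reader that runs between step 1 and step 2 sees the `KeyRevision` in the index while `db` still lacks the value, which is exactly the race described below.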
Since Xline does not implement real MVCC transactions, there is no atomicity guarantee between steps 1 and 2. This is not an issue for write requests such as `PutRequest` or `DeleteRangeRequest`, because CURP's conflict detection mechanism can replace transaction functionality, ensuring that different write requests do not cause concurrency issues.

However, since `RangeRequest`s do not go through the CURP protocol, the following problem arises: when `handle_range_request` starts executing before step 2 is completed in `sync_put_request`, it will find the corresponding `KeyRevision` in the index, but the corresponding `KeyValue` cannot be found in the database. (Note: this problem only occurs in serial reads; there is no such issue in linear reads.)
To avoid this situation, the original approach was to add an "available" field in `KeyRevision`, which is only marked as available after the corresponding `KeyValue` is successfully written to disk. However, this approach has performance issues: the implementation of `mark_available` involves complex traversal and lookup operations, leading to significant lock contention.
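A minimal illustration of that workaround, with a hypothetical `available` flag and a `latest_available` lookup helper (not the actual Xline code):

```rust
// Hypothetical: each KeyRevision carries an `available` flag, and
// readers skip revisions whose value has not yet been flushed to disk.
struct KeyRevision {
    mod_revision: i64,
    sub_revision: i64,
    available: bool,
}

/// Return the newest revision whose value is already persisted.
fn latest_available(revs: &[KeyRevision]) -> Option<&KeyRevision> {
    revs.iter().rev().find(|r| r.available)
}

fn main() {
    let revs = vec![
        KeyRevision { mod_revision: 4, sub_revision: 0, available: true },
        // Step 2 for revision 5 has not completed yet:
        KeyRevision { mod_revision: 5, sub_revision: 0, available: false },
    ];
    // The reader falls back to the last revision whose value is on disk.
    assert_eq!(latest_available(&revs).unwrap().mod_revision, 4);
}
```

The costly part in practice is not this read path but flipping the flag after each flush, which requires locating the revision inside the index under a lock.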
## Possible Solutions

There are two possible solutions to this problem.
### Solution One: Remove "available" and implement compensation logic in the `RangeRequest` handling process
Specific steps:

- Remove all implementations related to "available" in the Index, including `mark_available`, to eliminate expensive contention operations.
- Implement a check in the Range request: if a `Revision` is read from the index but the corresponding key-value pair cannot be found in the database, return the old value.
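The compensation check could look roughly like this (a hypothetical `range_with_compensation` helper over a simplified revision-to-value map, not a proposed concrete implementation):

```rust
use std::collections::HashMap;

/// Hypothetical compensation logic for Solution One: walk the key's
/// revisions from newest to oldest and return the first value that is
/// actually present in the database, falling back past any revision
/// whose value has not been flushed yet.
fn range_with_compensation<'a>(
    revs: &[i64],                 // revisions from the Index, ascending
    db: &'a HashMap<i64, String>, // revision -> value
) -> Option<&'a String> {
    revs.iter().rev().find_map(|rev| db.get(rev))
}

fn main() {
    let mut db = HashMap::new();
    db.insert(4, "old".to_string());
    // Revision 5 is in the Index, but its value is not flushed yet.
    let revs = vec![4, 5];
    assert_eq!(range_with_compensation(&revs, &db).unwrap(), "old");
}
```

Note how the helper silently tolerates an index/database mismatch, which is precisely the invariant violation listed under Cons below.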
#### Pros

The modification is simple and yields a significant performance improvement without abandoning the lock-free implementation.
#### Cons
- Poor maintainability:
  - The Range request has various constraints such as OneKey, AllKeys, and Range, making the judgment logic complex.
  - One of the invariants of the underlying database is that the number of `Revision`s read from the index is always equal to the number of `KeyValue`s read from the database. This solution breaks that invariant. In the future, if an error causes the numbers of `Revision`s and `KeyValue`s to be inconsistent, it will be challenging to distinguish between errors and compensation logic.
  - It adds cognitive burden, as the problem arises from `sync_put_request` but the resolution logic lives in `handle_range_request`. Future modifications in these areas need to be handled with great care to avoid inadvertently introducing obscure bugs.
### Solution Two: Change the operation logic
Change the storage operation logic from updating the Index first and then flushing to disk, to first flushing to disk successfully and then updating the Index.
Specific steps:

1. Change the Index lock to `RwLock` and split `insert_or_update_revision` in the Index into `register_revision` (for calculating the key to be flushed, a read-only operation) and `insert_revision` (for updating the index, a write-only operation).
2. Modify the logic of `sync_put_request`: first use `register_revision` to register the corresponding key, then generate the corresponding ops to be executed by the `CommandExecutor` (CE).
3. The CE executes `flush_ops`. If `flush_ops` contains a `PutKeyValue(Revision, KeyValue)` operation, it calculates the corresponding `KeyRevision` based on the `Revision` and `KeyValue` and inserts it into the Index.
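A rough sketch of this split, with simplified `register_revision`/`insert_revision` signatures that are assumptions for illustration rather than Xline's real API:

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

// Hypothetical sketch of Solution Two: the Index sits behind an RwLock,
// `register_revision` only reads it, and `insert_revision` updates it
// after the value is safely on disk.
struct Index {
    inner: RwLock<BTreeMap<String, Vec<i64>>>,
}

impl Index {
    /// Read-only: compute the revision the new value will be stored
    /// under, without mutating the index.
    fn register_revision(&self, _key: &str, next_rev: i64) -> i64 {
        let _guard = self.inner.read().unwrap(); // readers don't block each other
        next_rev
    }

    /// Write: called by the CE only after flush_ops has persisted the
    /// value, so the index never points at a missing KeyValue.
    fn insert_revision(&self, key: &str, rev: i64) {
        self.inner.write().unwrap().entry(key.to_string()).or_default().push(rev);
    }
}

fn main() {
    let index = Index { inner: RwLock::new(BTreeMap::new()) };
    let rev = index.register_revision("foo", 5);
    // ... flush PutKeyValue(rev, value) to disk here ...
    index.insert_revision("foo", rev);
    assert_eq!(index.inner.read().unwrap()["foo"], vec![5]);
}
```

Because the index is updated only after the flush succeeds, a concurrent range read can no longer observe a revision whose value is absent from the database.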
#### Pros

- After changing the Index to `RwLock`, the concurrency between multiple `RangeRequest`s is improved.
- Easy to maintain: it is transparent to upper-level business logic and does not invade unrelated request processing flows, such as `handle_range_request`. It also does not need to break the invariant of the underlying database.
- Resolves the transaction problem between `Index` updates and flushing to disk at the root cause, and can be further optimized with a lock-free solution.
#### Cons
- Extensive modifications: it involves changes to some underlying processes and data structures, so the changes cascade through the codebase.
- The performance improvement may not be as significant as Solution One's, because there is still lock contention. However, compared to the original solution, there will be improvements in two aspects: