agatedb
Proposal: Ingest External SST
This issue describes a possible way to ingest external SST files into a running agatedb instance, just as rocksdb does.
Background
TiKV uses the ingest-external-SST feature for some use cases. To integrate agatedb into TiKV, we need this feature.
According to Creating and Ingesting SST files · facebook/rocksdb Wiki (github.com), ingesting an SST takes the following steps:
- copy or link the file(s) into the DB directory
- block (not skip) writes to the DB and assign a correct sequence number to the ingested file
- flush the memtable if ranges overlap (the SST file's key range overlaps with the memtable's)
- assign the file to the LSM-tree
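The overlap test behind step 3 is just an interval-intersection check. A minimal Rust sketch of that check (the type and function names here are illustrative, not rocksdb's or agatedb's actual API):

```rust
// Hypothetical sketch: deciding whether step 3 (memtable flush) is
// needed. `KeyRange` is an illustrative stand-in, not a real agatedb type.
#[derive(Debug)]
struct KeyRange {
    smallest: Vec<u8>,
    largest: Vec<u8>,
}

/// Two inclusive key ranges overlap iff neither lies strictly
/// before the other (byte-wise lexicographic comparison).
fn ranges_overlap(a: &KeyRange, b: &KeyRange) -> bool {
    a.smallest <= b.largest && b.smallest <= a.largest
}

/// rocksdb flushes the memtable only when the ingested file's key
/// range overlaps with it.
fn must_flush_memtable(sst: &KeyRange, memtable: &KeyRange) -> bool {
    ranges_overlap(sst, memtable)
}
```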
Things are different in agatedb.
Analysis
Steps 1 and 4 are the same.
For step 3, it is unnecessary for agatedb to flush the memtable on overlap. Agatedb reads all levels on a get and picks the latest version (less than or equal to read_ts), so it's okay to have new keys below old keys in the LSM-tree. This makes the whole process much more efficient. (Please point out if I'm wrong.)
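A toy illustration of that read rule, assuming the versions gathered from all levels arrive as plain `(version, value)` pairs (agatedb actually encodes the version into the key, so this is only a model):

```rust
// Illustrative sketch of why flushing is unnecessary: a read collects
// candidate versions from every level, in no particular order, and
// keeps the newest one visible at `read_ts`. An older version sitting
// below a newer one in the tree is still resolved correctly.
fn pick_visible(candidates: &[(u64, String)], read_ts: u64) -> Option<String> {
    candidates
        .iter()
        .filter(|(ts, _)| *ts <= read_ts)      // only versions visible to this txn
        .max_by_key(|(ts, _)| *ts)             // newest visible version wins
        .map(|(_, v)| v.clone())
}
```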
For step 2, it is the most difficult part: how do we protect ACID during and after ingestion?
Agatedb has ACID transactions (SSI). The architecture is totally different from rocksdb's: we have a read_ts and a commit_ts, and we run a conflict check on commit.
Possible Implementation
Main idea: regard ingesting SST files as writing a large range of keys, and make the whole process a transaction.
Ingesting files is expected to be an expensive job.
Time Point: before commit
- Add files: fetch new file ids from `LevelsController`, then move or copy the files into the DB dir under the allocated file ids (names).
- Verify the file checksums (optional) and read file information (just open `Table`?).
- Make a new transaction and mark update as true (add an ingest flag to the transaction?).
- (TBD) other checks…
Check files before starting a txn.
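For the file-id step, a hypothetical helper mapping an id allocated by `LevelsController` to a target path inside the DB dir; the `{:06}.sst` naming is an assumption borrowed from common LSM engines, not agatedb's real scheme:

```rust
use std::path::PathBuf;

// Hypothetical: build the destination path for an ingested file from
// the DB directory and the allocated file id. The zero-padded naming
// is illustrative only.
fn ingest_target_path(db_dir: &str, file_id: u64) -> PathBuf {
    PathBuf::from(db_dir).join(format!("{:06}.sst", file_id))
}
```

The external file would then be moved (or copied, if it lives on another filesystem) to this path before the transaction starts.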
Time Point: commit
- Same as current, but add an ingest range field to `CommittedTxn` in `CommitInfo`, containing the smallest and largest keys of each ingested file. This enables a fast but inaccurate conflict check; otherwise we would need to calculate the hash of every key in the SST, which really takes time.
- Set commit_ts as each Table's `global_version`.
- Send to the write channel as usual (see below).
- Process the ingest: find a suitable level for each file. (This takes the write lock of `LevelHandler` and will block reads from other txns.)
- Write the manifest file. Each file's `global_version` is stored in the manifest.
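The first two commit steps can be sketched as follows; `IngestTable` and the shape of `CommitInfo` here are illustrative stand-ins for agatedb's real types:

```rust
// Illustrative stand-ins for the real agatedb structures.
#[derive(Debug)]
struct IngestTable {
    smallest: Vec<u8>,
    largest: Vec<u8>,
    global_version: u64,
}

#[derive(Debug, Default)]
struct CommitInfo {
    /// (smallest, largest) per ingested file, kept for the fast
    /// range-based conflict check described above.
    ingest_ranges: Vec<(Vec<u8>, Vec<u8>)>,
}

/// Stamp every ingested table with commit_ts as its global version and
/// record its key range in CommitInfo.
fn apply_commit_ts(tables: &mut [IngestTable], info: &mut CommitInfo, commit_ts: u64) {
    for t in tables {
        t.global_version = commit_ts;
        info.ingest_ranges.push((t.smallest.clone(), t.largest.clone()));
    }
}
```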
A question here: I see we keep the order of Requests sent to the write channel the same as commit_ts order, and I wonder why. Is it to keep the WAL in order?
If the ingest job can be out of order, I think it's possible and better to run it in the current thread instead of sending it to the write channel.
You can see that the whole commit process has only one small I/O: appending to the manifest. But picking a level takes time if there are many tables in that level, and this happens under the write lock of `LevelHandler`.
Conflict Check
There are two ways to do the conflict check, as I described in Time Point: commit (step 1).
- Calculate the hash of every key in the ingested files, and then the conflict check works the same as it does now. To address the performance problem, we can calculate the key hashes before the transaction.
- Add an ingest range field to `CommittedTxn` in `CommitInfo`. For any update txn, when adding read keys, also record the smallest and largest read keys for the txn. When checking conflicts, besides checking hashes, also check whether the read range overlaps with the ingest range. This is really inaccurate and will cause many txns to conflict, but I think it makes sense.
The second way requires much more refactoring of the transaction code.
(A simpler way is to break ACID… hhh)
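The second option's range check reduces to interval intersection between the ingest ranges and a reader's key span. A hedged sketch (all names hypothetical):

```rust
// Range-based conflict check (the second, inaccurate-but-fast option):
// a committed ingest conflicts with a reader txn whose
// [read_min, read_max] span intersects any ingested file's range.
fn ingest_conflicts(
    ingest_ranges: &[(Vec<u8>, Vec<u8>)],
    read_min: &[u8],
    read_max: &[u8],
) -> bool {
    ingest_ranges
        .iter()
        .any(|(lo, hi)| lo.as_slice() <= read_max && read_min <= hi.as_slice())
}
```

Note this never misses a real conflict but can flag false ones, which is exactly the trade-off described above.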
Global Version
This concept is the same as in rocksdb. All keys in an ingested file have wrong versions, and it's unreasonable to modify each of them. When a file has a global version, every key in that file takes this version.
In recent rocksdb, this value is stored in the manifest to avoid random writes to ingested files.
We need to add a field to meta.proto and update the Table iterator (and the block iterator as well) to support this feature.
`global_version` may be ambiguous; maybe pick another name during implementation.
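A tiny sketch of how an iterator could apply the file-level version: when a table carries a global version, it overrides the (meaningless) version stored with each key. Purely illustrative, not agatedb's actual iterator code:

```rust
// Illustrative table metadata: ingested tables carry Some(version),
// normally-built tables carry None.
struct TableMeta {
    global_version: Option<u64>,
}

/// The version an iterator should report for a key: the table-wide
/// global version if present, otherwise the version stored in the block.
fn effective_version(meta: &TableMeta, stored_version: u64) -> u64 {
    meta.global_version.unwrap_or(stored_version)
}
```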
Discuss
- A new transaction struct only for ingest, or refactor the current one?
- Which way to check conflicts? (Or any better idea?)
- Send to the write channel, or do the work in the current thread?
- Any good ideas for implementing this feature?
- Strategy to pick a level. I think checking for overlap from the bottom up is nice.
- What if the external files (in one job) have overlapping ranges? Rocksdb puts all such files at level 0.
A question here: I see we keep the order of Requests sent to the write channel the same as commit_ts order, and I wonder why. Is it to keep the WAL in order?
That's because we should store transactions contiguously in the WAL to achieve atomicity. Badger checks the consistency of the WAL and truncates it if necessary.
I think maybe it's not necessary to use a transaction or consider the write channel. We could use the oracle to perform the conflict check and get a commit timestamp, and after getting the commit timestamp successfully, we just update the file and place it on level 0.
We could use the oracle to perform the conflict check and get a commit timestamp
The API of the oracle needs a txn struct, and I just don't want to break the API. Also, the conflict check of the oracle needs the hashes of the transaction's written keys, which is the `conflict_keys` field in `Transaction`.
So it's unavoidable to record the write keys in the ingested files.
https://github.com/tikv/agatedb/blob/475e17bce85cf5985161004c068616a594923365/src/ops/oracle.rs#L177
https://github.com/tikv/agatedb/blob/475e17bce85cf5985161004c068616a594923365/src/ops/oracle.rs#L46
after getting the commit timestamp successfully, we just update the file and place it on level 0
Due to the particularity of level 0, putting SSTs only at level 0 may cause significant read amplification and write stalls. Putting files at lower levels reduces the read/write amplification caused by compaction, but causes space amplification.
I had a look at rocksdb's strategy:
We pick the lowest level in the LSM-Tree that satisfies these conditions
- The file can fit in the level
- The file's key range doesn't overlap with any keys in upper layers
- The file doesn't overlap with the outputs of running compactions going to this level
It makes sense, and I forgot the third point~.
For the second point, it's because rocksdb keeps newer keys at higher levels, which is unnecessary in agatedb.
I'll try ingesting a file in rocksdb and see the result.
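For reference, rocksdb's three conditions above can be sketched as a bottom-up level search. This is a simplified model (key ranges only, with "can fit" read as "no overlap in the target level"), not rocksdb's or agatedb's actual code:

```rust
// Illustrative: levels indexed 0 (top) .. n-1 (bottom); each level is
// just a list of table key ranges.
struct Range(Vec<u8>, Vec<u8>);

fn overlaps(a: &Range, b: &Range) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}

/// Return the deepest level where the file overlaps nothing in upper
/// levels, nothing in the level itself, and no running-compaction
/// output targeting that level. Fall back to level 0 otherwise.
fn pick_level(
    file: &Range,
    level_tables: &[Vec<Range>],   // tables per level
    compaction_out: &[Vec<Range>], // in-flight compaction outputs per level
) -> usize {
    let n = level_tables.len();
    'outer: for target in (1..n).rev() {
        // condition 2: no overlap with any keys in upper layers
        for upper in 0..target {
            if level_tables[upper].iter().any(|r| overlaps(r, file)) {
                continue 'outer;
            }
        }
        // conditions 1 and 3: the level itself and compactions going
        // into it must not overlap the file
        if level_tables[target].iter().any(|r| overlaps(r, file))
            || compaction_out[target].iter().any(|r| overlaps(r, file))
        {
            continue;
        }
        return target;
    }
    0
}
```

As noted above, agatedb could drop the upper-layer check (condition 2) since reads always consult every level.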
The API of the oracle needs a txn struct, and I just don't want to break the API. Also, the conflict check of the oracle needs the hashes of the transaction's written keys, which is the `conflict_keys` field in `Transaction`. So it's unavoidable to record the write keys in the ingested files.
We certainly should record write keys for the conflict check. I mean adding a method to the oracle to get a commit timestamp for the file-ingestion scenario.
Due to the particularity of level 0, putting SSTs only at level 0 may cause significant read amplification and write stalls. Putting files at lower levels reduces the read/write amplification caused by compaction, but causes space amplification.
Yes, it will cause write stalls if many files are ingested at the same time, but in my opinion it will not cause read amplification, since we always read every SSTable in the LSM tree.
Maybe we could refer to compacting to level base and consider putting the file at level base first?
We certainly should record write keys for the conflict check. I mean adding a method to the oracle to get a commit timestamp for the file-ingestion scenario.
Okay, I got it.
Yes, it will cause write stalls if many files are ingested at the same time, but in my opinion it will not cause read amplification, since we always read every SSTable in the LSM tree.
Maybe we could refer to compacting to level base and consider putting the file at level base first?
Oh yes, I was wrong about level 0's read amplification, but the higher we put ingested files, the more read/write amplification from compaction we get.
What does level base mean? The target level for L0 to compact into, or something else?
What does level base mean? The target level for L0 to compact into, or something else?
Yes, the target level into which level 0 will merge its data.
About how to check whether files to ingest overlap with running compaction outputs:
I saw we have `CompactStatus` to trace running compactions. When picking and checking $L_i$, I should check that the file does not overlap with `levels[i].ranges`. But one situation may happen.
Suppose a running compaction picks the $L_{i-1}$ table with range [10..50], and we have the tables below at $L_i$:
Level i-1: [10..50] <- pick this to compact to Level i
Level i: [1..5] [20..30] [40..60]
The `next_range` we collect for the next level to trace the range is [20..60], but the actual compaction output is [10..60]. If the file to ingest has range [6..15], it doesn't overlap with [20..60] but does overlap with [10..60].
So I can't use `CompactStatus` to check overlap based on the current implementation.
Any good ideas? I think we can extend `next_range` with `this_range`, but I'm not sure whether it's possible. @GanZiheng
I think we can extend `next_range` with `this_range`
I think that's OK.
By the way, I think this problem only happens when ingesting files, and will not appear in the current implementation.
By the way, I think this problem only happens when ingesting files, and will not appear in the current implementation.
Yes, it only happens when ingesting files. I didn't make myself clear.
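The extend-`next_range`-with-`this_range` idea can be illustrated with the [10..50] / [20..60] example above (plain integer endpoints stand in for keys; names are hypothetical):

```rust
// Extend the tracked output range with the compaction's input range,
// so the overlap check sees the true output span.
fn extend(this_range: (u32, u32), next_range: (u32, u32)) -> (u32, u32) {
    (this_range.0.min(next_range.0), this_range.1.max(next_range.1))
}

fn overlaps(a: (u32, u32), b: (u32, u32)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}
```

With the numbers from the example: `next_range` alone is (20, 60) and misses the ingest candidate (6, 15), while `extend((10, 50), (20, 60))` yields (10, 60), which correctly reports the overlap.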