agatedb
Proposal: Ingest External SST
This issue describes a possible way to ingest external SST files into a running agatedb instance, just as rocksdb does.
Background
TiKV uses the ingest-external-SST feature for some use cases. To integrate agatedb into TiKV, we need this feature.
According to Creating and Ingesting SST files · facebook/rocksdb Wiki (github.com), ingesting an SST takes the following steps:
- copy or link the file(s) into the DB directory
- block (not skip) writes to the DB and assign a correct sequence number to the ingested file
- flush the memtable if ranges overlap (the SST file's key range overlaps with the memtable's)
- assign the file to the LSM-tree
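The overlap test behind step 3 is just an interval-intersection check. A minimal Rust sketch of that check (the type and function names here are illustrative, not rocksdb's or agatedb's actual API):

```rust
// Hypothetical sketch: deciding whether step 3 (memtable flush) is
// needed. `KeyRange` is an illustrative stand-in, not a real agatedb type.
#[derive(Debug)]
struct KeyRange {
    smallest: Vec<u8>,
    largest: Vec<u8>,
}

/// Two inclusive key ranges overlap iff neither lies strictly
/// before the other (byte-wise lexicographic comparison).
fn ranges_overlap(a: &KeyRange, b: &KeyRange) -> bool {
    a.smallest <= b.largest && b.smallest <= a.largest
}

/// rocksdb flushes the memtable only when the ingested file's key
/// range overlaps with it.
fn must_flush_memtable(sst: &KeyRange, memtable: &KeyRange) -> bool {
    ranges_overlap(sst, memtable)
}
```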
Things are different in agatedb.
Analysis
Steps 1 and 4 are the same.
For step 3, it is unnecessary for agatedb to flush the memtable on overlap. Agatedb reads all levels on a get and picks the latest version (less than or equal to read_ts), so it's okay to have new keys below old keys in the LSM-tree. This makes the whole process much more efficient. (Please point out if I'm wrong.)
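A toy illustration of that read rule, assuming the versions gathered from all levels arrive as plain `(version, value)` pairs (agatedb actually encodes the version into the key, so this is only a model):

```rust
// Illustrative sketch of why flushing is unnecessary: a read collects
// candidate versions from every level, in no particular order, and
// keeps the newest one visible at `read_ts`. An older version sitting
// below a newer one in the tree is still resolved correctly.
fn pick_visible(candidates: &[(u64, String)], read_ts: u64) -> Option<String> {
    candidates
        .iter()
        .filter(|(ts, _)| *ts <= read_ts)      // only versions visible to this txn
        .max_by_key(|(ts, _)| *ts)             // newest visible version wins
        .map(|(_, v)| v.clone())
}
```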
For step 2, it is the most difficult part: how do we protect ACID during and after ingestion?
Agatedb has ACID transactions (SSI). The architecture is totally different from rocksdb's: we have a read_ts and a commit_ts, and we run a conflict check on commit.
Possible Implementation
Main idea: regard ingesting SST files as writing a large range of keys, and make the whole process a transaction.
Ingesting files is expected to be an expensive job.
Time Point: before commit
- Add files: fetch new file ids from `LevelsController`, then move or copy the files into the DB dir under the allocated file ids (names).
- Verify the file checksums (optional) and read file information (just open `Table`?).
- Make a new transaction and mark update as true (add an ingest flag to the transaction?).
- (TBD) other checks…
Check files before starting a txn.
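For the file-id step, a hypothetical helper mapping an id allocated by `LevelsController` to a target path inside the DB dir; the `{:06}.sst` naming is an assumption borrowed from common LSM engines, not agatedb's real scheme:

```rust
use std::path::PathBuf;

// Hypothetical: build the destination path for an ingested file from
// the DB directory and the allocated file id. The zero-padded naming
// is illustrative only.
fn ingest_target_path(db_dir: &str, file_id: u64) -> PathBuf {
    PathBuf::from(db_dir).join(format!("{:06}.sst", file_id))
}
```

The external file would then be moved (or copied, if it lives on another filesystem) to this path before the transaction starts.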
Time Point: commit
- Same as current, but add an ingest range field to `CommittedTxn` in `CommitInfo`, containing the smallest and largest keys of each ingested file. This enables a fast but inaccurate conflict check; otherwise we would need to calculate the hash of every key in the SST, which really takes time.
- Set commit_ts as each Table's `global_version`.
- Send to the write channel as usual (see below).
- Process the ingest: find a suitable level for each file. (This takes the write lock of `LevelHandler` and will block reads from other txns.)
- Write the manifest file. Each file's `global_version` is stored in the manifest.
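The first two commit steps can be sketched as follows; `IngestTable` and the shape of `CommitInfo` here are illustrative stand-ins for agatedb's real types:

```rust
// Illustrative stand-ins for the real agatedb structures.
#[derive(Debug)]
struct IngestTable {
    smallest: Vec<u8>,
    largest: Vec<u8>,
    global_version: u64,
}

#[derive(Debug, Default)]
struct CommitInfo {
    /// (smallest, largest) per ingested file, kept for the fast
    /// range-based conflict check described above.
    ingest_ranges: Vec<(Vec<u8>, Vec<u8>)>,
}

/// Stamp every ingested table with commit_ts as its global version and
/// record its key range in CommitInfo.
fn apply_commit_ts(tables: &mut [IngestTable], info: &mut CommitInfo, commit_ts: u64) {
    for t in tables {
        t.global_version = commit_ts;
        info.ingest_ranges.push((t.smallest.clone(), t.largest.clone()));
    }
}
```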
A question here: I see we keep the order of Requests sent to the write channel the same as commit_ts order, and I wonder why. Is it to keep the WAL in order?
If the ingest job can be out of order, I think it's possible and better to run it in the current thread instead of sending it to the write channel.
You can see that the whole commit process has only one small I/O: appending to the manifest. But picking a level takes time if there are many tables in that level, and this happens under the write lock of `LevelHandler`.
Conflict Check
There are two ways to do the conflict check, as I described in Time Point: commit (step 1).
- Calculate the hash of every key in the ingested files, and then the conflict check works the same as it does now. To address the performance problem, we can calculate the key hashes before the transaction.
- Add an ingest range field to `CommittedTxn` in `CommitInfo`. For any update txn, when adding read keys, also record the smallest and largest read keys for the txn. When checking conflicts, besides checking hashes, also check whether the read range overlaps with the ingest range. This is really inaccurate and will cause many txns to conflict, but I think it makes sense.
The second way requires much more refactoring of the transaction code.
(A simpler way is to break ACID… hhh)
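The second option's range check reduces to interval intersection between the ingest ranges and a reader's key span. A hedged sketch (all names hypothetical):

```rust
// Range-based conflict check (the second, inaccurate-but-fast option):
// a committed ingest conflicts with a reader txn whose
// [read_min, read_max] span intersects any ingested file's range.
fn ingest_conflicts(
    ingest_ranges: &[(Vec<u8>, Vec<u8>)],
    read_min: &[u8],
    read_max: &[u8],
) -> bool {
    ingest_ranges
        .iter()
        .any(|(lo, hi)| lo.as_slice() <= read_max && read_min <= hi.as_slice())
}
```

Note this never misses a real conflict but can flag false ones, which is exactly the trade-off described above.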
Global Version
This concept is the same as in rocksdb. All keys in an ingested file have wrong versions, and it's unreasonable to modify each of them. When a file has a global version, every key in that file takes this version.
In recent rocksdb, this value is stored in the manifest to avoid random writes to ingested files.
We need to add a field to meta.proto and update the Table iterator (and the block iterator as well) to support this feature.
`global_version` may be ambiguous; maybe pick another name during implementation.
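A tiny sketch of how an iterator could apply the file-level version: when a table carries a global version, it overrides the (meaningless) version stored with each key. Purely illustrative, not agatedb's actual iterator code:

```rust
// Illustrative table metadata: ingested tables carry Some(version),
// normally-built tables carry None.
struct TableMeta {
    global_version: Option<u64>,
}

/// The version an iterator should report for a key: the table-wide
/// global version if present, otherwise the version stored in the block.
fn effective_version(meta: &TableMeta, stored_version: u64) -> u64 {
    meta.global_version.unwrap_or(stored_version)
}
```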
Discuss
- A new transaction struct only for ingest, or refactor the current one?
- Which way to check conflicts? (Or any better idea?)
- Send to the write channel, or do the work in the current thread?
- Any good ideas for implementing this feature?
- Strategy to pick a level. I think checking for overlap from the bottom up is nice.
- What if the external files (in one job) have overlapping ranges? Rocksdb puts all such files at level 0.
A question here: I see we keep the order of Requests sent to the write channel the same as commit_ts order, and I wonder why. Is it to keep the WAL in order?
That's because we should store transactions contiguously in the WAL to achieve atomicity. Badger checks the consistency of the WAL and truncates it if necessary.
I think maybe it's not necessary to use a transaction or consider the write channel. We could use the oracle to perform the conflict check and get a commit timestamp, and after getting the commit timestamp successfully, we just update the file and place it on level 0.
We could use the oracle to perform the conflict check and get a commit timestamp
The API of the oracle needs a txn struct, and I just don't want to break the API. Also, the conflict check of the oracle needs the hashes of the transaction's written keys, which is the `conflict_keys` field in `Transaction`.
So it's unavoidable to record the write keys in the ingested files.
https://github.com/tikv/agatedb/blob/475e17bce85cf5985161004c068616a594923365/src/ops/oracle.rs#L177
https://github.com/tikv/agatedb/blob/475e17bce85cf5985161004c068616a594923365/src/ops/oracle.rs#L46
after getting the commit timestamp successfully, we just update the file and place it on level 0
Due to the particularity of level 0, putting SSTs only at level 0 may cause significant read amplification and write stalls. Putting files at lower levels reduces the read/write amplification caused by compaction, but causes space amplification.
I had a look at rocksdb's strategy:
We pick the lowest level in the LSM-Tree that satisfies these conditions
- The file can fit in the level
- The file's key range doesn't overlap with any keys in upper layers
- The file doesn't overlap with the outputs of running compactions going to this level
It makes sense, and I forgot the third point~.
For the second point, it's because rocksdb keeps newer keys at higher levels, which is unnecessary in agatedb.
I'll try ingesting a file in rocksdb and see the result.
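For reference, rocksdb's three conditions above can be sketched as a bottom-up level search. This is a simplified model (key ranges only, with "can fit" read as "no overlap in the target level"), not rocksdb's or agatedb's actual code:

```rust
// Illustrative: levels indexed 0 (top) .. n-1 (bottom); each level is
// just a list of table key ranges.
struct Range(Vec<u8>, Vec<u8>);

fn overlaps(a: &Range, b: &Range) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}

/// Return the deepest level where the file overlaps nothing in upper
/// levels, nothing in the level itself, and no running-compaction
/// output targeting that level. Fall back to level 0 otherwise.
fn pick_level(
    file: &Range,
    level_tables: &[Vec<Range>],   // tables per level
    compaction_out: &[Vec<Range>], // in-flight compaction outputs per level
) -> usize {
    let n = level_tables.len();
    'outer: for target in (1..n).rev() {
        // condition 2: no overlap with any keys in upper layers
        for upper in 0..target {
            if level_tables[upper].iter().any(|r| overlaps(r, file)) {
                continue 'outer;
            }
        }
        // conditions 1 and 3: the level itself and compactions going
        // into it must not overlap the file
        if level_tables[target].iter().any(|r| overlaps(r, file))
            || compaction_out[target].iter().any(|r| overlaps(r, file))
        {
            continue;
        }
        return target;
    }
    0
}
```

As noted above, agatedb could drop the upper-layer check (condition 2) since reads always consult every level.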
The API of the oracle needs a txn struct, and I just don't want to break the API. Also, the conflict check of the oracle needs the hashes of the transaction's written keys, which is the `conflict_keys` field in `Transaction`. So it's unavoidable to record the write keys in the ingested files.
We certainly should record write keys for the conflict check. I mean adding a method to the oracle to get a commit timestamp for the file-ingestion scenario.
Due to the particularity of level 0, putting SSTs only at level 0 may cause significant read amplification and write stalls. Putting files at lower levels reduces the read/write amplification caused by compaction, but causes space amplification.
Yes, it will cause write stalls if many files are ingested at the same time, but in my opinion it will not cause read amplification, since we always read every SSTable in the LSM tree.
Maybe we could refer to compacting to level base and consider putting the file at level base first?
We certainly should record write keys for the conflict check. I mean adding a method to the oracle to get a commit timestamp for the file-ingestion scenario.
Okay, I got it.
Yes, it will cause write stalls if many files are ingested at the same time, but in my opinion it will not cause read amplification, since we always read every SSTable in the LSM tree.
Maybe we could refer to compacting to level base and consider putting the file at level base first?
Oh yes, I was wrong about level 0's read amplification, but the higher we put ingested files, the more read/write amplification from compaction we get.
What does level base mean? The target level for L0 to compact into, or something else?
What does level base mean? The target level for L0 to compact into, or something else?
Yes, the target level into which level 0 will merge its data.
About how to check whether files to ingest overlap with running compaction outputs:
I saw we have `CompactStatus` to trace running compactions. When picking and checking $L_i$, I should check that the file does not overlap with `levels[i].ranges`. But one situation may happen.
Suppose a running compaction picks the $L_{i-1}$ table with range [10..50], and we have the tables below at $L_i$:
Level i-1: [10..50] <- pick this to compact to Level i
Level i: [1..5] [20..30] [40..60]
The `next_range` we collect for the next level to trace the range is [20..60], but the actual compaction output is [10..60]. If the file to ingest has range [6..15], it doesn't overlap with [20..60] but does overlap with [10..60].
So I can't use `CompactStatus` to check overlap based on the current implementation.
Any good ideas? I think we can extend `next_range` with `this_range`, but I'm not sure whether it's possible. @GanZiheng
I think we can extend `next_range` with `this_range`
I think that's OK.
By the way, I think this problem only happens when ingesting files, and will not appear in the current implementation.
By the way, I think this problem only happens when ingesting files, and will not appear in the current implementation.
Yes, it only happens when ingesting files. I didn't make myself clear.
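The extend-`next_range`-with-`this_range` idea can be illustrated with the [10..50] / [20..60] example above (plain integer endpoints stand in for keys; names are hypothetical):

```rust
// Extend the tracked output range with the compaction's input range,
// so the overlap check sees the true output span.
fn extend(this_range: (u32, u32), next_range: (u32, u32)) -> (u32, u32) {
    (this_range.0.min(next_range.0), this_range.1.max(next_range.1))
}

fn overlaps(a: (u32, u32), b: (u32, u32)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}
```

With the numbers from the example: `next_range` alone is (20, 60) and misses the ingest candidate (6, 15), while `extend((10, 50), (20, 60))` yields (10, 60), which correctly reports the overlap.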