kvrocks Improve consistency and isolation semantics by adding Context parameter to DB API

Search before asking

[X] I had searched in the issues and found no similar issues.

Motivation

The current DB API may use multiple read operations or have nested calls. These read operations do not use fixed snapshots, which may cause different snapshot data to be read during a single operation, causing inconsistency.

Solution

Referring to the current LatestSnapshot and GetOptions, we can add a Context parameter to each DB API, through which the API can pass a definite snapshot.

After a few simple attempts, I found that changes often have a ripple effect, requiring multiple modules to change their APIs at the same time, which can result in a huge PR that is difficult to break down into multiple smaller PRs. I'm going to try to give a draft PR in the near future to get a rough idea of what to do, and then gradually refine the changes to each module.

Are you willing to submit a PR?

[X] I'm willing to submit a PR!

May 14 '24 14:05 PokIsemaine

Need some help:

struct Context {
  explicit Context(engine::Storage *storage) : storage_(storage), snapshot_(storage->GetDB()->GetSnapshot()) {}

  engine::Storage *storage_ = nullptr;
  const rocksdb::Snapshot *snapshot_ = nullptr;
  rocksdb::WriteBatchWithIndex* batch_ = nullptr;

  Context() = default;
  rocksdb::ReadOptions GetReadOptions();
  const rocksdb::Snapshot *GetSnapShot();
};

This is the general idea: pass the Context in to the Database API so that every call uses one fixed snapshot, which does not change during the entire calling process. When the current operation needs to read its own writes, we use WriteBatchWithIndex + GetFromBatchAndDB, which looks in the batch first and then falls back to the DB at the fixed snapshot: batch => db(snapshot=ctx.snapshot).
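
As a rough illustration of that read path, here is a toy C++ sketch. It does not use the RocksDB API: ToyDB, ToyContext and PendingWrite are hypothetical stand-ins, a snapshot is modeled as an immutable copy of a std::map, and the indexed batch is a per-key map of pending writes.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Toy stand-ins (assumptions for illustration, not RocksDB types).
using Table = std::map<std::string, std::string>;

struct ToyDB {
  std::shared_ptr<const Table> current = std::make_shared<Table>();
  // Publish a new version; snapshots taken earlier keep the old one alive.
  void Put(const std::string &k, const std::string &v) {
    auto next = std::make_shared<Table>(*current);
    (*next)[k] = v;
    current = next;
  }
};

struct PendingWrite {
  bool is_delete;
  std::string value;
};

struct ToyContext {
  std::shared_ptr<const Table> snapshot;      // fixed when the context is created
  std::map<std::string, PendingWrite> batch;  // this operation's own pending writes

  explicit ToyContext(const ToyDB &db) : snapshot(db.current) {}

  void Put(const std::string &k, const std::string &v) { batch[k] = PendingWrite{false, v}; }
  void Delete(const std::string &k) { batch[k] = PendingWrite{true, ""}; }

  // Batch first, then the fixed snapshot.
  bool Get(const std::string &k, std::string *value) const {
    auto b = batch.find(k);
    if (b != batch.end()) {
      if (b->second.is_delete) return false;
      *value = b->second.value;
      return true;
    }
    auto s = snapshot->find(k);
    if (s == snapshot->end()) return false;
    *value = s->second;
    return true;
  }
};

// Returns true if the context ignores a later outside write but sees its own writes.
bool DemoFixedSnapshot() {
  ToyDB db;
  db.Put("a", "1");
  ToyContext ctx(db);
  db.Put("a", "2");  // written by "another request" after the snapshot was taken

  std::string v;
  bool found = ctx.Get("a", &v);
  assert(found);
  bool stable = (v == "1");  // still the snapshot value

  ctx.Put("a", "3");
  found = ctx.Get("a", &v);
  bool own = found && v == "3";  // own pending write wins over the snapshot

  ctx.Delete("a");
  bool hidden = !ctx.Get("a", &v);  // own pending delete hides the snapshot value
  return stable && own && hidden;
}
```

In the real design the snapshot would be a rocksdb::Snapshot carried in ReadOptions rather than a copied map; the sketch only shows the lookup order.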

Most operations nowadays use WriteBatch. Is there any way to copy the operation sequence from WriteBatch to ctx.WriteBatchWithIndex? In this way, I only need to modify the last Write part of storage, instead of modifying ctx.WriteBatchWithIndex when the DB API modifies WriteBatch.

May 16 '24 07:05 PokIsemaine

Most operations nowadays use WriteBatch. Is there any way to copy the operation sequence from WriteBatch to ctx.WriteBatchWithIndex? In this way, I only need to modify the last Write part of storage, instead of modifying ctx.WriteBatchWithIndex when the DB API modifies WriteBatch.

rocksdb has the relation below:

WriteBatchBase
WriteBatchWithIndex : WriteBatchBase
WriteBatch : WriteBatchBase

Should we switch to WriteBatchBase in some places? Besides, WriteBatchWithIndex has a WriteBatch* GetWriteBatch() override; method.

May 16 '24 07:05 mapleFU

Need some help:
struct Context {
  explicit Context(engine::Storage *storage) : storage_(storage), snapshot_(storage->GetDB()->GetSnapshot()) {}

  engine::Storage *storage_ = nullptr;
  const rocksdb::Snapshot *snapshot_ = nullptr;
  rocksdb::WriteBatchWithIndex* batch_ = nullptr;

  Context() = default;
  rocksdb::ReadOptions GetReadOptions();
  const rocksdb::Snapshot *GetSnapShot();
};
This is the general idea: pass in the Context to the Database API, use a fixed snapshot when calling, and this snapshot will not change during the entire calling process. When we need to read our own data for the current operation, we use WriteBatchWithIndex + GetFromBatchAndDB to obtain the data batch=>db(snapshot=ctx.snapshot).

Most operations nowadays use WriteBatch. Is there any way to copy the operation sequence from WriteBatch to ctx.WriteBatchWithIndex? In this way, I only need to modify the last Write part of storage, instead of modifying ctx.WriteBatchWithIndex when the DB API modifies WriteBatch.

Correction: Because there may be multiple Writes, they should be appended to ctx.WriteBatchWithIndex instead of simply copied

May 16 '24 07:05 PokIsemaine

@PokIsemaine I've checked that most output uses GetWriteBatch with a WriteBatchBase. Would that be OK for the scenario here?

May 16 '24 09:05 mapleFU

I think there are currently two ways:

1. Keep the current WriteBatch, then use WriteBatch.Iterator(&handler) when writing, and refer to batch_debugger.h to write a WriteBatch::Handler that appends WriteBatch operations to WriteBatchWithIndex one by one.
2. Get WriteBatchWithIndex in GetWriteBatchBase, but I found that WriteBatchWithIndex cannot support all operations of WriteBatch, such as DeleteRange. Even after using WriteBatchWithIndex::GetWriteBatch after DeleteRange, we cannot index the effect of DeleteRange in Batch through GetFromBatchAndDB.

For the DeleteRange operation, maybe we need to switch to for + Delete, but I don't know if this will have a big performance impact. If it is the first type, we just add Batch but do not perform Write operation, what will happen?
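
To make the first option more concrete, here is a toy C++ sketch of replaying a write log into a per-key index through a handler. All names (ToyWriteBatch, ToyHandler, AppendToIndexHandler, ToyIndexedBatch, Iterate) are hypothetical and only loosely mirror the handler idea; none of them are RocksDB types. It also shows the gap from the second option: a range delete has no per-key entry, so the index cannot reflect it.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-ins (assumptions for illustration, not RocksDB types).
enum class OpType { kPut, kDelete, kDeleteRange };

struct Op {
  OpType type;
  std::string key;    // begin key for kDeleteRange
  std::string value;  // value for kPut, exclusive end key for kDeleteRange
};

// Plays the role of a plain write batch: an ordered log of operations.
struct ToyWriteBatch {
  std::vector<Op> ops;
  void Put(const std::string &k, const std::string &v) { ops.push_back({OpType::kPut, k, v}); }
  void Delete(const std::string &k) { ops.push_back({OpType::kDelete, k, ""}); }
  void DeleteRange(const std::string &b, const std::string &e) {
    ops.push_back({OpType::kDeleteRange, b, e});
  }
};

// Visitor that receives each logged operation in order.
struct ToyHandler {
  virtual ~ToyHandler() = default;
  virtual void OnPut(const std::string &k, const std::string &v) = 0;
  virtual void OnDelete(const std::string &k) = 0;
  virtual void OnDeleteRange(const std::string &b, const std::string &e) = 0;
};

void Iterate(const ToyWriteBatch &batch, ToyHandler *handler) {
  for (const Op &op : batch.ops) {
    switch (op.type) {
      case OpType::kPut: handler->OnPut(op.key, op.value); break;
      case OpType::kDelete: handler->OnDelete(op.key); break;
      case OpType::kDeleteRange: handler->OnDeleteRange(op.key, op.value); break;
    }
  }
}

struct PendingWrite {
  bool is_delete;
  std::string value;
};

// Plays the role of the indexed batch: latest pending state per key.
struct ToyIndexedBatch {
  std::map<std::string, PendingWrite> index;
  int unindexed_range_deletes = 0;  // range deletes the index could not represent
};

struct AppendToIndexHandler : ToyHandler {
  explicit AppendToIndexHandler(ToyIndexedBatch *target) : target_(target) {}
  void OnPut(const std::string &k, const std::string &v) override {
    target_->index[k] = PendingWrite{false, v};
  }
  void OnDelete(const std::string &k) override { target_->index[k] = PendingWrite{true, ""}; }
  void OnDeleteRange(const std::string &, const std::string &) override {
    ++target_->unindexed_range_deletes;  // no per-key entry to update
  }
  ToyIndexedBatch *target_;
};

// Appends two batches in sequence and returns how many range deletes were not indexed.
int DemoReplay() {
  ToyIndexedBatch indexed;
  AppendToIndexHandler handler(&indexed);

  ToyWriteBatch first;
  first.Put("a", "1");
  first.Put("b", "2");
  Iterate(first, &handler);

  ToyWriteBatch second;
  second.Delete("a");
  second.DeleteRange("b", "c");
  Iterate(second, &handler);

  assert(indexed.index.at("a").is_delete);    // point delete is visible in the index
  assert(!indexed.index.at("b").is_delete);   // the range delete never reached "b"
  return indexed.unindexed_range_deletes;
}
```

Because the handler is applied once per batch, several Writes within one operation are appended to the same index rather than replacing it.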

May 18 '24 03:05 PokIsemaine

Some other questions:

1. What isolation level can we expect if kvrocks requests are processed by multiple threads without using transactions? Serializable, snapshot isolation, or something else?
2. What resources are specifically protected by LockGuard for write operations?

May 18 '24 03:05 PokIsemaine

(2) LockGuard protects the keys an operation touches. Before writing, the operation first collects the keys it intends to write, then locks all of them.

(1) is an interesting problem. I think we're aiming for SI (snapshot isolation). This cannot prevent concurrent read-modify-write sequences on the same key, but it would make a single write operation or multiple write operations see the same snapshot.
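
As a minimal sketch of the "collect the keys, then lock all of them" pattern, the following toy C++ code hashes keys to a fixed pool of mutexes and takes them in sorted order so that two guards with overlapping key sets cannot deadlock. ToyLockManager and ToyMultiLockGuard are hypothetical names, and the hashing and ordering scheme is an assumption for illustration, not a description of the actual kvrocks lock manager.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// Toy lock manager (assumption): keys are hashed onto a fixed pool of mutexes.
class ToyLockManager {
 public:
  explicit ToyLockManager(std::size_t n) : mutexes_(n) {}
  std::size_t Slot(const std::string &key) const {
    return std::hash<std::string>{}(key) % mutexes_.size();
  }
  std::mutex &At(std::size_t slot) { return mutexes_[slot]; }

 private:
  std::vector<std::mutex> mutexes_;
};

// Locks every slot covering the given keys; unlocks them on destruction.
class ToyMultiLockGuard {
 public:
  ToyMultiLockGuard(ToyLockManager *mgr, const std::vector<std::string> &keys) : mgr_(mgr) {
    for (const auto &k : keys) slots_.push_back(mgr_->Slot(k));
    // Sort and deduplicate so each mutex is taken once, always in the same global order.
    std::sort(slots_.begin(), slots_.end());
    slots_.erase(std::unique(slots_.begin(), slots_.end()), slots_.end());
    for (std::size_t s : slots_) mgr_->At(s).lock();
  }
  ~ToyMultiLockGuard() {
    for (auto it = slots_.rbegin(); it != slots_.rend(); ++it) mgr_->At(*it).unlock();
  }
  ToyMultiLockGuard(const ToyMultiLockGuard &) = delete;
  ToyMultiLockGuard &operator=(const ToyMultiLockGuard &) = delete;

  std::size_t LockedSlots() const { return slots_.size(); }

 private:
  ToyLockManager *mgr_;
  std::vector<std::size_t> slots_;
};

// Returns true if duplicate keys were collapsed and the locks were released afterwards.
bool DemoMultiLock() {
  ToyLockManager mgr(16);
  std::size_t locked = 0;
  {
    ToyMultiLockGuard guard(&mgr, {"b", "a", "a"});
    locked = guard.LockedSlots();
  }
  // Would block forever if the guard had not released the slot for "a".
  mgr.At(mgr.Slot("a")).lock();
  mgr.At(mgr.Slot("a")).unlock();
  // Two distinct keys map to one or two slots, depending on hash collisions.
  return locked >= 1 && locked <= 2;
}
```

Such locks serialize writers on the same keys, but they do not by themselves give a reader a stable view across several reads, which is where the fixed snapshot comes in.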

May 18 '24 05:05 mapleFU

For the DeleteRange operation, maybe we need to switch to for + Delete, but I don't know if this will have a big performance impact. If it is the first type, we just add Batch but do not perform Write operation, what will happen?

This should be unacceptable from a performance perspective. For example, it might cause Kvrocks to stop serving requests if it runs the FLUSH[DB|ALL] command with a large number of keys in the DB. To mitigate this, it'd be better to disallow using the DeleteRange operation in the transaction if necessary.

May 23 '24 05:05 git-hulk

I also think maybe we should check how we use DeleteRange. It is always something of a performance hack, since it's a "hack" in the implementation.

May 23 '24 05:05 mapleFU

For the DeleteRange operation, maybe we need to switch to for + Delete, but I don't know if this will have a big performance impact. If it is the first type, we just add Batch but do not perform Write operation, what will happen?

This should be unacceptable from a performance perspective. For example, it might cause Kvrocks to stop serving requests if it runs the FLUSH[DB|ALL] command with a large number of keys in the DB. To mitigate this, it'd be better to disallow using the DeleteRange operation in the transaction if necessary.

Does this impact occur when db_->Write is executed? What if I just use WriteBatchWithIndex to log the operations performed but don't do a Write?

May 23 '24 05:05 PokIsemaine

https://github.com/facebook/rocksdb/wiki/DeleteRange-Implementation

In short, DeleteRange writes an explicit "range" tombstone, and this affects scans, reads, compaction...
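
A toy C++ model of the idea, under heavy simplification: a range delete is recorded as a single tombstone entry, and every later read has to check tombstones before returning a value. ToyStore and its members are hypothetical, and real RocksDB handles range tombstones very differently internally (see the wiki page above).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy store (assumption for illustration, not how RocksDB is implemented).
struct ToyStore {
  struct Entry {
    std::uint64_t seq;
    std::string value;
  };
  struct RangeTombstone {
    std::uint64_t seq;
    std::string begin;  // inclusive
    std::string end;    // exclusive
  };

  std::map<std::string, Entry> points;
  std::vector<RangeTombstone> tombstones;
  std::uint64_t next_seq = 1;
  std::size_t records_written = 0;

  void Put(const std::string &k, const std::string &v) {
    points[k] = Entry{next_seq++, v};
    ++records_written;
  }

  // One record, no matter how many keys fall inside [begin, end).
  void DeleteRange(const std::string &begin, const std::string &end) {
    tombstones.push_back(RangeTombstone{next_seq++, begin, end});
    ++records_written;
  }

  // Reads pay for the tombstone: a value is hidden by any newer covering range.
  bool Get(const std::string &k, std::string *value) const {
    auto it = points.find(k);
    if (it == points.end()) return false;
    for (const auto &t : tombstones) {
      if (t.seq > it->second.seq && t.begin <= k && k < t.end) return false;
    }
    *value = it->second.value;
    return true;
  }
};

// Returns the number of records the range delete itself had to write.
std::size_t DemoDeleteRange() {
  ToyStore store;
  for (int i = 0; i < 1000; ++i) store.Put("k" + std::to_string(i), "v");

  std::size_t before = store.records_written;
  store.DeleteRange("k", "l");
  std::size_t range_cost = store.records_written - before;

  std::string v;
  assert(!store.Get("k0", &v));
  assert(!store.Get("k999", &v));

  store.Put("k1", "new");  // newer than the tombstone, so visible again
  bool ok = store.Get("k1", &v);
  assert(ok && v == "new");
  return range_cost;
}
```

Replacing the range delete with a loop of point deletes would instead write one record per key, which is the cost raised earlier for FLUSH[DB|ALL] on a large DB.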

May 23 '24 05:05 mapleFU

I also think maybe we should check how we use DeleteRange. It is always something of a performance hack, since it's a "hack" in the implementation.

Currently it mainly affects kvrocks2redis, rdb and some test cases.

May 23 '24 05:05 PokIsemaine

Progress synchronization:

Keep the current WriteBatch, then use WriteBatch.Iterator(&handler) when writing, and refer to batch_debugger.h to write a WriteBatch::Handler that appends WriteBatch operations to WriteBatchWithIndex one by one.

I'm trying the first option, but in one specific version of the GitHub Actions run the golang test currently times out, and I'm troubleshooting why. https://github.com/apache/kvrocks/compare/unstable...PokIsemaine:kvrocks:unstable

May 23 '24 05:05 PokIsemaine

@PokIsemaine Sorry, I don't quite understand why we need to care about the WriteBatch. From the issue description, the problem we want to solve is that nested read operations may read from an undetermined snapshot, right? If yes, passing the snapshot via the context is a good solution.

For write operations and transaction mode, the key lock should ensure that the corresponding key is not changed, so it should be fine to leave that as it is.

Please correct me if I missed anything.

May 23 '24 05:05 git-hulk

Sorry, I don't quite understand why we need to care about the WriteBatch. From the issue description, the problem we want to solve is that nested read operations may read from an undetermined snapshot, right? If yes, passing the snapshot via the context is a good solution.

The main issue is framework-level read-after-write, for example a command that issues a write and then reads after it. It's hard to ensure that the command reads its "own workspace" after it may have written some data.

May 23 '24 05:05 mapleFU

I see, thank you. But I can't remember any commands that write and then read, except Lua and transactions.

May 23 '24 06:05 git-hulk

@git-hulk @PokIsemaine I think for now we can first reject "DeleteRange" and check whether the other commands work?

IMO DeleteRange is mostly used in the background. Maybe I can recheck the logic that uses "DeleteRange" in commands.

May 23 '24 06:05 mapleFU

Yeah the current conversation goes beyond the issue title. I think it's related to this project: https://summer-ospp.ac.cn/org/prodetail/249430512?lang=en&list=pro We can have separate issues for it.

May 23 '24 06:05 PragmaTwice

Yeah the current conversation goes beyond the issue title. I think it's related to this project: https://summer-ospp.ac.cn/org/prodetail/249430512?lang=en&list=pro We can have separate issues for it.

Yes, we are discussing this project. I tried changing the title of this issue and grouping it into a new OSPP tracking issue.

May 25 '24 07:05 PokIsemaine

Well done!

Aug 24 '24 12:08 mapleFU

Well done!

Aug 24 '24 12:08 git-hulk
