kvrocks icon indicating copy to clipboard operation
kvrocks copied to clipboard

[BUG] Transaction can't guarantee atomicity

Open ShooterIT opened this issue 3 years ago • 7 comments

Describe the bug The exclusive guard lock ensures you run the transaction in isolation, but it cannot guarantee the atomicity of the EXEC stage. You would like to either have all commands applied to the db or none of them applied. However, if some commands are executed and persisted to rocksdb and the process fails before the tx completes the db sees an intermediate state which is not desired.

Expected behavior Guarantee the atomicity and consistency.

ShooterIT avatar Feb 16 '22 12:02 ShooterIT

Thanks for opening this issue @ShooterIT As of late 2021 RocksDB supports transactions , such that no other threads can see the changes in a transaction until it has been committed. I didnt check the code yet but I assume they already take care of consistency also in case of a failure during the commit phase.

I suggest to use the API they are providing. This would also allow you to remove the global lock you are using to protect the EXEC stage.

Eshcar avatar Feb 16 '22 14:02 Eshcar

Thanks @Eshcar Yes, i know RocksDB supports, but i worry transaction db would break kvrocks db usage, I will rethink it. BTW, initially we just want to implement transaction like Redis

ShooterIT avatar Feb 17 '22 03:02 ShooterIT

@ShooterIT - can you demonstrate a problematic scenario that will break other kvrocks commands when using rocksdb txs? I would be happy to assist in understanding these cases.

Eshcar avatar Feb 17 '22 12:02 Eshcar

@Eshcar Sorry for the slow reply.

Currently, kvrocks directly access db when reading data from db, and write a writebatch when writing data. It doesn't care about rocksdb txs, if we want to use rocksdb txs, there are too many places that is needed to change. And in some places, we directly access DB for checkpoint, WAL..., So i am not sure we could use transaction db easily. If you have ideas, welcome to discuss.

ShooterIT avatar Mar 01 '22 02:03 ShooterIT

Indeed, supporting tx atomicity will create changes to many kvrocks files, if not all. But it has the potential of improving the quality of the code.

Here is my suggestion. At the core of it is the principle of encapsulation:

  1. db_ and storage_ fields of Database class should be private, and not be visible/accessible to all inheriting classes.
  2. Database should include an abstract DbHandler that is responsible for calling db_ and storage_ methods.
  3. SimpleDBHandler and TxDBHandler inherit from DbHandler and implement its API.
  4. When a redis command object is created its dbHandler is set (after calling the c-tor), according to the context.
  5. When running simple commands – set simpleDBHandler; when running a tx, all queued commands in the multi phase are set w the same txDBhandler.

This is the list of methods that reference db_ or storage_ and need to be covered by the handler: db_:

  1. LatestSnapShot ss(db_)
  2. db_->Get(...)
  3. db_->NewIterator(…)

storage_:

  1. LockGuard guard(…)
  2. batch.Put(…)
  3. storage_->Write()
  4. storage_->GetLockManager()
  5. storage_->Delete()
  6. storage_->IsSlotIdEncoded()
  7. storage_->DeleteRange()

Maybe there are some more methods that I missed.

An example usage of the dbHandler would be: [Examples of how to replace the code in inheriting commands, and how to implement the methods:]

  • [ ] Replace db_->Get() with a dbHandler->get()
  • SimpleDbHandler implements this by db_->Get()
  • TxDbHandler implement this by txn_db->Get(), where txn_db is opened by TransactionDB::Open when the transaction begins.
  • [ ] Replace batch.put() with dbHandler->batchPut()
  • SimpleDbHandler implements this by batch.put(), where the batch is only used by this command
  • TxDbHandler implement this by txn_db->Put(), where the same batch is used by all queued command
  • [ ] Replace storage_->Write() with dbHandler->write()
  • SimpleDbHandler implements this by storage_->Write()
  • TxDbHandler implement this by txn->Commit(), where txn is created by txn_db->BeginTransaction(), when the tx begins.

So the code and flow for commands is similar to the current code, the only difference is how specific calls are handled at the db_ level based on the context (tx or simple).

Do you see any inherent limitation in this suggestion? (except for the fact that it requires major refactoring of the code).

Eshcar avatar Mar 02 '22 20:03 Eshcar

Impressive idea, truly thanks @Eshcar

Recently, I always thought that, it will be really cool to support ACID transaction for KV NoSql. Even I also want to support interactive transaction, and we can support more features transaction, such as optional sync, isolation level.

MULTI sync yes/no  isolation  ReadCommitted/RepeatableRead
xxx
xxx
EXEC

sync: we call fsync after transactions, i.e. truly guarantee data persistent in disk isolation: we can easily implement different isolation using rocksdb capacity, of course, we don't support this in the first version.

For code refactor of storage, i ever had a big idea, i think it is not good to directly access db_(pointer of RocksDB) in different data type implementation(redis_zset.cc, redis_hash.cc...), we should not expose it. I think we should provide some simple interfaces in storage layer. such as Put(support multi keys), Get(support multi keys),Del(support multi keys), RangeRead, RangeDelete. Data type implementation layer just call these functions, and we could easily change these function implementation even we can change different storage engine. Of course, we need to change too many places.

Your suggested method is much accessible i think, though it also breaks many places. Recently, i may don't have enough spare time to do this job, i think you can have a try if you want.

ShooterIT avatar Mar 22 '22 14:03 ShooterIT

Thanks @ShooterIT I do not have the capacity to implement/test/benchmark this solution.
I would be happy to work with/guide anyone from the community that is willing to implement this change. Let me update soon if I can allocate alternative resources for this effort.

Eshcar avatar Mar 27 '22 11:03 Eshcar

Hi, @Eshcar @ShooterIT I have an idea to workaround this bug. We can add a new function to create a shared WriteBatch for the Multi-Exec command like the below:

Status Storage::Begin() {
    is_txn_mode = true;
    txn_write_batch = make_shared<*rocksdb::WriteBatch>();
}

rocksdb::WriteBatch *Storage::GetWriteBatch() {
   if (is_txn_mode) {
      return txn_write_batch.get();
   }
   return make_shared<*rocksdb::WriteBatch>();
}

Status Storage::Commit() {
    is_txn_mode = false;
    // write txn WriteBatch to RocksDB
    txn_write_batch.reset();
}

For the Command Exec, we need to explicitly call the Begin() and Commit() to enter and leave the transaction mode. So that Kvrocks will create a new WriteBatch for each writes operation if it's NOT in transaction mode, and use the shared WriteBatch to collect all write operations if it's in the transaction mode.

git-hulk avatar Feb 26 '23 15:02 git-hulk

it is limited, in transaction, there also are some reading operations, we also hope we can read committed writing operations, maybe you can implement searching WAL(rocksdb supports it) to support this feature, but we can't resolve conflicts when we commit transaction. so i prefer to just use rocksdb transaction to implement real transaction.

ShooterIT avatar Feb 27 '23 00:02 ShooterIT

Yes, RocksDB also supports the WriteBatchWithIndex to implement the Read-Your-Own-Write feature, but it needs to involve all read operations, so I didn't plan to do it in the current stage. Does it make sense to make write operations atomic first, then involve read operations?

Refer: https://rocksdb.org/blog/2015/02/27/write-batch-with-index.html

git-hulk avatar Feb 27 '23 02:02 git-hulk

but if you don't commit write(pending in txn_write_batch), you can't Read-Your-Own-Write in normal mode.

ShooterIT avatar Feb 27 '23 02:02 ShooterIT

@ShooterIT Do you mean WriteBatchWithIndex can only be used in transaction mode? I see the MongoDB will search the txn_write_batch first if exists then from the DB, so it should work if it can be used in the normal model.

git-hulk avatar Feb 27 '23 02:02 git-hulk

RocksDB also supports the WriteBatchWithIndex to implement the Read-Your-Own-Write feature, but it needs to involve all read operations, so I didn't plan to do it in the current stage

You said you don't want to use WriteBatchWithIndex currently, so you can't Read-Your-Own-Write. if you use it , of course, it can.

ShooterIT avatar Feb 27 '23 03:02 ShooterIT

oh...I got your point now. What I mean is to implement the atomic write operations first, then involves read operations at the next stage.

git-hulk avatar Feb 27 '23 03:02 git-hulk

you shouldn't ask developers not have Read-Your-Own-Write operations in transactions,

set a b
multi
set a c
get a  // if you don't commit write batch, you will get `b` instead of `c`.
set xxx yyy
...
exec

ShooterIT avatar Feb 28 '23 02:02 ShooterIT

@ShooterIT Yes, we cannot require users to avoid this, what I mean is to separate them into two PRs to make the code review simple.

git-hulk avatar Feb 28 '23 02:02 git-hulk