
Add async or sync mode for entry location index's RocksDB write

Open StevenLuMT opened this issue 3 years ago • 11 comments

Descriptions of the changes in this PR:

Motivation

The bookie's timed flush latency is mainly composed of three pieces:

  1. flush-entrylog: the latency for flushing the entry log
  2. flush-locations-index: the latency for flushing the entry location index, which uses sync mode to flush
  3. flush-ledger-index: the latency for flushing the ledger metadata index (LedgerMetadataIndex), which uses async mode to flush

To reduce the entry location index flush latency (and hence the bookie's overall flush latency), this PR adds an async mode for the entry location index's writes:

  1. sync mode (default): the original logic is unchanged
  2. async mode: must be enabled via configuration; it speeds up writing and works the same way as the bookie's other index, LedgerMetadataIndex

How sync differs from async:

  • sync mode:
    1. create a batch
    2. add entries to the batch
    3. call batch.flush to flush the batch

  • async mode:
    1. just call locationsDb.sync to write the data
    2. RocksDB itself flushes the data on a timer
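The two write paths above can be sketched against a minimal in-memory stand-in. This is an illustrative mock, not the actual BookKeeper `KeyValueStorage`/RocksDB API; the class and method names (`LocationStore`, `putAsync`) are made up for the example:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative mock of a location index store; not the real RocksDB wrapper.
class LocationStore {
    private final Map<String, Long> db = new HashMap<>();

    // Sync mode: stage updates in a batch, then persist them in one flush call.
    class Batch {
        private final List<Map.Entry<String, Long>> staged = new ArrayList<>();

        void put(String key, long location) {
            staged.add(Map.entry(key, location)); // staged only, not yet visible
        }

        void flush() {
            // The durability point in sync mode: all staged entries land at once.
            for (Map.Entry<String, Long> e : staged) {
                db.put(e.getKey(), e.getValue());
            }
            staged.clear();
        }
    }

    Batch newBatch() { return new Batch(); }

    // Async mode: write directly; the store persists in the background
    // on its own schedule, so durability is deferred.
    void putAsync(String key, long location) {
        db.put(key, location);
    }

    Long get(String key) { return db.get(key); }
}
```

The trade-off is visible in the mock: the sync path has a single explicit flush point the bookie can wait on, while the async path returns immediately and relies on the store's background flushing.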

Changes

  1. Add async or sync mode for entry location index RocksDB writes
  2. Add a switch to enable this feature; it is off by default, so the default sync mode is used
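A sketch of what the switch could look like in `conf/bk_server.conf`. The property name below is a placeholder for illustration only, not necessarily the key this PR introduces:

```properties
# Placeholder key; check the PR diff for the actual configuration name.
# sync (default): flush entry locations via a RocksDB batch at checkpoint time.
# async: write directly and let RocksDB persist in the background.
entryLocationIndexWriteMode=sync
```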

StevenLuMT avatar Mar 13 '22 06:03 StevenLuMT

@dlg99 @eolivelli @pkumar-singh @zymap @hangc0276 @lordcheng10 @merlimat If you have time, please help me review it, thank you.

StevenLuMT avatar Mar 13 '22 06:03 StevenLuMT

@hangc0276 @merlimat please take a look

dlg99 avatar Mar 15 '22 23:03 dlg99

You are on your way! I left a suggestion for the configuration entry.

We must also add comprehensive tests

OK, thanks for the review. I will add test cases for this feature.

StevenLuMT avatar Mar 25 '22 07:03 StevenLuMT

You are on your way! I left a suggestion for the configuration entry.

We must also add comprehensive tests

Yeah, I have added two test cases: EntryLocationIndexAsyncTest and EntryLocationIndexSyncTest @eolivelli

StevenLuMT avatar Jun 21 '22 08:06 StevenLuMT

I'm not sure I understand why we need to have 2 different modes here. Do both methods provide the same guarantees or not?

merlimat avatar Jun 22 '22 16:06 merlimat

I'm not sure I understand why we need to have 2 different modes here. Do both methods provide the same guarantees or not?

With multiple replicas, when the user prioritizes write speed and does not need to guarantee that every replica write succeeds, an asynchronous write path like the one LedgerMetadataIndex uses gives the user more choice. @merlimat

StevenLuMT avatar Jun 23 '22 01:06 StevenLuMT

I still cannot see a case where flushing the entry location index (ledgerId, entryId, offset) becomes the bottleneck, compared to:

  1. Journal
  2. Entry log flush (where all the data has to be put on disk)

If it is not the bottleneck, then why reduce the guarantees? (eg: if we miss the async update, we have effectively lost data in the bookie).

merlimat avatar Jun 23 '22 02:06 merlimat

I still cannot see a case where flushing the entry location index (ledgerId, entryId, offset) becomes the bottleneck, compared to:

  1. Journal
  2. Entry log flush (where all the data has to be put on disk)

If it is not the bottleneck, then why reduce the guarantees? (eg: if we miss the async update, we have effectively lost data in the bookie).

Yes, in the entry log flush step. To speed up the flush, we only need the EntryLocationIndex write to succeed under normal operation; machine crashes are not covered. LedgerMetadataIndex likewise writes to RocksDB asynchronously. @merlimat

StevenLuMT avatar Jun 23 '22 07:06 StevenLuMT

In order to speed up the flush, we only need to ensure that the EntryLocationIndex is successfully written under normal circumstances, regardless of machine downtime.

I'm not sure I follow here.

My point is that if flushing entry logs takes 90% and RocksDB takes 10% of the time (I made up these numbers), then adding the risk of losing data to shave time off the 10% portion doesn't make much sense.
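merlimat's argument can be put in Amdahl's-law terms. A minimal sketch using the illustrative 90/10 split from the comment above (made-up numbers, not measurements; the class and method names are mine):

```java
// Amdahl-style bound: even eliminating the index-flush portion entirely
// caps the overall flush speedup by the fraction of time it occupies.
public class FlushSpeedup {
    // indexFraction: share of total flush time spent on the locations index.
    // Returns the maximum overall flush speedup if that share drops to zero.
    public static double maxSpeedup(double indexFraction) {
        return 1.0 / (1.0 - indexFraction);
    }

    public static void main(String[] args) {
        // With a 10% index share, the best case is ~1.11x overall.
        System.out.printf("max flush speedup = %.2fx%n", maxSpeedup(0.10));
    }
}
```

Under these assumed proportions, the upside is bounded at roughly 11%, which is the core of the objection to trading durability for it.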

merlimat avatar Jun 23 '22 18:06 merlimat

OK, I understand what you mean. Let me measure the time taken by, and the proportion of, each of the three parts of the flush. @merlimat

StevenLuMT avatar Jun 24 '22 02:06 StevenLuMT

rerun failure checks

StevenLuMT avatar Sep 25 '22 10:09 StevenLuMT