
[Improvement] trim redundant read when scanning entry log.

thetumbled opened this issue 2 years ago • 2 comments

Description of the changes in this PR: Remove the redundant disk read when scanning the entry log file, which can significantly reduce disk pressure.

Motivation

We have org.apache.bookkeeper.bookie.storage.EntryLogScanner to handle entries when scanning the entry log file; its process method is invoked with each entry's data. There are many implementations of EntryLogScanner for various reasons, but some of them do not need to access all of the data in the entry log file. For example:

  • the scanner implemented in extractEntryLogMetadataByScanning only needs the entrySize field; in that case we do not need to read any of the entry data.
  • the scanner implemented in org.apache.bookkeeper.bookie.InterleavedStorageRegenerateIndexOp#initiate only needs the entryId field; in that case we only need to read the first 16 bytes (ledgerId + entryId).
  • many more similar situations...

So, in some cases we do not need to read the entry data at all when scanning the entry log.
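To make the second bullet concrete, here is a minimal sketch of reading only the fixed-offset header fields of an entry payload. The layout (ledgerId in the first 8 bytes, entryId in the next 8) is taken from the discussion above; the EntryHeader class itself is hypothetical, not part of BookKeeper:

```java
import java.nio.ByteBuffer;

// Hypothetical helper: parse only the header fields a metadata-only scan
// needs, ignoring the rest of the entry data entirely.
class EntryHeader {
    final long ledgerId;
    final long entryId;

    EntryHeader(long ledgerId, long entryId) {
        this.ledgerId = ledgerId;
        this.entryId = entryId;
    }

    // Assumed layout from the discussion: the entry payload begins with
    // ledgerId (8 bytes) followed by entryId (8 bytes).
    static EntryHeader parse(ByteBuffer entryPayload) {
        long ledgerId = entryPayload.getLong(0);
        long entryId = entryPayload.getLong(8);
        return new EntryHeader(ledgerId, entryId);
    }
}
```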

Changes

Add a getLengthToRead method to EntryLogScanner to indicate how much data we need to read, and read only that amount.
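A rough sketch of what the interface change could look like. Only getLengthToRead comes from this PR's description; the accept/process signatures are approximated here (the real interface uses Netty's ByteBuf; java.nio.ByteBuffer is substituted to keep the sketch self-contained), and the default value is an assumption chosen to preserve the old full-read behavior:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Approximate sketch of the proposed EntryLogScanner extension.
interface EntryLogScanner {
    // Existing contract (approximated): decide whether a ledger's entries
    // are of interest to this scanner.
    boolean accept(long ledgerId);

    // Existing contract (approximated): process one entry's data.
    void process(long ledgerId, long offset, ByteBuffer entry) throws IOException;

    // Proposed addition: how many bytes of each entry this scanner needs.
    // e.g. 0 = only the entrySize prefix, 16 = ledgerId + entryId.
    // Defaulting to "everything" keeps existing scanners unchanged.
    default int getLengthToRead() {
        return Integer.MAX_VALUE;
    }
}
```

With a default method, none of the many existing EntryLogScanner implementations need to change; only scanners that can benefit override it.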

thetumbled avatar Dec 21 '23 08:12 thetumbled

Could you help review this PR? Thanks! @eolivelli @dlg99 @hangc0276 @nicoloboschi @shoothzj @zymap @wenbingshen

thetumbled avatar Dec 21 '23 09:12 thetumbled

All the changes in this PR are specific to uncommon cases, occurring only when the entry log file's index is missing.

We have encountered cases where the ledger map in the entry log is missing, possibly because the bookie crashed before flushing the entry log. As long as the corrupted entry log file exists, the bookie will scan it with extractEntryLogMetadataByScanning to generate the EntryLogMetadataMap during every GC cycle, which runs every gcWaitTime milliseconds (default 15 min). The other cases are related to index rebuilding, which is rarely used.

The readFromLogChannel function utilizes BufferedLogChannel, which is a RandomAccessFile channel. When reading data from the file channel, the data is pre-fetched into the OS PageCache, with a default pre-fetch size of 4KB. Even though we only retrieve the entryId, the actual data read from the disk is a minimum of 4KB.
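To illustrate the application-level side of this, here is a self-contained sketch of a header-only positional read from a file channel. The class and method names are hypothetical (this is not BookKeeper's readFromLogChannel); the OS may still prefetch a 4 KB page, but the application asks for, and buffers, only the 16 header bytes instead of the whole entry:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class HeaderOnlyRead {

    // Reads only the 16-byte header (ledgerId, entryId) at the given offset,
    // instead of buffering the whole entry at the application level.
    static long[] readHeader(FileChannel ch, long offset) throws IOException {
        ByteBuffer header = ByteBuffer.allocate(16);
        if (ch.read(header, offset) != 16) {
            throw new IOException("short read at offset " + offset);
        }
        header.flip();
        return new long[] { header.getLong(), header.getLong() };
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("entrylog", ".log");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Write a fake 4 MB entry whose payload starts with
            // ledgerId = 7 and entryId = 42.
            ByteBuffer entry = ByteBuffer.allocate(4 * 1024 * 1024);
            entry.putLong(7L).putLong(42L);
            entry.clear();
            ch.write(entry, 0);

            // Touch only the first 16 bytes of the 4 MB entry.
            long[] ids = readHeader(ch, 0);
            System.out.println("ledgerId=" + ids[0] + " entryId=" + ids[1]);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```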

It is common for a single entry to reach 4 MB (we set the Pulsar client's maximum batch size to 4 MB), so this enhancement can cut disk reads by roughly 99.9%.
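A quick back-of-envelope check of that figure (the class is illustrative only): reading the 4 KB page-cache prefetch unit instead of a full 4 MB entry avoids 1 - 4096/4194304 ≈ 99.9% of the bytes read.

```java
// Illustrative arithmetic for the claimed savings, not BookKeeper code.
class Savings {
    // Percentage of disk read avoided when only `readBytes` are touched
    // instead of a full entry of `entryBytes`.
    static double reductionPercent(long entryBytes, long readBytes) {
        return 100.0 * (1.0 - (double) readBytes / (double) entryBytes);
    }

    public static void main(String[] args) {
        long entry = 4L * 1024 * 1024; // 4 MB max Pulsar batch size
        long read  = 4L * 1024;        // 4 KB minimum disk read (page prefetch)
        System.out.printf("reduction = %.2f%%%n", reductionPercent(entry, read));
    }
}
```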

There is a potential risk associated with this PR: we read only the entryId without validating the entry data. If the file is corrupted, the scan operation will be unable to detect it and will populate the index with an incorrect ledger size. This introduces a high level of risk.

The fault-detection logic is not changed: in the old logic we read entrySize bytes of data and check that we actually read that amount. With the new read types, READ_NOTHING and READ_LEDGER_ENTRY_ID, we still advance the read position in the loop and check that we have read the expected amount of data. I don't think it is a breaking change.
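The argument above can be sketched as one step of a scan loop. This is a hypothetical simplification (the ScanStep class and a 4-byte entrySize prefix are assumptions for illustration, not the actual BookKeeper scan code): however little we read, the position still advances by the full entry size and every read is checked, so a truncated file still fails the check.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Illustrative sketch of one scan step with a capped read length.
class ScanStep {
    // Returns the position of the next entry, or throws if the file is
    // shorter than the entry claims to be.
    static long scanOne(FileChannel ch, long pos, int lengthToRead) throws IOException {
        // Read the assumed 4-byte entrySize prefix, verifying the byte count.
        ByteBuffer sizeBuf = ByteBuffer.allocate(4);
        if (ch.read(sizeBuf, pos) != 4) {
            throw new IOException("short read of entrySize at " + pos);
        }
        sizeBuf.flip();
        int entrySize = sizeBuf.getInt();

        // Read only as much of the entry as the scanner asked for, but still
        // verify we got the expected amount.
        int toRead = Math.min(entrySize, lengthToRead);
        if (toRead > 0) {
            ByteBuffer data = ByteBuffer.allocate(toRead);
            if (ch.read(data, pos + 4) != toRead) {
                throw new IOException("short read of entry at " + pos);
            }
        }
        // Advance past the whole entry regardless of how much we read.
        return pos + 4 + entrySize;
    }
}
```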

thetumbled avatar Dec 25 '23 08:12 thetumbled