rocksdb icon indicating copy to clipboard operation
rocksdb copied to clipboard

Add Compressor interface

Open lucagiac81 opened this issue 5 years ago • 25 comments

This PR refactors compression by introducing a Compressor interface. This is a first step towards supporting compressors as plugins. PR #6717 covers the next step.

Compressor class

The Compressor class defines an interface for each compression algorithm to implement. It is a Customizable class, like other extensible components in RocksDB.

A Compressor has

  • A unique name
  • Compress/Uncompress methods
  • CreateDictionary method (for algorithms supporting dictionaries): the logic to build/train a dictionary is moved here, so future compressors have the option to customize it if necessary
  • Methods to handle processed/digested dictionaries (implemented by zstd, for example)
  • Options: each Compressor can define the options it supports using the Configurable framework

Streaming compression is not included in the Compressor class yet. The plan is to cover that in a separate PR.

Built-in compressors

The existing compression algorithms (referred to as "built-in") are encapsulated in Compressor classes. The classes inherit from BuiltinCompressor, which includes functionality shared by all built-in compressors. Built-in compressors can be referenced by their numeric id (as defined by the CompressionType enum) to ensure backward compatibility. BuiltinCompressor uses the existing CompressionOptions struct as its configurable options.

Compressor lifecycle

For this PR, compression options exposed in the public API are unchanged (exposing Compressor in the public API and options is the scope of PR #6717).

  • The CompressionType and CompressionOptions passed through ColumnFamilyOptions and DBOptions are used to instantiate suitable Compressor instances (this is done in MutableCFOptions and ImmutableDBOptions).
  • The Compressor class keeps track of the instances currently in use, so they can be retrieved and reused.
  • Such Compressor instances are then used in other parts of the code in place of CompressionType and CompressionOptions (e.g., in BlockBasedTableBuilder, BlockBasedTable, FlushJob, Compaction, BlobFileBuilder...).
  • The details of the Compressor used for a block-based table are serialized in the Properties block.
  • When opening the SST file, the info in the Properties block is used to instantiate/retrieve a suitable Compressor for the table. If the compression id for a block doesn't match the table-level Compressor, a suitable Compressor is obtained when reading the block.

lucagiac81 avatar Nov 09 '20 20:11 lucagiac81

Addressed some of the comments above and rebased branch. There are two additional changes from the previous version.

  • compressor.h: added DictCompressionSupported method in Compressor (needed after PR #7619).
  • options_helper.cc: GetSupportedCompressions and GetSupportedDictCompressions iterate over CompressorRegistry types, rather than using compression_type_string_map.

lucagiac81 avatar Nov 17 '20 18:11 lucagiac81

Keep in mind that for us to accept this, the performance penalty would have to be minimal.

pdillinger avatar Mar 31 '22 20:03 pdillinger

Absolutely. Performance is a key requirement for this feature. From the data I've collected so far with db_bench (readrandom and readrandomwriterandom workloads), the performance difference is in general <1% (throughput, p99 latency). I'm collecting more data covering different conditions and additional metrics, so we can have a more accurate picture. Are there specific conditions you'd recommend for performance testing, or do you have a set of benchmarks you already use to validate the effect of any code changes?

lucagiac81 avatar Mar 31 '22 22:03 lucagiac81

fillseq and readrandom with -readonly should suffice, with low block cache hit rate on data blocks. You can do readrandom on the same DB before-and-after to eliminate LSM structure variances. But I would consider 1% overall ops/sec as a significant regression for Meta to absorb for a feature that Meta doesn't need.

pdillinger avatar Apr 04 '22 16:04 pdillinger

I ran fillseq and readrandom benchmarks, comparing the PR against the previous commit on the main branch. Workload parameters: 16-byte keys, 256-byte values, 4kB blocks, ZSTD/LZ4 compression, db on /dev/shm, no block cache.

The relative ops/s differences (PR vs main, mean of 5 runs) are listed below, although they are not statistically significant:

  • fillseq: +0.07% (ZSTD), -0.1% (LZ4)
  • readrandom: -0.18% (ZSTD), +0.51% (LZ4)

The instruction path length is also verified to be matched within 0.2%. As we make changes as part of the review, we can verify performance stays matched with the latest code.

lucagiac81 avatar Apr 18 '22 00:04 lucagiac81

@mrambacher has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar May 17 '22 11:05 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar May 24 '22 03:05 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar May 31 '22 19:05 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jun 03 '22 18:06 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jun 27 '22 01:06 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jul 26 '22 17:07 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jul 27 '22 00:07 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Oct 11 '22 22:10 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Nov 04 '22 22:11 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jan 09 '23 18:01 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Feb 09 '23 18:02 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Apr 27 '23 14:04 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar May 01 '23 18:05 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Oct 05 '23 20:10 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Oct 06 '23 00:10 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jan 16 '24 20:01 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jan 16 '24 22:01 facebook-github-bot

@lucagiac81 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Jan 16 '24 23:01 facebook-github-bot

By the way, we are finally very interested in this line of features. Thanks for working on it!

pdillinger avatar Feb 05 '24 23:02 pdillinger

By the way, we are finally very interested in this line of features. Thanks for working on it!

That's great, thank you! I'll work on the items above and let's continue the review. I'll also rebase, as I see some new conflicts.

lucagiac81 avatar Feb 08 '24 01:02 lucagiac81