
[Bug|Improvement] blobDB suffers heavy write amplification in pure large-value insertion case

Open wolfkdy opened this issue 1 year ago • 4 comments


Expected behavior

Use rocksdb 7.x newly introduced integrated blobdb

  • enable_blob_files=true
  • enable_blob_garbage_collection=true
  • blob_garbage_collection_age_cutoff=0.25

Inserting 20GB of incompressible data with valueSize = 30KB, the write amplification = (iostat throughput) / (insertion throughput) should be about 2 (writing the WAL = 1, writing blob files = 1; other small IOs ignored).

Actual behavior

write amplification is 7 or larger

Why

  • with blob_garbage_collection_age_cutoff=0.25, the compaction iterator keeps rewriting the oldest 25% of blob files
  • blob files with small file numbers tend to be relatively small, because they are more likely to have been generated by memtable->L0 flushes

For these two reasons, blob files are rewritten again and again even though there is actually no garbage (this is a pure-insertion case), causing unacceptable write amplification.

Improvement

A critique of the existing blob GC strategy

Currently, blob GC is essentially FIFO based: a blob file with a larger file number cannot be deleted if another blob file with a smaller file number is still pinned. For example, given two blob files, file1 (id=1, garbageSize=0, fileSize=256MB) and file2 (id=2, garbageSize=255MB, fileSize=256MB), GC should obviously rewrite file2 and leave file1 alone. However, with the current FIFO-based GC, file2 cannot be rewritten until file1 is rewritten. The root cause of the FIFO strategy is information loss: an SST file's metadata records only oldest_blob_file_number. From each SST's oldest_blob_file_number, RocksDB maintains global_min_blob_file_number = min(f.oldest_blob_file_number for f in all sst_files), and only blob files with ids smaller than global_min_blob_file_number can be deleted. This makes any more sophisticated GC strategy useless.
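
To illustrate, a simplified sketch of that bound as described above (this is not actual RocksDB code; SstInfo is a stand-in for the per-SST metadata):

    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <vector>

    struct SstInfo {
      uint64_t oldest_blob_file_number;  // the only blob link recorded per SST today
    };

    // Only blob files numbered below this bound are eligible for deletion,
    // no matter how much garbage the higher-numbered files contain.
    uint64_t GlobalMinBlobFileNumber(const std::vector<SstInfo>& sst_files) {
      uint64_t min_blob = std::numeric_limits<uint64_t>::max();
      for (const auto& sst : sst_files) {
        min_blob = std::min(min_blob, sst.oldest_blob_file_number);
      }
      return min_blob;
    }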

Proposal

Instead of maintaining only oldest_blob_file_number, record all related blob file numbers in each SST's metadata, making it possible for the blob file list to have holes.

FileMeta.linked_blobs(std::unordered_set<uint64_t>)

In ApplyFileAddition, maintain the blob metadata's linked_ssts via LinkSst:

      // Register this SST with every blob file it references.
      for (auto blob_file_number : f->linked_blobs) {
        MutableBlobFileMetaData* const mutable_meta =
            GetOrCreateMutableBlobFileMetaData(blob_file_number);
        if (mutable_meta) {
          mutable_meta->LinkSst(file_number);
        }
      }

In ApplyFileDeletion, maintain the blob metadata's linked_ssts via UnlinkSst:

      // Drop the link from every blob file the deleted SST referenced.
      for (auto blob_file_number : linked_blobs) {
        MutableBlobFileMetaData* const mutable_meta =
            GetOrCreateMutableBlobFileMetaData(blob_file_number);
        if (mutable_meta) {
          mutable_meta->UnlinkSst(file_number);
        }
      }

When merging VersionEdits to generate the current Version, only merge blob files with a non-empty linked_ssts. Blob files with an empty linked_ssts are automatically removed once their metadata's reference count drops to 0.

  // Only carry blob files that are still referenced by at least one SST into
  // the new Version; unreferenced blob files are dropped and eventually deleted.
  static void AddBlobFileIfNeeded(VersionStorageInfo* vstorage, Meta&& meta) {
    assert(vstorage);
    assert(meta);

    // An unreferenced blob file should contain nothing but garbage.
    if (meta->GetLinkedSsts().empty()) {
      assert(meta->GetGarbageBlobCount() >= meta->GetTotalBlobCount());
      return;
    }

    vstorage->AddBlobFile(std::forward<Meta>(meta));
  }

A corresponding garbage-based GC strategy

With the above improvement to blob file lifetime maintenance, we can have much more flexible GC strategies. For example, we can sort blob files by garbage size, pick the top 25%, and rewrite them (see the sketch below). In the pure-insertion case, there is then no need to rewrite files that contain no garbage.
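
A minimal sketch of such a garbage-based picker (not RocksDB code; BlobFileInfo and its fields are placeholders for whatever per-file garbage accounting is available):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct BlobFileInfo {
      uint64_t file_number;
      uint64_t garbage_bytes;  // garbage tracked per blob file
    };

    // Pick the 25% of blob files carrying the most garbage; files with no
    // garbage are never selected, so a pure-insertion workload triggers no
    // rewrites at all.
    std::vector<uint64_t> PickBlobFilesToRewrite(std::vector<BlobFileInfo> files) {
      std::sort(files.begin(), files.end(),
                [](const BlobFileInfo& a, const BlobFileInfo& b) {
                  return a.garbage_bytes > b.garbage_bytes;
                });
      std::vector<uint64_t> picked;
      const size_t limit = files.size() / 4;
      for (size_t i = 0; i < limit && files[i].garbage_bytes > 0; ++i) {
        picked.push_back(files[i].file_number);
      }
      return picked;
    }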

wolfkdy commented Jun 28 '23 03:06

A few things to note here:

  1. BlobDB's GC is integrated into the LSM tree compaction process. It doesn't actually "rewrite blob files" in the sense of iterating through all blobs in a file in one go like e.g. WiscKey; instead, it proactively relocates blobs based on a cutoff as they are encountered during compaction. If your use case is 100% insertions (i.e. no deletes or overwrites), so there is no garbage at all, this proactivity does not make sense and you can simply disable garbage collection.
  2. GC is not done in a FIFO manner: the amount of garbage is tracked on a per-blob file basis, and any blob file that only contains garbage is removed, even if there are lower-numbered blob files with valid blobs in them. In other words, it is already possible to have "holes"; the SST-to-oldest-blob file mapping is not the primary driver of garbage collection.
  3. Storing the full many-to-many SST-to-blob-file mapping could potentially put a lot of stress on the MANIFEST. Fortunately, it is not really necessary for what we primarily use the mapping for (which is to pick files for compactions triggered by blob_garbage_collection_force_threshold).

ltamasi commented Jul 11 '23 19:07

@ltamasi

BlobDB's GC is integrated into the LSM tree compaction process. It doesn't actually "rewrite blob files" in the sense of iterating through all blobs in a file in one go like e.g. WiscKey; instead, it proactively relocates blobs based on a cutoff as they are encountered during compaction. If your use case is 100% insertions (i.e. no deletes or overwrites), so there is no garbage at all, this proactivity does not make sense and you can simply disable garbage collection.

In a real-world database, pure-insertion phases are unpredictable. In db_bench or testing we can turn off garbage collection, but in production user behavior is unpredictable: how can we turn it off while users are only inserting, and turn it back on when they start deleting or updating?

BlobDB's GC is integrated into the LSM tree compaction process.

Yes, it is now. It might be better if users could trigger GC by compacting the associated SST files via CompactFiles: an application could read the garbage distribution from a column family's live metadata and decide which SST files to compact to release the most garbage.
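
Roughly what this could look like with today's API (a rough sketch only; since the full mapping is not available, it can only key off oldest_blob_file_number, which is part of the limitation discussed below):

    #include <algorithm>
    #include <string>
    #include <vector>
    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    // Compact the SSTs whose oldest referenced blob file is the chosen victim,
    // so that blob GC (which runs as part of compaction) gets a chance to
    // relocate that file's live blobs. With only oldest_blob_file_number
    // available in the metadata, this is an approximation.
    rocksdb::Status CompactSstsReferencing(rocksdb::DB* db,
                                           uint64_t victim_blob_file_number) {
      rocksdb::ColumnFamilyMetaData meta;
      db->GetColumnFamilyMetaData(&meta);

      std::vector<std::string> inputs;
      int output_level = 0;
      for (const auto& level : meta.levels) {
        for (const auto& sst : level.files) {
          if (sst.oldest_blob_file_number == victim_blob_file_number) {
            inputs.push_back(sst.name);
            output_level = std::max(output_level, level.level);
          }
        }
      }
      if (inputs.empty()) {
        return rocksdb::Status::OK();
      }
      return db->CompactFiles(rocksdb::CompactionOptions(), inputs, output_level);
    }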

GC is not done in a FIFO manner: the amount of garbage is tracked on a per-blob file basis, and any blob file that only contains garbage is removed, even if there are lower-numbered blob files with valid blobs in them. In other words, it is already possible to have "holes"; the SST-to-oldest-blob file mapping is not the primary driver of garbage collection.

Yes, a blob file containing nothing but garbage can be removed in your implementation. However, a file with even a single byte of non-garbage won't be removed. In real-world cases writes are random, so blob files may contain 99% garbage; it is rare for a file to be 100% garbage.

Storing the full many-to-many SST-to-blob-file mapping could potentially put a lot of stress on the MANIFEST. Fortunately, it is not really necessary for what we primarily use the mapping for (which is to pick files for compactions triggered by blob_garbage_collection_force_threshold).

Yes, storing linked blobs in each SST's VersionEdit may make a particular VersionEdit very large. But consider when and how applications use BlobDB: they use it when values are quite large, and with large values the total number (and size) of SST files is much smaller. In a TB-level database instance with valueSize = 30KB, the total size of the SST files may be less than 1GB.

I hope you will consider this again, or at least leave it as an option that is turned off by default. It is really meaningful in real-world cases.

wolfkdy commented Jul 12 '23 02:07

In a real-world database, pure-insertion phases are unpredictable. In db_bench or testing we can turn off garbage collection, but in production user behavior is unpredictable: how can we turn it off while users are only inserting, and turn it back on when they start deleting or updating?

If necessary, this could be achieved by checking the amount of garbage in the blob files (which is part of the column family metadata), and enabling/disabling GC using SetOptions as needed.
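
For example, something along these lines (a sketch only; the 10% threshold is arbitrary, and it relies on the blob statistics exposed via GetColumnFamilyMetaData):

    #include <cstdint>
    #include <unordered_map>
    #include "rocksdb/db.h"

    // Enable blob GC only once the blob files carry a meaningful amount of
    // garbage, and disable it again during pure-insertion phases.
    rocksdb::Status MaybeToggleBlobGc(rocksdb::DB* db) {
      rocksdb::ColumnFamilyMetaData meta;
      db->GetColumnFamilyMetaData(&meta);

      uint64_t total_bytes = 0;
      uint64_t garbage_bytes = 0;
      for (const auto& blob : meta.blob_files) {
        total_bytes += blob.total_blob_bytes;
        garbage_bytes += blob.garbage_blob_bytes;
      }

      // Arbitrary threshold for illustration: enable GC at >= 10% garbage.
      const bool enable = total_bytes > 0 && garbage_bytes * 10 >= total_bytes;
      return db->SetOptions(
          {{"enable_blob_garbage_collection", enable ? "true" : "false"}});
    }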

Yes, a blob file containing nothing but garbage can be removed in your implementation. However, a file with even a single byte of non-garbage won't be removed. In real-world cases writes are random, so blob files may contain 99% garbage; it is rare for a file to be 100% garbage.

We have plans to improve the write amp/space amp tradeoff but for now, there are a couple of ways of mitigating this: reducing the size of blob files and using the blob_garbage_collection_force_threshold option. (Proactively relocating blobs from old files also helps here.)
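
In option form, those mitigations look roughly like the following sketch (the values are illustrative, not recommendations):

    #include "rocksdb/options.h"

    rocksdb::Options MakeBlobOptions() {
      rocksdb::Options options;
      options.enable_blob_files = true;
      options.enable_blob_garbage_collection = true;
      // Smaller blob files make it more likely that an individual file ends up
      // 100% garbage and can be deleted outright.
      options.blob_file_size = 64 << 20;  // 64 MB instead of the 256 MB default
      // Relocate blobs from the oldest 25% of blob files during compaction.
      options.blob_garbage_collection_age_cutoff = 0.25;
      // If the garbage ratio in that oldest batch exceeds this threshold,
      // schedule targeted compactions of the SSTs that reference those files.
      options.blob_garbage_collection_force_threshold = 0.5;
      return options;
    }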

Yes, storing linked blobs in each SST's VersionEdit may make a particular VersionEdit very large. But consider when and how applications use BlobDB: they use it when values are quite large, and with large values the total number (and size) of SST files is much smaller. In a TB-level database instance with valueSize = 30KB, the total size of the SST files may be less than 1GB.

It's not just a function of the number of SST files; the number of blob files matters as well, in two ways: a) there are blob file specific records in the MANIFEST, and b) if you wanted to store the full mapping, you would have to store a potentially large number of blob file numbers for each SST.

ltamasi commented Jul 12 '23 03:07

@ltamasi thanks for your reply.

Despite the fact that a blob file with 100% garbage (which rarely happens) can be automatically removed, the current GC strategy is still essentially FIFO based. FIFO-based GC is too simplistic to cover the various bad cases that show up in production.

 0.blob(no garbage),  1.blob(no garbage), 2.blob(99% garbage), 3.blob(no garbage)

It's obvious that 2.blob should be removed: 0.blob and 1.blob should remain, and the useful data in 2.blob should be rewritten into, say, 4.blob. But no matter how blob_garbage_collection_force_threshold, blob_garbage_collection_age_cutoff, or other parameters are set, 0.blob and 1.blob still have to be rewritten, which is unnecessary.

Suppose there is some way to get the SSTs linked with 2.blob; we still have no way to rewrite only 2.blob by compacting those SSTs, because the CompactionIterator rewrites blobs based on the age cutoff: if 2.blob is eligible for rewriting, 0.blob and 1.blob are also eligible (which is unnecessary). You mentioned that we could reduce the blob file size, but splitting 2.blob into smaller files may simply give N files that are each 99% garbage, not N-1 files of 100% garbage plus one file of 90% garbage, since garbage tends to be evenly distributed within a file.

I believe it is important to maintain the N-to-M SST-to-blob-file mapping and expose it in the column family's metadata. By checking this mapping and the garbage distribution, users could pass a set of to-rewrite blob file ids in struct CompactionOptions to decide which blob files to rewrite during CompactRange/CompactFiles; the CompactionIterator would rewrite a blob only if its file is in that set. In this way, different users can implement different GC strategies, which is far more flexible.

In the case above, the user checks and sees that 2.blob is 99% garbage. Using the N-to-M SST-to-blob-file mapping, the user finds the SST files linked with 2.blob (these SSTs may also link to other blob files), and calls CompactFiles with to-rewrite blob file ids = {2}. The CompactionIterator then rewrites only 2.blob, making a perfect GC pass.
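
To make this concrete, a hypothetical sketch of the proposed interface (to_rewrite_blob_file_ids does not exist in RocksDB today; it is exactly the field this proposal would add to CompactionOptions):

    #include <string>
    #include <vector>
    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    // Hypothetical usage: rewrite only 2.blob by compacting the SSTs linked to
    // it. linked_ssts would come from the proposed N-to-M mapping exposed in
    // the column family metadata.
    rocksdb::Status RewriteSingleBlobFile(rocksdb::DB* db,
                                          const std::vector<std::string>& linked_ssts,
                                          int output_level) {
      rocksdb::CompactionOptions copts;
      // Proposed (non-existent) field: the CompactionIterator would relocate a
      // blob only if it resides in one of these files, leaving 0.blob and
      // 1.blob untouched.
      copts.to_rewrite_blob_file_ids = {2};
      return db->CompactFiles(copts, linked_ssts, output_level);
    }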

Finally,

Storing the full many-to-many SST-to-blob-file mapping could potentially put a lot of stress on the MANIFEST.

I agree. It could be a switch, turned off by default, to leave the possibility open for users who need it.

wolfkdy commented Jul 12 '23 16:07

Looking forward to improvements to BlobDB GC. It is hard to control write amplification and space amplification in my production environment too.

GOGOYAO commented Aug 02 '24 03:08

Looking forward to improvements to BlobDB GC. It is hard to control write amplification and space amplification in my production environment too.

A many-to-many reference between blob files and SSTs is a proper alternative, and it has already proven successful in our production environment.

wolfkdy commented Aug 05 '24 03:08