
Avoid exception when failing to insert a block into the block cache

Myasuka opened this issue 3 years ago · 9 comments

Expected behavior

If inserting a block into the block cache fails, just abort the insert but return the read result as normal.

Actual behavior

If we use the default ReadOptions with a strict_capacity_limit block cache, RocksDB throws an exception when Insert fails due to the LRU cache being full.
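
For context, such a cache is configured through LRUCacheOptions and plugged into BlockBasedTableOptions. A minimal sketch, assuming the stock RocksDB C++ API; the 64 MB capacity is an illustrative value only:

    #include "rocksdb/cache.h"
    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    rocksdb::Options MakeOptionsWithStrictCache() {
      rocksdb::LRUCacheOptions cache_opts;
      cache_opts.capacity = 64 << 20;            // 64 MB, illustrative only
      cache_opts.strict_capacity_limit = true;   // refuse inserts beyond capacity

      rocksdb::BlockBasedTableOptions table_opts;
      table_opts.block_cache = rocksdb::NewLRUCache(cache_opts);

      rocksdb::Options options;
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_opts));
      return options;
    }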

Steps to reproduce the behavior

This is because the current RetrieveBlock checks the status of MaybeReadBlockAndLoadToCache:

    s = MaybeReadBlockAndLoadToCache(
        prefetch_buffer, ro, handle, uncompression_dict, wait_for_cache,
        block_entry, block_type, get_context, lookup_context,
        /*contents=*/nullptr);

    if (!s.ok()) {
      return s;
    }

And if the insert fails, the block cache reports an Incomplete status:

    if ((usage_ + total_charge) > capacity_ &&
        (strict_capacity_limit_ || handle == nullptr)) {
      e->SetInCache(false);
      if (handle == nullptr) {
        // Don't insert the entry but still return ok, as if the entry inserted
        // into cache and get evicted immediately.
        last_reference_list.push_back(e);
      } else {
        if (free_handle_on_fail) {
          delete[] reinterpret_cast<char*>(e);
          *handle = nullptr;
        }
        s = Status::Incomplete("Insert failed due to LRU cache being full.");
      }
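
At the API surface this shows up as a non-OK status on the read path. A minimal sketch of how a caller could detect it today, assuming an open rocksdb::DB* db and a purely illustrative key name:

    rocksdb::PinnableSlice value;
    rocksdb::Status s = db->Get(rocksdb::ReadOptions(),
                                db->DefaultColumnFamily(), "some_key", &value);
    if (s.IsIncomplete()) {
      // With strict_capacity_limit=true the block was read from disk but
      // could not be inserted into the full cache, so the whole Get() is
      // reported as Incomplete instead of returning the value.
    }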

As more and more applications move to the cloud, fine-grained memory control is required in many use cases. I think enabling the strict capacity limit of the block cache in production environments is valuable. How about this solution:

  1. Introduce a new status code such as FailInsertCache.
  2. Introduce a new field named fill_cache_if_possible in ReadOptions, meaning that if a block cannot be filled into the cache, the FailInsertCache status is not treated as an error and the read continues as normal (see the hypothetical sketch after this list).
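
A hypothetical sketch of how the two pieces could fit together in RetrieveBlock; both ReadOptions::fill_cache_if_possible and Status::IsFailInsertCache() are proposed names from this issue, not existing RocksDB API:

    s = MaybeReadBlockAndLoadToCache(
        prefetch_buffer, ro, handle, uncompression_dict, wait_for_cache,
        block_entry, block_type, get_context, lookup_context,
        /*contents=*/nullptr);

    if (!s.ok()) {
      if (ro.fill_cache_if_possible && s.IsFailInsertCache()) {
        // Proposed behavior: the block was read but could not be cached;
        // swallow the cache failure and keep serving the read.
        s = Status::OK();
      } else {
        return s;
      }
    }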

What do you think of this problem and the proposed solution?

Myasuka avatar Aug 17 '21 08:08 Myasuka

cc @zhichao-cao, @ltamasi, @akankshamahajan15 who might be interested in this topic.

Myasuka avatar Aug 17 '21 09:08 Myasuka

Thanks for the suggestion. So there will be a new error code in Status, such as Status::FailInsertCache()?

zhichao-cao avatar Aug 17 '21 17:08 zhichao-cao

@zhichao-cao Yes, this is just a proposal, and I hope to get your response on whether it is practicable. If so, do you think the RocksDB community could resolve it in this suggested way?

Myasuka avatar Aug 18 '21 02:08 Myasuka

If we use the default ReadOptions with a strict_capacity_limit block cache, RocksDB throws an exception when Insert fails due to the LRU cache being full.

Correction: LRUCacheOptions

My concern is that if you are setting strict_capacity_limit=true on Cache, you are probably interested in limiting the amount of memory used to hold SST blocks. But under your proposal, iterators and active Get operations are allowed to allocate as much as they want beyond this limit, because they ignore when insertion into Cache fails. This is worse than setting strict_capacity_limit=false because at least with strict_capacity_limit=false you can cache hit on everything that is already in memory.

pdillinger avatar Aug 23 '21 22:08 pdillinger

It is a good question though - since the allocation and read already happened, does it make sense to return the value? I think it's not that straightforward. When using the PinnableSlice APIs we don't just fill a value buffer for the caller -- the whole underlying block comes with it. And this is amplified in MultiGet() or iterators.

One alternative we've been discussing is to pre-reserve cache memory before doing the allocation and read. That would prevent the case where we do an allocation and read only to throw it away and fail the operation. But we would still fail the operation -- it would just be an earlier/more efficient way to fail.
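
One way to picture the pre-reservation idea: charge the cache for the block before doing the allocation and read. A hypothetical sketch, assuming the pre-7.x Cache::Insert() signature; ReserveThenRead, NoopDeleter, reservation_key, and expected_charge are illustrative names, and predicting expected_charge is exactly the hard part discussed below:

    #include "rocksdb/cache.h"
    #include "rocksdb/slice.h"
    #include "rocksdb/status.h"

    static void NoopDeleter(const rocksdb::Slice& /*key*/, void* /*value*/) {}

    rocksdb::Status ReserveThenRead(rocksdb::Cache* cache,
                                    const rocksdb::Slice& reservation_key,
                                    size_t expected_charge) {
      rocksdb::Cache::Handle* reservation = nullptr;
      // Charge the cache for the expected block size using a dummy (nullptr)
      // value. Under strict_capacity_limit=true this fails with Incomplete
      // while the cache is full, *before* any allocation or disk read happens.
      rocksdb::Status s = cache->Insert(reservation_key, /*value=*/nullptr,
                                        expected_charge, &NoopDeleter,
                                        &reservation);
      if (!s.ok()) {
        return s;  // fail early, without the wasted allocation and read
      }
      // ... perform the allocation, read and decompression here, then insert
      // the real block entry and drop the dummy reservation (not shown beyond
      // the release below).
      cache->Release(reservation, /*force_erase=*/true);
      return s;
    }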

ajkr avatar Aug 23 '21 22:08 ajkr

Thanks for your feedback!

From my understanding, the current memory control solution in RocksDB cannot control memory very well, just as you pointed out: it first lets the allocation happen and then decides whether to free it after inserting into the cache. This is not perfect, but better than no memory control at all.

I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap. On the other hand, the Apache Flink community heavily depends on this feature in container environments, and making the cache strictly limited could help improve memory control. Given the current status, I think making strict_capacity_limit=true work without failing the get operation could be the low-hanging fruit, and it is not something against the long-term design. What do you think? @ajkr @pdillinger @zhichao-cao

BTW, I think waiting for the LRU cache to evict entries before failing the operation might be something better for long-running applications.

Myasuka avatar Aug 24 '21 03:08 Myasuka

I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap.

I think the big complication is that we can't always predict the size needed for a decompressed block.

Note: there are two other "lookup with intent to insert" things we could consider solving along with this:

  • Allocate and insert an "empty" handle after unsuccessful lookup (with intent to insert), to avoid repeating the hashing & probing between Lookup & Insert.
  • Use the same to allow cooperation between threads intending to insert the same entry. All but one thread would wait for the selected one to finish the corresponding insert (a toy sketch of this pattern follows the list).
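
The second bullet is essentially a "compute once, everyone else waits" pattern. A toy sketch of the placeholder idea, deliberately using a plain map plus condition variable rather than RocksDB's Cache API, just to illustrate the cooperation between threads:

    #include <condition_variable>
    #include <memory>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    struct Entry {
      bool ready = false;   // false while the selected thread is still loading
      std::string value;    // stands in for the cached block
    };

    class PlaceholderCache {
     public:
      // Returns the value for `key`, invoking `loader` at most once even when
      // many threads miss on the same key concurrently.
      template <typename Loader>
      std::string GetOrLoad(const std::string& key, Loader loader) {
        std::unique_lock<std::mutex> lock(mu_);
        auto it = map_.find(key);
        if (it == map_.end()) {
          // Miss with intent to insert: publish an empty placeholder handle so
          // later lookups need not repeat the probing or the load.
          auto entry = std::make_shared<Entry>();
          map_.emplace(key, entry);
          lock.unlock();
          std::string v = loader(key);  // expensive read/decompress, no lock held
          lock.lock();
          entry->value = v;
          entry->ready = true;
          cv_.notify_all();
          return v;
        }
        // Found a real or placeholder entry: wait for the selected inserter.
        auto entry = it->second;
        cv_.wait(lock, [&] { return entry->ready; });
        return entry->value;
      }

     private:
      std::mutex mu_;
      std::condition_variable cv_;
      std::unordered_map<std::string, std::shared_ptr<Entry>> map_;
    };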

pdillinger avatar Jul 20 '22 04:07 pdillinger

I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap.

I think the big complication is that we can't always predict the size needed for a decompressed block.

I thought the size of compressed blocks is typically known -- that is, "compression format 2" writes the size at the beginning.

The size to which an uncompressed block will be compressed is, however, unknown.

I wonder if having every Cache have a MemoryAllocator, and having a smarter allocator (that allocates a Handle and could support "reallocate"), would be helpful...

mrambacher avatar Jul 20 '22 16:07 mrambacher

I thought the size of compressed blocks is typically known -- that is, "compression format 2" writes the size at the beginning.

That doesn't help you predict the uncompressed size when all you have is a BlockHandle to the compressed block.

pdillinger avatar Jul 20 '22 20:07 pdillinger