Avoid exception when a block fails to insert into the block cache
Expected behavior
If inserting a block into the block cache fails, just abort the insert but return the read result as normal.
Actual behavior
If we use the default `ReadOptions` with a `strict_capacity_limit` block cache, RocksDB would throw an exception (`Insert failed due to LRU cache being full.`).
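For context, a minimal sketch of the kind of setup being discussed; the capacity, shard count, and other values here are illustrative placeholders, not from the report:

```cpp
#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptionsWithStrictBlockCache() {
  // Block cache with strict_capacity_limit = true: once the cache is full,
  // new inserts fail instead of letting usage exceed the capacity.
  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(
      /*capacity=*/64 << 20, /*num_shard_bits=*/-1,
      /*strict_capacity_limit=*/true);

  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = cache;

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```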
Steps to reproduce the behavior
This is because the current `RetrieveBlock` checks the status of `MaybeReadBlockAndLoadToCache`:
```cpp
s = MaybeReadBlockAndLoadToCache(
    prefetch_buffer, ro, handle, uncompression_dict, wait_for_cache,
    block_entry, block_type, get_context, lookup_context,
    /*contents=*/nullptr);
if (!s.ok()) {
  return s;
}
```
And if the insert fails, the block cache reports an `Incomplete` status:
```cpp
if ((usage_ + total_charge) > capacity_ &&
    (strict_capacity_limit_ || handle == nullptr)) {
  e->SetInCache(false);
  if (handle == nullptr) {
    // Don't insert the entry but still return ok, as if the entry inserted
    // into cache and get evicted immediately.
    last_reference_list.push_back(e);
  } else {
    if (free_handle_on_fail) {
      delete[] reinterpret_cast<char*>(e);
      *handle = nullptr;
    }
    s = Status::Incomplete("Insert failed due to LRU cache being full.");
  }
}
```
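A small standalone illustration of that status (assuming the classic `Cache::Insert` signature with an explicit deleter, which is what the quoted code corresponds to; the key, sizes, and no-op deleter are made up for the example):

```cpp
#include <iostream>
#include "rocksdb/cache.h"

int main() {
  // Tiny cache with strict_capacity_limit = true, so an oversized insert
  // with a handle requested cannot simply be evicted away.
  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(
      /*capacity=*/1024, /*num_shard_bits=*/0,
      /*strict_capacity_limit=*/true);

  static char value[2048];  // stand-in for a block's contents
  auto noop_deleter = [](const rocksdb::Slice& /*key*/, void* /*value*/) {};

  // Charge exceeds capacity: with strict_capacity_limit and a non-null
  // handle, Insert returns
  // Status::Incomplete("Insert failed due to LRU cache being full.").
  rocksdb::Cache::Handle* handle = nullptr;
  rocksdb::Status s = cache->Insert("key", value, /*charge=*/sizeof(value),
                                    noop_deleter, &handle);
  std::cout << s.ToString() << std::endl;
  return 0;
}
```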
As more and more applications move to the cloud, fine-grained memory control is required in many use cases. I think enabling the strict capacity limit of the block cache in production environments would be valuable. How about this solution:
- Introduce a new status, `FailInsertCache`.
- Introduce a new field named `fill_cache_if_possible` in `ReadOptions`, meaning that if we cannot fill a block into the cache, we would not treat the `FailInsertCache` status as a problem and would continue the read process as normal (a rough sketch follows this list).
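A rough, purely hypothetical sketch of how the `RetrieveBlock` check quoted above might change under this proposal. `Status::FailInsertCache`, `IsFailInsertCache()`, and `ReadOptions::fill_cache_if_possible` do not exist in RocksDB; the sketch also assumes `MaybeReadBlockAndLoadToCache` would still hand back the block it read even when caching it failed:

```cpp
// Hypothetical: none of the new names introduced here exist in RocksDB today.
s = MaybeReadBlockAndLoadToCache(
    prefetch_buffer, ro, handle, uncompression_dict, wait_for_cache,
    block_entry, block_type, get_context, lookup_context,
    /*contents=*/nullptr);
if (!s.ok()) {
  if (ro.fill_cache_if_possible && s.IsFailInsertCache()) {
    // The block was read successfully but could not be inserted into the
    // cache; treat the read as successful and continue without caching.
    s = Status::OK();
  } else {
    return s;
  }
}
```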
What do you think of this problem and the proposed solution?
cc @zhichao-cao, @ltamasi, @akankshamahajan15, who might be interested in this topic.
Thanks for the suggestion. So there would be a new error code in Status, such as Status::FailInsertCache()?
@zhichao-cao Yes, this is just a proposal, and I hope to get your response to see whether it is practicable. If so, do you think the RocksDB community could resolve it in this suggested way?
> If we use the default `ReadOptions` with a `strict_capacity_limit` block cache, RocksDB would throw an exception (`Insert failed due to LRU cache being full.`).

Correction: `LRUCacheOptions`
My concern is that if you are setting strict_capacity_limit=true on Cache, you are probably interested in limiting the amount of memory used to hold SST blocks. But under your proposal, iterators and active Get operations are allowed to allocate as much as they want beyond this limit, because they ignore when insertion into Cache fails. This is worse than setting strict_capacity_limit=false because at least with strict_capacity_limit=false you can cache hit on everything that is already in memory.
It is a good question though - since the allocation and read already happened, does it make sense to return the value? I think it's not that straightforward. When using the PinnableSlice APIs we don't just fill a value buffer for the caller -- the whole underlying block comes with it. And this is amplified in MultiGet() or iterators.
One alternative we've been discussing is to pre-reserve cache memory before doing the allocation and read. That would prevent the case where we do an allocation and read only to throw it away and fail the operation. But we would still fail the operation -- it would just be an earlier/more efficient way to fail.
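A self-contained toy model of that reserve-before-allocate ordering (not RocksDB code; `SimpleBudget` and everything in it are invented for illustration): the operation fails before any allocation or read happens, rather than after.

```cpp
#include <atomic>
#include <cstddef>
#include <iostream>
#include <vector>

// Stand-in for a strictly limited cache budget.
class SimpleBudget {
 public:
  explicit SimpleBudget(size_t capacity) : capacity_(capacity), usage_(0) {}

  // Reserve `charge` bytes up front; fail immediately if the strict limit
  // would be exceeded -- the "earlier/more efficient way to fail".
  bool TryReserve(size_t charge) {
    size_t current = usage_.load();
    while (current + charge <= capacity_) {
      if (usage_.compare_exchange_weak(current, current + charge)) {
        return true;
      }
    }
    return false;
  }

  void Release(size_t charge) { usage_.fetch_sub(charge); }

 private:
  const size_t capacity_;
  std::atomic<size_t> usage_;
};

int main() {
  SimpleBudget budget(/*capacity=*/4096);
  const size_t block_size = 8192;  // larger than the budget

  if (!budget.TryReserve(block_size)) {
    // Fail before allocating or reading anything.
    std::cout << "reservation failed; skip allocation and read\n";
    return 0;
  }
  std::vector<char> block(block_size);  // allocate + (pretend to) read
  // ... hand the block and its reservation over to the cache entry ...
  budget.Release(block_size);
  return 0;
}
```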
Thanks for your feedback!
From my understanding, the current memory control solution in RocksDB cannot control memory very well, just as you pointed out. It first lets the allocation happen and then decides whether to free it after inserting into the cache. This is not perfect, but it is better than no memory control at all.
I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap. On the other hand, the Apache Flink community heavily depends on this feature in container environments, and making the cache strictly limited would help improve memory control. Given the current status, I think making strict_capacity_limit=true work without failing the get operation could be the low-hanging fruit, and I don't think it goes against the long-term design. What do you think? @ajkr @pdillinger @zhichao-cao
BTW, I think waiting for the LRU cache to evict entries before failing the operation might be something better for long-running applications.
> I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap.

I think the big complication is that we can't always predict the size needed for a decompressed block.
Note: there are two other "lookup with intent to insert" things we could consider solving along with this (a toy sketch follows the list):
- Allocate and insert an "empty" handle after an unsuccessful lookup (with intent to insert), to avoid repeating the hashing & probing between Lookup and Insert.
- Use the same mechanism to allow cooperation between threads intending to insert the same entry: all but one thread would wait for the selected one to finish the corresponding insert.
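A self-contained toy sketch of both ideas (not RocksDB code; `ToyCache`, `Entry`, and the method names are invented for illustration): a miss installs a placeholder under the table lock, the creating thread fills and publishes it, and other threads intending to insert the same key wait instead of duplicating the work.

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct Entry {
  std::string value;
  bool ready = false;
};

class ToyCache {
 public:
  // Returns the entry for `key`; `inserted` tells the caller whether it just
  // created the placeholder and is therefore responsible for filling it.
  std::shared_ptr<Entry> LookupWithIntentToInsert(const std::string& key,
                                                  bool* inserted) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = map_.find(key);
    if (it != map_.end()) {
      *inserted = false;
      return it->second;
    }
    auto e = std::make_shared<Entry>();
    map_.emplace(key, e);
    *inserted = true;  // caller must fill and publish this entry
    return e;
  }

  // The selected thread fills the entry and wakes any waiters.
  void Publish(const std::shared_ptr<Entry>& e, std::string value) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      e->value = std::move(value);
      e->ready = true;
    }
    cv_.notify_all();
  }

  // Other threads block until the corresponding insert has finished.
  void WaitReady(const std::shared_ptr<Entry>& e) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return e->ready; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::unordered_map<std::string, std::shared_ptr<Entry>> map_;
};
```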
> I agree that RocksDB should be refactored to consider memory allocation first; however, that might take a while, and I don't know whether it is already on the current roadmap.

> I think the big complication is that we can't always predict the size needed for a decompressed block.
I thought the size of compressed blocks is typically known -- that is "compression format 2" which writes the size at the beginning.
The size to which an uncompressed block will compress, however, is unknown.
I wonder if having every Cache have a MemoryAllocator, and having a smarter allocator (that allocates a Handle and could support "reallocate"), would be helpful...
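For reference, a minimal sketch of plugging a custom allocator into the block cache via the existing `MemoryAllocator` interface; the class name below is made up, and the "reallocate" support mentioned above would be an extension that does not exist today:

```cpp
#include <cstdlib>
#include <memory>
#include "rocksdb/cache.h"
#include "rocksdb/memory_allocator.h"

// Trivial pass-through allocator; a smarter one could track or reshape
// allocations made for block-cache entries.
class PassThroughAllocator : public rocksdb::MemoryAllocator {
 public:
  const char* Name() const override { return "PassThroughAllocator"; }
  void* Allocate(size_t size) override { return malloc(size); }
  void Deallocate(void* p) override { free(p); }
};

std::shared_ptr<rocksdb::Cache> MakeCacheWithAllocator() {
  rocksdb::LRUCacheOptions opts;
  opts.capacity = 64 << 20;
  opts.strict_capacity_limit = true;
  opts.memory_allocator = std::make_shared<PassThroughAllocator>();
  return rocksdb::NewLRUCache(opts);
}
```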
> I thought the size of compressed blocks is typically known -- that is "compression format 2", which writes the size at the beginning.

That doesn't help you predict the uncompressed size when all you have is a `BlockHandle` to the compressed block.