risinglight
storage: use `waitmap` for cache
Currently, we are using moka as our block cache. But using such a library poses several problems:

- (1) Memory usage is not bounded. When an item is evicted from the cache, it might still be used by the database system, e.g., someone might still hold an `Arc` to a block.
- (2) Doesn't support fast-path concurrent reads. When multiple queries access the same block and the block is not present in memory, the `Column` struct will fetch the same block multiple times.
Therefore, we should build our own block cache layer to avoid these problems.

Recently, I came across https://boats.gitlab.io/blog/post/waitmap/, which seems to be able to solve (2). With some modification to the block struct, we might also be able to solve (1).
For problem (1), would `Weak` be helpful?
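For what it's worth, here is a minimal sketch of the `Weak` idea (all names, like `WeakBlockCache`, are hypothetical, not risinglight's actual API). By itself a `Weak`-based map doesn't bound memory, but it lets the cache drop its reference on eviction without invalidating readers that still hold an `Arc<Block>`, and lets it detect when a block is truly gone:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Weak};

type BlockId = u64;

struct Block {
    data: Vec<u8>,
}

// Hypothetical cache: stores Weak pointers, so a block whose strong
// references have all been dropped can be collected, while blocks
// still held by readers (via Arc) stay alive without the cache
// keeping an extra strong reference to them.
struct WeakBlockCache {
    map: HashMap<BlockId, Weak<Block>>,
}

impl WeakBlockCache {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    // Returns the block if anyone still holds it, by upgrading
    // the Weak pointer back to an Arc.
    fn get(&self, id: BlockId) -> Option<Arc<Block>> {
        self.map.get(&id).and_then(Weak::upgrade)
    }

    fn insert(&mut self, id: BlockId, block: &Arc<Block>) {
        self.map.insert(id, Arc::downgrade(block));
    }

    // Drop entries whose blocks are no longer referenced anywhere.
    fn gc(&mut self) {
        self.map.retain(|_, weak| weak.strong_count() > 0);
    }
}

fn main() {
    let mut cache = WeakBlockCache::new();
    let block = Arc::new(Block { data: vec![0u8; 4096] });
    cache.insert(1, &block);
    assert!(cache.get(1).is_some()); // a reader still holds the Arc
    drop(block);
    assert!(cache.get(1).is_none()); // no strong refs left
    cache.gc();
    assert!(cache.map.is_empty());
}
```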
I think the cache layer is similar to the concept of a buffer pool in the traditional database sense. The techniques for building a buffer pool can be used here, too: acquire, pin, and replace.

- Acquire: a thread asks the buffer pool for the block, and the buffer pool takes care of reading it from disk.
- Pin: the buffer pool always increases the pin count on a block before returning it to the thread, and the thread must unpin it after the read finishes.
- Replace: the buffer pool can use a clock or LRU strategy to find a proper block to evict when memory is insufficient. Any block with a pin count greater than 0 cannot be evicted. In read-only mode eviction is very simple; otherwise, dirty blocks must be persisted to disk before eviction.

The CMU database group has an excellent online course on YouTube about this topic.
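The acquire/pin/replace cycle described above can be sketched in single-threaded, non-async Rust. `Frame`, `BufferPool`, and the 4 KiB page size here are placeholders for illustration, not risinglight code:

```rust
use std::collections::HashMap;

type BlockId = u64;

// A frame in the pool: pin_count > 0 means some thread is still
// reading the block, so the replacer must skip it.
struct Frame {
    data: Vec<u8>,
    pin_count: u32,
    dirty: bool,
}

struct BufferPool {
    capacity: usize,
    frames: HashMap<BlockId, Frame>,
}

impl BufferPool {
    fn new(capacity: usize) -> Self {
        Self { capacity, frames: HashMap::new() }
    }

    // Acquire: return the block, "reading it from disk" on a miss.
    // Pin: the pin count is bumped before the caller sees the frame.
    fn acquire(&mut self, id: BlockId) -> Option<&mut Frame> {
        if !self.frames.contains_key(&id) {
            if self.frames.len() >= self.capacity && !self.evict_one() {
                return None; // every frame is pinned
            }
            // Stand-in for a real disk read.
            let frame = Frame { data: vec![0u8; 4096], pin_count: 0, dirty: false };
            self.frames.insert(id, frame);
        }
        let frame = self.frames.get_mut(&id).unwrap();
        frame.pin_count += 1;
        Some(frame)
    }

    fn unpin(&mut self, id: BlockId, dirty: bool) {
        if let Some(frame) = self.frames.get_mut(&id) {
            frame.pin_count -= 1;
            frame.dirty |= dirty;
        }
    }

    // Replace: evict any unpinned frame. A real pool would use a
    // clock/LRU policy and write dirty frames back to disk first.
    fn evict_one(&mut self) -> bool {
        let victim = self
            .frames
            .iter()
            .find(|(_, frame)| frame.pin_count == 0)
            .map(|(&id, _)| id);
        match victim {
            Some(id) => {
                self.frames.remove(&id);
                true
            }
            None => false,
        }
    }
}

fn main() {
    let mut pool = BufferPool::new(1);
    assert!(pool.acquire(1).is_some());
    assert!(pool.acquire(2).is_none()); // frame 1 is pinned, pool is full
    pool.unpin(1, false);
    assert!(pool.acquire(2).is_some()); // frame 1 evicted to make room
}
```

The key invariant, as in the course material, is that `evict_one` never touches a frame with a nonzero pin count.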
Exactly! The block cache in LSM systems is the same as a buffer pool. But it's hard to implement a buffer pool in Rust the same way as in C++ (like BusTub). The two challenges are:

- How to make an async buffer pool? `pin` and `unpin` might wait for a buffer slot to become available, and it seems nontrivial to implement a high-performance one in Rust.
- How to represent ownership of a slot in the buffer pool? If we take a `&[u8]` from the buffer pool, everything using the pool needs a lifetime in Rust.
To represent a block/page, we can use `Arc<RwLock<PageImpl>>`:

```rust
struct Page(Arc<RwLock<PageImpl>>);

struct PageImpl {
    data: [u8; PAGE_SIZE as usize],
    pin_count: u32,
    dirty: bool,
}
```
To design an async buffer pool, my thought is to:

- design an async interface that returns a channel-like receiver; the caller polls the receiver to get the page.
- have a map maintaining in-flight read requests, e.g. mappings of `BlockID` to `Vec<Sender<Page>>`.
- for each request, create a channel-like structure, add a callback to the disk IO that proxies the result to the `Sender`, and return the `Receiver` to the caller so that the caller can `recv().await` it:
```rust
enum PageResp {
    Ready(Page),
    Wait(Receiver<Page>),
}

impl BufferPool {
    fn fetch_page(&self, page_id: PageID) -> PageResp {
        // First check whether the page is in memory; if yes, just return it.
        if let Some(page) = self.page_in_mem(page_id) {
            return PageResp::Ready(page);
        }
        // Create a one-time channel for the async operation.
        // The caller can use rx.recv().await to wait for the IO to finish.
        let (mut tx, rx) = onetime_channel();
        loop {
            match self.merge_sender_to_inflight(page_id, tx) {
                Ok(_) => {
                    // The sender was added to the inflight list; the background
                    // task will send the result, so just return rx.
                    return PageResp::Wait(rx);
                }
                Err(tx_fail_merge) => {
                    // No inflight request for this page_id yet.
                    match self.insert_to_inflight(page_id, vec![tx_fail_merge]) {
                        Ok(_) => {
                            // Start a background task to await the disk manager IO
                            // and send the result to all waiting callers.
                            runtime::spawn(async {
                                let page = self.disk_manager.fetch_page().await;
                                let senders = self.remove_inflight(page_id);
                                for sender in senders {
                                    sender.send(page.clone()).await;
                                }
                            })
                            .detach();
                            return PageResp::Wait(rx);
                        }
                        Err(tx_fail_insert) => {
                            // Insertion failed because another thread inserted the
                            // same block concurrently; just retry.
                            tx = tx_fail_insert;
                        }
                    }
                }
            }
        }
    }
}
```
I think `PageResp` is actually annoying and could be removed. We can make the `fetch_page` method async, await inside it, and change the return type to `Result<Page, Error>`, which is more convenient to use:

```rust
async fn fetch_page(&self, page_id: PageID) -> Result<Page, Error>;
```
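The in-flight deduplication behind such a `fetch_page` is easiest to demonstrate without committing to a particular async runtime, so here is a thread-based analogue of the same idea using a `Condvar` (all types here are hypothetical): the first caller for a page id performs the read, and concurrent callers for the same id wait instead of issuing duplicate reads.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

type PageId = u64;
type Page = Arc<Vec<u8>>;

struct PoolState {
    pages: HashMap<PageId, Page>,
    inflight: HashSet<PageId>,
}

// Hypothetical pool: the pages map and the inflight set are guarded
// by one mutex, so at most one caller ever reads a given page id.
struct BufferPool {
    state: Mutex<PoolState>,
    cv: Condvar,
    reads_issued: Mutex<u32>, // counts simulated disk reads, for the demo
}

impl BufferPool {
    fn new() -> Self {
        BufferPool {
            state: Mutex::new(PoolState { pages: HashMap::new(), inflight: HashSet::new() }),
            cv: Condvar::new(),
            reads_issued: Mutex::new(0),
        }
    }

    fn fetch_page(&self, id: PageId) -> Page {
        let mut guard = self.state.lock().unwrap();
        loop {
            if let Some(page) = guard.pages.get(&id) {
                return page.clone(); // fast path: already cached
            }
            if guard.inflight.contains(&id) {
                // Another caller is already reading this page; wait for it.
                guard = self.cv.wait(guard).unwrap();
                continue;
            }
            // We are the first caller: mark the read in-flight and do the
            // (simulated) IO outside the lock.
            guard.inflight.insert(id);
            drop(guard);
            let page = self.read_from_disk(id);
            let mut guard = self.state.lock().unwrap();
            guard.pages.insert(id, page.clone());
            guard.inflight.remove(&id);
            self.cv.notify_all();
            return page;
        }
    }

    fn read_from_disk(&self, id: PageId) -> Page {
        *self.reads_issued.lock().unwrap() += 1;
        thread::sleep(Duration::from_millis(20)); // simulate IO latency
        Arc::new(vec![id as u8; 4096])
    }
}

fn main() {
    let pool = Arc::new(BufferPool::new());
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let pool = pool.clone();
            thread::spawn(move || pool.fetch_page(1))
        })
        .collect();
    for handle in handles {
        assert_eq!(handle.join().unwrap().len(), 4096);
    }
    // All eight callers shared a single disk read.
    assert_eq!(*pool.reads_issued.lock().unwrap(), 1);
}
```

In the async version, the `Condvar` wait would become awaiting a channel receiver (or a `waitmap` entry), but the invariant is the same: the pages map and the in-flight set are updated under one lock, so duplicate reads cannot be issued.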
> - (2) Doesn't support fast-path concurrent reads. When multiple queries access the same block and the block is not present in memory, the `Column` struct will fetch the same block multiple times.

... is already solved with https://github.com/risinglightdb/risinglight/pull/556
So the current problem is to control memory usage of our system... Maybe we need to implement a real BufferPool now 🤣