horaedb
horaedb copied to clipboard
Refactor write method in WalManager to improve performance
Description
After #159, we pass dyn Payload
to write method of WalManager
, this may has performance issue.
It is better to decouple encoder from write method.
Proposal
We can define WalEncoder
(maybe other name), like:
pub enum WalEncoder {
Rocks(RocksLogEncoding),
Obkv(ObkvLogEncoding)
}
impl WalEncoder {
pub fn encode_to_write_batch<P:Payload>(&self, log_batch: &LogWriteBatch<P>) -> Result<(WriteBatch, u64)> {
match self {
WalEncoder::Rocks(encoder) => encoder.encode(payload),
WalEncoder::Obkv(encoder) => encoder.encode(payload),
}
}
}
Encoder
can get from walManager
.
pub trait WalManager {
fn encoder(&self) -> Encoder;
fn write(&self, wb: WriteBatch) -> Result<()>,
}
So we need encode data before write to WalManager
.
let (wb, max_sequence_num) = self.wal.encoder().encode::<XXXPayload>(log_batch)?;
self.wal.write(wb);
...
Additional context
cc: @waynexia @ShiKaiWi
Thanks for writing this up. Here are some thoughts:
WAL backend like RocksDB
or OBKV
is only responsible for persisting things. I think we need not dispatch the data encoding method on the backend. I.e., WAL's write method should operate on bytes (Vec<u8>
or dyn Payload
) rather than a specific data type (what we used to do). However, Payload::encode_to()
is writing bytes to a mutable buffer, this lazy materialization may reduce some copies compare to pre-encode the data.
So the trade offs are: (a) accept the dynamic dispatch overhead and achieve lazy materialization or (b) remove dynamic dispatch and do encode work in advance.
At a glance both are all not large consumption. But I cannot figure out which is more efficient if I have to rank them. Maybe we need a benchmark to help us to made decision. Personally I prefer the second approach (also the way you proposed). It looks more neat to me 👀
However, Payload::encode_to() is writing bytes to a mutable buffer, this lazy materialization may reduce some copies compare to pre-encode the data.
Yes, aggree. pre-encode
has this drawback.
Let me try to add a benchmark first.
At a glance both are all not large consumption. But I cannot figure out which is more efficient if I have to rank them. Maybe we need a benchmark to help us to made decision. Personally I prefer the second approach (also the way you proposed). It looks more neat to me 👀
Provided a high throughout reducing some extra copies may make a difference to the performance. However, let the benchmark decide.
Thanks @ygf11 .
It seems the two LogEncoding
are almost the same, can we use a common LogEncoding
?
https://github.com/CeresDB/ceresdb/blob/8cccedd01b2ac33768a6678f133e51a394468bda/wal/src/rocks_impl/encoding.rs#L18-L24
https://github.com/CeresDB/ceresdb/blob/8cccedd01b2ac33768a6678f133e51a394468bda/wal/src/table_kv_impl/encoding.rs#L100-L106
It seems the two
LogEncoding
are almost the same, can we use a commonLogEncoding
?https://github.com/CeresDB/ceresdb/blob/8cccedd01b2ac33768a6678f133e51a394468bda/wal/src/rocks_impl/encoding.rs#L18-L24
https://github.com/CeresDB/ceresdb/blob/8cccedd01b2ac33768a6678f133e51a394468bda/wal/src/table_kv_impl/encoding.rs#L100-L106
Sure. It seems the LogEncoding
in the table_kv_impl/encoding.rs
is a copy from rocks_impl/encoding.rs
.