rkv
Bincode::serialize generates much bigger results on String types
I noticed this while investigating this TODO item. The current serialization mechanism (serializing a two-element tuple, i.e. (type, value)) seems to introduce significant overhead for String values.
Here are some examples:
serialize(&(1u8, true)).len() -> 2 // actual size: 2
serialize(&(2u8, 1e+9)).len() -> 9 // actual size: 9 (1 + 8)
serialize(&(3u8, "hello world".to_string())).len() -> 20 // actual size: 12 (1 + 11)
serialize(&(4u8, "4dd69e99-07e7-c040-a514-ccde0cfd4781".to_string())).len() -> 45 // actual size: 37 (1 + 36)
I'm unsure whether this is caused by padding or by the serialization itself, but I think it's worth further investigation.
Alternatively, we could write the type and value directly to a buffer, then pass the result to the put function. For big values, we can avoid the double allocation by leveraging LMDB's MDB_RESERVE flag, which reserves enough space for the value and returns the buffer so that the caller can populate it afterwards. The following snippet illustrates the basic idea:
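fn put(&self, key: &[u8], value_type: u8, value: &[u8]) {
    // say BIG_VALUE_THRESHOLD = 32
    // Use the payload length here; size_of_value would report the size of the
    // handle (pointer, length, capacity) for heap-backed types such as String.
    let length = value.len() + 1; // value size + type size
    if length <= BIG_VALUE_THRESHOLD {
        // Small value: assemble it in a stack buffer, then copy it in one put.
        let mut buf = [0u8; BIG_VALUE_THRESHOLD];
        buf[0] = value_type;
        buf[1..length].copy_from_slice(value);
        self.txn.put(key, &buf[..length]);
    } else {
        // Big value: reserve() stands in for an MDB_RESERVE-style put that
        // returns the writable buffer, so the value is written in place.
        let reserved_buf = self.txn.reserve(key, length);
        reserved_buf[0] = value_type;
        reserved_buf[1..].copy_from_slice(value);
    }
}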
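To make the idea above concrete, here is a minimal runnable sketch. MockTxn, its put and reserve methods, and put_tagged are hypothetical names invented for illustration; they are not the rkv or LMDB API, and a HashMap merely stands in for the database. The reserve method only imitates what an MDB_RESERVE-style put would do.

```rust
use std::collections::HashMap;

const BIG_VALUE_THRESHOLD: usize = 32;

/// Hypothetical stand-in for a write transaction (assumption, not rkv's type).
#[derive(Default)]
struct MockTxn {
    store: HashMap<Vec<u8>, Vec<u8>>,
}

impl MockTxn {
    /// Copies an already-assembled value into the store.
    fn put(&mut self, key: &[u8], data: &[u8]) {
        self.store.insert(key.to_vec(), data.to_vec());
    }

    /// Imitates an MDB_RESERVE-style put: allocates `len` bytes for `key`
    /// and hands the buffer back for the caller to fill in.
    fn reserve(&mut self, key: &[u8], len: usize) -> &mut [u8] {
        self.store.insert(key.to_vec(), vec![0u8; len]);
        self.store.get_mut(key).expect("just inserted").as_mut_slice()
    }
}

/// Writes the type tag followed by the raw value bytes, with no length prefix.
fn put_tagged(txn: &mut MockTxn, key: &[u8], tag: u8, value: &[u8]) {
    let length = value.len() + 1; // value size + type size
    if length <= BIG_VALUE_THRESHOLD {
        // Small value: assemble on the stack, then copy once into the store.
        let mut buf = [0u8; BIG_VALUE_THRESHOLD];
        buf[0] = tag;
        buf[1..length].copy_from_slice(value);
        txn.put(key, &buf[..length]);
    } else {
        // Big value: write straight into the reserved space.
        let reserved = txn.reserve(key, length);
        reserved[0] = tag;
        reserved[1..].copy_from_slice(value);
    }
}
```

With this layout, "hello world" occupies 12 bytes and the 36-character UUID occupies 37. No length prefix is needed on read, because the store already reports the size of each record.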
Strings have to serialize their length. The length is stored as a 64-bit integer, therefore the additional 8 bytes. This simplifies deserialization.
Also see #109 which would mean the user is responsible for any serialization/deserialization.
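The length-prefix explanation accounts for the numbers reported at the top of the thread, which a hand-rolled sketch can reproduce. The functions below do not call bincode; encode_tagged_str and decode_tagged_str are hypothetical helpers, and the little-endian byte order is an assumption (the comment above only states that the length is a 64-bit integer).

```rust
/// Mimics the layout described above: a 1-byte type tag, the string length
/// as a fixed 8-byte integer (little-endian is assumed here), then the
/// UTF-8 bytes. Hand-rolled illustration, not a call into bincode.
fn encode_tagged_str(tag: u8, s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + 8 + s.len());
    out.push(tag);
    out.extend_from_slice(&(s.len() as u64).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
    out
}

/// Reads the value back; the length prefix tells the decoder where the
/// string ends, which is what makes deserialization simple.
fn decode_tagged_str(buf: &[u8]) -> Option<(u8, &str)> {
    let (&tag, rest) = buf.split_first()?;
    if rest.len() < 8 {
        return None;
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&rest[..8]);
    let len = u64::from_le_bytes(len_bytes) as usize;
    // `get` returns None instead of panicking if the buffer is truncated.
    let bytes = rest[8..].get(..len)?;
    std::str::from_utf8(bytes).ok().map(|s| (tag, s))
}
```

Encoding "hello world" this way yields 1 + 8 + 11 = 20 bytes, and the 36-character UUID yields 1 + 8 + 36 = 45 bytes, matching the sizes observed in the original report.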
> Strings have to serialize their length. The length is stored as a 64-bit integer, therefore the additional 8 bytes.
👍
> Also see #109 which would mean the user is responsible for any serialization/deserialization.
Agreed, this overhead could be undesirable if the consumer only wants to store binary blobs. Even for string values, particularly short ones, that 8-byte overhead can make map size estimation trickier for consumers.