rkv Bincode::serialize generates much bigger results on String types


I noticed this while investigating this TODO item. The current serialization mechanism (serializing a two-element tuple, i.e. (type, value)) seems to introduce a significant amount of overhead for String values.

Here are some examples:

serialize(&(1u8, true)).len() -> 2 // actual size: 2 (1 + 1)
serialize(&(2u8, 1e+9)).len() -> 9 // actual size: 9 (1 + 8)
serialize(&(3u8, "hello world".to_string())).len() -> 20 // actual size: 12 (1 + 11)
serialize(&(4u8, "4dd69e99-07e7-c040-a514-ccde0cfd4781".to_string())).len() -> 45 // actual size: 37 (1 + 36)

I'm not sure whether this is caused by padding or by the serialization format itself, but I think it's worth further investigation.

Alternatively, we could write the type and value directly to a buffer and pass the result to the put function. For big values, we can avoid the double allocation by leveraging LMDB's "MDB_RESERVE" feature, which reserves enough space for the value and returns the buffer so that the caller can populate it afterwards. The following snippet illustrates the basic idea:

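// Sketch only: `self.txn.put` and `self.txn.reserve` stand in for the real transaction API.
fn put(&mut self, key: &[u8], type_tag: u8, value: &[u8]) -> Result<(), Error> {
    // say BIG_VALUE_THRESHOLD = 32
    let length = value.len() + 1; // value size in bytes + type size

    if length < BIG_VALUE_THRESHOLD {
        // Small value: assemble it in a stack buffer, then copy it in with a regular put.
        let mut buf = [0u8; BIG_VALUE_THRESHOLD];
        buf[0] = type_tag;
        buf[1..length].copy_from_slice(value);
        self.txn.put(key, &buf[..length])?;
    } else {
        // Big value: have the store reserve the space (MDB_RESERVE) and fill it in place.
        let reserved_buf = self.txn.reserve(key, length)?;
        reserved_buf[0] = type_tag;
        reserved_buf[1..].copy_from_slice(value);
    }
    Ok(())
}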
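
The encoding half of this idea can be tried without LMDB. The following is a minimal sketch using only the standard library; `write_tagged` is a hypothetical helper name, and the `Vec` merely stands in for whatever buffer a reserve-style call would hand back.

```rust
/// Hypothetical helper: writes a one-byte type tag followed by the raw value
/// bytes into `buf` and returns the number of bytes written.
/// Panics if `buf` is shorter than `value.len() + 1`.
fn write_tagged(buf: &mut [u8], type_tag: u8, value: &[u8]) -> usize {
    let length = value.len() + 1; // value size + type size
    buf[0] = type_tag;
    buf[1..length].copy_from_slice(value);
    length
}

fn main() {
    let value = "hello world".as_bytes();
    // Stand-in for a buffer reserved by the store (assumption, not the rkv API).
    let mut reserved = vec![0u8; value.len() + 1];
    let written = write_tagged(&mut reserved, 3, value);
    println!("{} bytes written", written);
}
```

With this layout there is no length prefix, so a reader would have to take the value length from the size of the stored record.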

Nov 23 '18 21:11 ncloudioj

Strings have to serialize their length. The length is stored as a 64-bit integer, therefore the additional 8 bytes. This simplifies deserialization.
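
To make the arithmetic concrete, here is a hand-rolled sketch of that layout using only the standard library. It does not call bincode; `encode_tagged_string` is a hypothetical name, and the little-endian byte order is an assumption that does not affect the sizes.

```rust
/// Hand-rolled stand-in for serializing a `(u8, String)` tuple when the
/// string is prefixed with its length as a 64-bit integer.
fn encode_tagged_string(type_tag: u8, value: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + 8 + value.len());
    out.push(type_tag); // 1 byte: type tag
    out.extend_from_slice(&(value.len() as u64).to_le_bytes()); // 8 bytes: length
    out.extend_from_slice(value.as_bytes()); // N bytes: string contents
    out
}

fn main() {
    let short = encode_tagged_string(3, "hello world");
    let uuid = encode_tagged_string(4, "4dd69e99-07e7-c040-a514-ccde0cfd4781");
    // 1 + 8 + 11 = 20 and 1 + 8 + 36 = 45, matching the sizes reported above.
    println!("{} {}", short.len(), uuid.len());
}
```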

Also see #109 which would mean the user is responsible for any serialization/deserialization.

Apr 25 '19 08:04 badboy

Strings have to serialize their length. The length is stored as a 64-bit integer, therefore the additional 8 bytes.

👍

Also see #109 which would mean the user is responsible for any serialization/deserialization.

Agreed, this overhead may be undesirable if the consumer only wants to store binary blobs. Even for string values, particularly short ones, that 8-byte overhead can make map size estimation trickier for consumers.

Apr 25 '19 13:04 ncloudioj
