
Capacity allocation for vector deserialization

Open ilblackdragon opened this issue 4 years ago • 4 comments

Currently, because we don't know the size of the buffer inside the deserialize method, we can't tell whether the length prefix is implausibly large and error out early (right now it can crash due to a memory allocation failure).

See test_invalid_length for example.
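To illustrate the problem, here is a hypothetical sketch of the vulnerable pattern (not borsh's actual implementation): the deserializer trusts the length prefix and allocates before a single element has been read, so a corrupted prefix such as `u32::MAX` requests a multi-gigabyte allocation up front.

```rust
use std::io::{Cursor, Read};

// Hypothetical sketch of the vulnerable pattern. The length prefix is
// attacker-controlled, and `Vec::with_capacity` allocates eagerly based on
// it, before any element is read from the buffer.
fn deserialize_vec_u8(cursor: &mut Cursor<&[u8]>) -> std::io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    cursor.read_exact(&mut len_buf)?;
    let len = u32::from_le_bytes(len_buf) as usize;
    // DANGER: a corrupted `len` triggers a huge allocation right here.
    let mut result = Vec::with_capacity(len);
    for _ in 0..len {
        let mut byte = [0u8; 1];
        // This errors out once the buffer is exhausted, but by then the
        // allocation has already happened.
        cursor.read_exact(&mut byte)?;
        result.push(byte[0]);
    }
    Ok(result)
}
```

With a well-formed input the function behaves normally; the failure mode only appears when the prefix disagrees with the actual payload size.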

ilblackdragon avatar Sep 13 '19 05:09 ilblackdragon

Currently, because we don't know the size of the buffer inside the deserialize method, we can't tell whether the length prefix is implausibly large and error out early

One way is to wrap the cursor with something like this:

use std::io;

/// A reader adapter that fails once more than `limit` bytes have been requested.
pub struct LimitingReader<R> {
    reader: R,
    limit: u64,
}

impl<R> LimitingReader<R> {
    pub fn new(reader: R, limit: u64) -> Self {
        LimitingReader { reader, limit }
    }
}

impl<R: io::Read> io::Read for LimitingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if buf.len() as u64 > self.limit {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "Read limit exceeded",
            ));
        }
        let n = self.reader.read(buf)?;
        // Only count bytes actually read: `read` may return fewer bytes
        // than the caller requested.
        self.limit -= n as u64;
        Ok(n)
    }
}
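For comparison, the standard library's `Read::take` adapter provides a similar cap out of the box, though it returns `Ok(0)` at the limit rather than an error, so a deserializer would still have to treat the resulting short read as invalid:

```rust
use std::io::{Cursor, Read};

// Reading through std's `take` adapter: it caps how many bytes can be read,
// silently returning Ok(0) once the limit is reached.
fn read_limited(data: &[u8], limit: u64) -> Vec<u8> {
    let mut limited = Cursor::new(data).take(limit);
    let mut buf = Vec::new();
    limited.read_to_end(&mut buf).expect("read failed");
    buf
}
```

The custom `LimitingReader` above differs in that it fails loudly instead of truncating, which is the safer behaviour for a deserializer.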

lexfrl avatar Sep 20 '19 17:09 lexfrl

We should probably rewrite borsh to use slices instead of Read/Write. It might give us a significant performance boost. In that case, we could wrap the slice into LimitingReader instead.
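A slice-based reader would also address the capacity problem directly, because the total input length is known up front. A minimal sketch, with hypothetical names (this is not borsh's API):

```rust
// Hypothetical sketch of a slice-based reader: since the full input length
// is known, a Vec deserializer can reject a length prefix that claims more
// bytes than the remaining input could possibly hold.
struct SliceReader<'a> {
    buf: &'a [u8],
}

impl<'a> SliceReader<'a> {
    fn new(buf: &'a [u8]) -> Self {
        SliceReader { buf }
    }

    fn read_u32(&mut self) -> Result<u32, String> {
        if self.buf.len() < 4 {
            return Err("unexpected end of input".to_string());
        }
        let (head, tail) = self.buf.split_at(4);
        self.buf = tail;
        Ok(u32::from_le_bytes([head[0], head[1], head[2], head[3]]))
    }

    fn read_vec_u8(&mut self) -> Result<Vec<u8>, String> {
        let len = self.read_u32()? as usize;
        // The remaining slice bounds the allocation: a corrupted length
        // prefix fails here instead of triggering a huge allocation.
        if len > self.buf.len() {
            return Err("declared length exceeds remaining input".to_string());
        }
        let (head, tail) = self.buf.split_at(len);
        self.buf = tail;
        Ok(head.to_vec())
    }
}
```

Note how the bounds check costs a single comparison, with no wrapper type needed.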

MaksymZavershynskyi avatar Oct 22 '19 22:10 MaksymZavershynskyi

The suggestion to use LimitingReader might refer to two different use cases:

  1. Making sure that as we deserialize a data structure, the deserialized object is less than X bytes, even if the serialized form is valid, e.g. not corrupted;
  2. Making sure that as we deserialize a data structure, an invalid serialized form (e.g. due to corruption) does not cause us to allocate a Vec<> of length larger than the actual number of elements available in the serialized form.

As now discussed with @ilblackdragon, protecting from (1) does not make sense. In most cases the data structure we are deserializing into is not many orders of magnitude larger than its serialized form, so if we allowed allocating the serialized struct in the first place (e.g. when receiving it from RPC or the network), then it is safe to allow its deserialization. The deserializers are implemented either a) in borsh, for primitive types, or b) in nearcore, for our own types, so we know they are not doing something crazy. It is, however, up to any third-party developer using borsh not to implement a deserializer that decodes 10 bytes into 10 GiB under some condition.

Protecting from (2) already seems to be handled by fixes like https://github.com/nearprotocol/borsh/pull/28/
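The general shape of such a fix (a sketch of the idea, not the literal code from that PR) is to cap the initial capacity at what the remaining input could actually contain, and let the Vec grow normally from there:

```rust
// Sketch of the protection against case (2): never reserve more capacity
// than the remaining input could possibly hold. An input of
// `remaining_bytes` bytes contains at most remaining_bytes / elem_size
// elements, regardless of what the length prefix claims.
fn safe_capacity(declared_len: usize, elem_size: usize, remaining_bytes: usize) -> usize {
    // max(1) guards against zero-sized element types.
    declared_len.min(remaining_bytes / elem_size.max(1))
}
```

With this cap, a corrupted prefix of `u32::MAX` on a 100-byte input reserves at most a handful of elements instead of gigabytes.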

We still need to go over the deserializers implemented in borsh and nearcore one more time to make sure we are protected from (2). Action item for @fckt

MaksymZavershynskyi avatar Oct 23 '19 18:10 MaksymZavershynskyi

I just bumped into this. 42% of my application runtime is consumed by __rust_realloc due to the conservative capacity passed to Vec::with_capacity.

In my use case there's no risk of either (1) or (2) from @nearmax's comment. Perhaps the conservative behaviour could be made configurable?
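A small experiment illustrating the cost being described (a hypothetical helper, not borsh code): pushing into a Vec whose initial capacity is deliberately conservative triggers repeated reallocations, while reserving the full length up front allocates once.

```rust
// Count how many times a Vec reallocates while pushing `len` bytes,
// starting from a given initial capacity. A conservative starting capacity
// forces the doubling growth path and repeated calls into the allocator.
fn count_reallocs(len: usize, initial_capacity: usize) -> usize {
    let mut v: Vec<u8> = Vec::with_capacity(initial_capacity);
    let mut reallocs = 0;
    let mut cap = v.capacity();
    for i in 0..len {
        v.push(i as u8);
        if v.capacity() != cap {
            reallocs += 1;
            cap = v.capacity();
        }
    }
    reallocs
}
```

Reserving the exact length up front yields zero reallocations, which is why a capped capacity hurts when the declared length is trustworthy.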

alecmocatta avatar Dec 20 '20 15:12 alecmocatta