avro-rs
Reduce allocations in reading
This is a follow-up to #193.
My understanding is that this crate is optimized for reading into the owning enum Value, which encapsulates all types, much like serde_json does.
This design makes it expensive to deserialize into other formats. Specifically, the main challenge is that, given rows of strings, each row has to allocate a new String because Value owns its String, even when the values were already allocated into a Vec<u8> during buffering + decompression (see Reader and Block). This happens for all non-copy types (String, Bytes, Fixed and Enum).
Formats such as Arrow have their own way of storing strings. Thus, currently, using Value::String(String) adds a large overhead (I can quantify it if needed) when interoperating. Specifically, the data path from a Read of an uncompressed file with 1 column of strings to Value is:
Read - [read] -> Vec<u8> - [decompress] -> Vec<u8> - [decode] -> String - [move] -> Value
This yields an extra allocation per row per column for non-copy types (String, Bytes, Fixed and Enum).
Usually, it is more efficient to read into a Vec<u8> and produce non-owned values from it; in this case the path would be
Read - [read] -> Vec<u8> - [decompress] -> Vec<u8> - [decode] -> &'a str -> Value<'a>
(There are other tricks to avoid re-allocating a Vec<u8> per Block + decompression, which can be discussed separately.)
One way to address this is to declare a "non-owned" Value (e.g. ValueRef::String(&'a str)) and have the decoder be of the form decode<'a>(&'a [u8]) -> Result<(ValueRef<'a>, usize)>, where usize is the number of bytes read.
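Roughly, something like the sketch below, assuming the crate's existing Schema and Error types; the names ValueRef and decode, the variant shapes, and the elided body are illustrative, not a proposal for the exact API:

```rust
use avro_rs::{Error, Schema};

/// Hypothetical borrowed counterpart to Value (name and variants are a sketch).
#[derive(Debug)]
pub enum ValueRef<'a> {
    Null,
    Boolean(bool),
    Int(i32),
    Long(i64),
    Float(f32),
    Double(f64),
    Bytes(&'a [u8]),
    String(&'a str),
    Fixed(usize, &'a [u8]),
    Enum(i32, &'a str),
    Array(Vec<ValueRef<'a>>),
    // Map / Record / Union variants elided for brevity.
}

/// Decode one value from the already-buffered (and decompressed) block,
/// borrowing from it, and return how many bytes were consumed so the caller
/// can advance its offset into the buffer.
pub fn decode<'a>(_schema: &Schema, _buf: &'a [u8]) -> Result<(ValueRef<'a>, usize), Error> {
    // Each branch would parse one value in place; the non-copy variants
    // (String, Bytes, Fixed, Enum) would simply point back into the buffer.
    todo!()
}
```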
Since each row block declares its number of bytes upfront and we already read it into a buf: Vec<u8> as part of buffering in Reader + decompression, we can leverage the existence of this buf to decode into non-owning ValueRefs, thereby avoiding the allocation per row per column.
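Continuing the sketch above, the reading side would walk that buffer with an offset; this is where the usize returned by the decoder comes in (decode_block is a hypothetical helper, and the per-block row count comes from the block header that Reader already parses):

```rust
/// Decode all rows of one block from its decompressed buffer, producing
/// borrowed values; only the Vec holding the ValueRefs is allocated here.
fn decode_block<'a>(
    schema: &Schema,
    buf: &'a [u8],
    rows: usize,
) -> Result<Vec<ValueRef<'a>>, Error> {
    let mut offset = 0;
    let mut out = Vec::with_capacity(rows);
    for _ in 0..rows {
        // `consumed` is why the decoder returns a usize: it tells us where
        // the next row starts inside the same buffer.
        let (value, consumed) = decode(schema, &buf[offset..])?;
        offset += consumed;
        out.push(value);
    }
    Ok(out)
}
```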
We can offer impl<'a> From<ValueRef<'a>> for Value that calls to_string() (and the equivalent for other types such as Bytes), if the user wishes to drop the lifetime at the cost of an allocation.
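Continuing the sketch, such a conversion could look roughly like the following (the Value variant shapes are taken from the current types::Value enum; the allocations become explicit and opt-in):

```rust
use avro_rs::types::Value;

impl<'a> From<ValueRef<'a>> for Value {
    fn from(v: ValueRef<'a>) -> Self {
        match v {
            ValueRef::Null => Value::Null,
            ValueRef::Boolean(b) => Value::Boolean(b),
            ValueRef::Int(n) => Value::Int(n),
            ValueRef::Long(n) => Value::Long(n),
            ValueRef::Float(f) => Value::Float(f),
            ValueRef::Double(f) => Value::Double(f),
            // The four non-copy types pay for their allocation only here.
            ValueRef::Bytes(b) => Value::Bytes(b.to_vec()),
            ValueRef::String(s) => Value::String(s.to_string()),
            ValueRef::Fixed(n, b) => Value::Fixed(n, b.to_vec()),
            ValueRef::Enum(i, s) => Value::Enum(i, s.to_string()),
            ValueRef::Array(items) => {
                Value::Array(items.into_iter().map(Value::from).collect())
            }
        }
    }
}
```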
This is a pretty large change, though, because we would also need to change the signatures of zag_i64 and the like to read from &[u8] instead of R: Read, since they must return how many bytes were consumed in order to advance the &[u8].
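For example, a slice-based counterpart to zag_i64 could look like this (the function name and error type are illustrative; the point is the (value, bytes consumed) return):

```rust
/// Decode a zig-zag encoded varint from the start of `buf`, returning the
/// value and the number of bytes consumed.
fn zag_i64_slice(buf: &[u8]) -> Result<(i64, usize), &'static str> {
    let mut value: u64 = 0;
    let mut shift: u32 = 0;
    for (i, &byte) in buf.iter().enumerate() {
        if shift >= 64 {
            return Err("varint is too long");
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            // zig-zag decode: map the interleaved unsigned value back to signed.
            let decoded = (value >> 1) as i64 ^ -((value & 1) as i64);
            return Ok((decoded, i + 1));
        }
        shift += 7;
    }
    Err("unexpected end of buffer")
}

fn main() {
    // 150 zig-zag encodes to 300, whose varint encoding is [0xAC, 0x02].
    assert_eq!(zag_i64_slice(&[0xAC, 0x02]), Ok((150, 2)));
    // -1 zig-zag encodes to 1, a single byte.
    assert_eq!(zag_i64_slice(&[0x01]), Ok((-1, 1)));
}
```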
I recently did this type of exercise as part of a complete rewrite of the parquet crate, parquet2, where this exact same problem emerges: a column is composed of pages (avro's "blocks") that are decompressed and decoded individually; it is advantageous to read each page into a Vec<u8> before decompressing and decoding it, but we should not perform further allocations when they are not needed, especially for strings, bytes, etc.
The result of that rewrite was a 5-10x speedup, which is why I am cautiously confident that this effort may be worthwhile here too.
There's the same kind of suboptimal allocation behavior in the serde module, e.g. it uses visit_str instead of visit_borrowed_str. It looks like this (and possibly other issues) should be ported over to the Jira repo.
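To illustrate why this matters, here is a minimal, self-contained deserializer (not the avro-rs serde module, just a demonstration): deserializing into &str only succeeds when the data is handed out with the 'de lifetime via visit_borrowed_str, whereas visit_str only offers a transient &str and forces a copy.

```rust
use serde::de::{self, Deserializer, Visitor};
use serde::forward_to_deserialize_any;

/// Toy deserializer over a borrowed string slice.
struct StrDeserializer<'de> {
    input: &'de str,
}

impl<'de> Deserializer<'de> for StrDeserializer<'de> {
    type Error = de::value::Error;

    fn deserialize_any<V>(self, visitor: V) -> Result<V::Value, Self::Error>
    where
        V: Visitor<'de>,
    {
        // visit_borrowed_str lets the visitor keep the 'de lifetime, so
        // deserializing to &'de str or Cow<'de, str> needs no allocation.
        // visit_str would only hand out a transient &str, forcing a copy.
        visitor.visit_borrowed_str(self.input)
    }

    forward_to_deserialize_any! {
        bool i8 i16 i32 i64 i128 u8 u16 u32 u64 u128 f32 f64 char str string
        bytes byte_buf option unit unit_struct newtype_struct seq tuple
        tuple_struct map struct enum identifier ignored_any
    }
}

fn main() {
    use serde::Deserialize;
    let de = StrDeserializer { input: "hello" };
    // Borrowing works because the deserializer called visit_borrowed_str.
    let s: &str = Deserialize::deserialize(de).unwrap();
    assert_eq!(s, "hello");
}
```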