AVRO-4134: [C++] Allocate std::vector block by block when decoding
What is the purpose of the change
This pull request improves std::vector decoding by reserving capacity block by block instead of relying solely on the vector's default growth strategy.
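For context, here is a minimal sketch of the idea, paraphrasing the decoding loop in include/avro/Specific.hh (the name decodeVector is illustrative; the real code lives in a codec_traits specialization and the exact PR diff may differ):

```cpp
#include <avro/Decoder.hh>
#include <avro/Specific.hh>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative sketch: reserve capacity once per decoded block instead of
// letting push_back grow the vector on its own.
template<typename T>
void decodeVector(avro::Decoder& d, std::vector<T>& s) {
    s.clear();
    for (std::size_t n = d.arrayStart(); n != 0; n = d.arrayNext()) {
        s.reserve(s.size() + n); // grow by exactly one block's worth of items
        for (std::size_t i = 0; i < n; ++i) {
            T t;
            avro::decode(d, t);
            s.push_back(std::move(t));
        }
    }
}
```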
Verifying this change
This change is already covered by existing tests, such as testArray() in test/SpecificTests.cc
Documentation
- Does this pull request introduce a new feature? no
Have you actually tested whether this is faster? Calling reserve in a loop looks to me like it could actually make things slower, since it reserves exactly the requested capacity, leading to a potential reallocation on every loop iteration, while push_back over-reserves exponentially, leading to potentially fewer reallocations. See https://www.youtube.com/watch?v=algDLvbl1YY for an in-depth video about this.
The short and rigorous answer is: no.
However, it is hard to give a clear answer about whether it is faster or slower because, as often, it depends... For example, if one decides to encode a very large array in blocks of one item each, this change would be dramatic. Conversely, if one decides to encode a very large array as a single block containing all items, this change would be ideal. As a rule of thumb, the more blocks an array is split into, the worse this change performs.
Now, if we try to guess what AVRO users will do (knowing that they can ultimately do exactly what they want), I think the current AVRO C++ code encodes an array as a single block, which is the "perfect" case for this change. See https://github.com/apache/avro/blob/9110c693767c1dde2665b2b57939333478b12036/lang/c%2B%2B/include/avro/Specific.hh#L238-L249 where we only see a single call to e.setItemCount(b.size());
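For reference, a paraphrase of that encoder from memory (follow the link above for the exact code):

```cpp
#include <avro/Encoder.hh>
#include <avro/Specific.hh>
#include <vector>

// Paraphrased sketch of the std::vector<T> encoder: the whole vector is
// written as one array block, so setItemCount() is called at most once.
template<typename T>
void encodeVector(avro::Encoder& e, const std::vector<T>& b) {
    e.arrayStart();
    if (!b.empty()) {
        e.setItemCount(b.size()); // a single block holding every item
        for (const auto& item : b) {
            e.startItem();
            avro::encode(e, item);
        }
    }
    e.arrayEnd();
}
```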
Consequently, when using the current AVRO code to encode arrays, you end up with a single-block array, which makes the "problematic" decoding loop irrelevant.
I think that this change is clearly faster for the default usage of the AVRO C++ library. However, it does have drawbacks and can be worse for some exotic and/or proprietary and/or future array encodings.
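To make "exotic" concrete, here is a hypothetical writer that emits one block per item, which is the worst case for per-block reservation (I have not checked whether the validating encoders accept repeated setItemCount() calls, but the binary array encoding itself allows multiple blocks):

```cpp
// Hypothetical multi-block encoding: one array block per item.
// A decoder that calls reserve() per block would then reserve once per item.
template<typename T>
void encodeVectorOneItemPerBlock(avro::Encoder& e, const std::vector<T>& b) {
    e.arrayStart();
    for (const auto& item : b) {
        e.setItemCount(1);
        e.startItem();
        avro::encode(e, item);
    }
    e.arrayEnd();
}
```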
Right, given the encoder this then makes sense.
Alternatively, you could iterate twice:
s.clear();
std::size_t new_size = 0;
// First pass: sum the block sizes to learn the total item count.
for (std::size_t n = d.arrayStart(); n != 0; n = d.arrayNext()) {
    new_size += n;
}
s.reserve(new_size);
// Second pass: decode the items into the pre-sized vector.
for (std::size_t n = d.arrayStart(); n != 0; n = d.arrayNext()) {
    for (std::size_t i = 0; i < n; ++i) {
        T t;
        avro::decode(d, t);
        s.emplace_back(std::move(t));
    }
}
But since I don't know how costly the arrayStart() method is, this might well end up being even more expensive.
Indeed, this is a more general solution, but it would need either to read the entire array twice or to "seekg" (i.e. set the input position indicator of the input stream) several times, which may impact performance either way. In both cases, we would need to seekg back to the beginning of the array before the second iteration. We would also need to change the underlying InputStream to a SeekableInputStream, if I understand correctly.
The change would also be heavier and further from the corresponding JIRA issue, and I don't know off the top of my head whether it would generally be faster or not. For a single-block array, which is what the current encoding implementation produces, I guess it would be slightly slower because of the seekg cost.
Thanks for this contribution :D I'm not a C++ dev!
Can you tell me if this change should be cherry-picked to 1.12.x and/or 1.11.x for a new release, or is it for 1.13.0?
@stephanlachnit I see a :+1: -- are you happy with the PR?
First, many thanks for putting in the effort on the PR. But I have to ask, was this change analysed for performance? I hope so since that was the whole point of the issue I raised.
Mentally "analysed" : yes on my side as discussed in my above post. Analysed implementing a benchmark : no on my side. To me it does not deserve the time I would have to spend on it. But happy if someone else makes it.
> Can you tell me if this change should be cherry-picked to 1.12.x and/or 1.11.x for a new release, or is it for 1.13.0?
IMHO, it would be ideal to cherry-pick it onto the other dev branches since it generally provides only advantages, but this is not required. I don't know if there are official rules for deciding whether it deserves a cherry-pick. If so, I am happy to help answer questions about those rules if you can point me to them.
IMO it should be 1.13.0 only. Preallocating like this makes the library more vulnerable to malicious data that claims to contain a huge number of array elements and thereby causes the library to allocate a lot of memory, without the attacker spending any resources to actually send that much data. https://github.com/apache/avro/pull/3403 disclaims responsibility for using the library on untrusted data, but users of older branches, which were published before that text was added, may still be doing so. Moreover, https://issues.apache.org/jira/plugins/servlet/mobile#issue/AVRO-4134 is categorised as a trivial improvement rather than a bug.
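To illustrate the concern with a hypothetical sketch (not the library's actual code; d and s are the decoder and vector from the earlier sketch): with per-block reservation, a few bytes claiming a huge item count can force a large allocation before any item data has been read:

```cpp
// Hypothetical: n comes straight from the (possibly malicious) input.
std::size_t n = d.arrayStart();  // e.g. a forged count of 1'000'000'000
s.reserve(s.size() + n);         // attempts to allocate ~n * sizeof(T) up front
// ...only afterwards does decoding the (truncated) items fail.
```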
As the person who originally raised this issue, I want to chip in to agree with KalleOlavNiemitalo. It makes sense to do the change in or beyond the version where that documentation was changed, rather than earlier. I can wait for 1.13.0, that's fine.