GH-43070: [C++][Parquet] Check for valid ciphertext length to prevent segfault
Rationale for this change
See #43070
What changes are included in this PR?
Checks that the ciphertext length is at least enough to hold the length (if written), nonce and GCM tag for the GCM cipher type.
Also enforces that the input ciphertext length parameter is provided (is > 0) and verifies that the ciphertext size read from the file isn't going to cause reads beyond the end of the ciphertext buffer.
Are these changes tested?
Yes I've added new unit tests for this.
Are there any user-facing changes?
No
- GitHub Issue: #43070
:warning: GitHub issue #43070 has been automatically assigned in GitHub to PR creator.
Ah it looks like all the Windows builds are failing as I'm using non-exported classes in the new tests. Would it make sense to add PARQUET_EXPORT to these, or should I not be testing these internal classes? I think it would be quite difficult to add tests for this change at a higher level.
The code LGTM, but I'm not familiar with decrypt module. So @wgtmac @pitrou for help
Another potential weak point is this: https://github.com/apache/arrow/blob/1da71ba41ce8844eb89935bce44e14d966a262e6/cpp/src/parquet/thrift_internal.h#L415-L417
and of course this part where we totally ignore the physical buffer length, letting the Decrypt function happily read past the end of the buffer:
https://github.com/apache/arrow/blob/1da71ba41ce8844eb89935bce44e14d966a262e6/cpp/src/parquet/thrift_internal.h#L419-L420
All in all, this warrants a refactor for sanity and robustness.
@pitrou, I think I've addressed all of your comments now thank you
- I've added extra validation of the buffer sizes in various places
- I've switched from raw pointers to
arrow::util::span(I think this is more appropriate thanstd::arraywhich is for fixed length arrays, but let me know if I've misunderstood) - The ciphertext size is no longer optional, although it's no longer checked for an exact match, the actual ciphertext length might be less than the buffer size
- I've pulled out the length reading and validation into a shared method
- I've refactored how the plaintext/ciphertext length conversions are handled by adding methods for these rather than adding or subtracting the size delta in consumer code
I haven't touched the AesEncryptor API at all, but it would probably make sense to follow up after this to at least change that to use arrow::util::span too for consistency.
@adamreeve I just want to let you know that I'm currently sick and may not be able to review this before the next week. Thanks for doing this!
OK no problem, thanks for letting me know, and I hope you're feeling better soon.
@mapleFU @wgtmac is there any possibility you could take a look on this? Otherwise this fix will have to miss the 17.0.0 release unfortunately
I'll try to review this today.
CI failures are unrelated. I will merge it shortly.
@raulcd Please feel free to port it to maint-17.0.0
(Gang says he has api error when run merge script so I merged this, lol)
Thanks both for jumping in on the review and thanks @adamreeve for the PR
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit eadeb743151f5b1ded3d04d9a65185ff872d288e.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them.