arrow GH-43070: [C++][Parquet] Check for valid ciphertext length to prevent segfault

Rationale for this change

See #43070

What changes are included in this PR?

Checks that the ciphertext length is at least enough to hold the length (if written), nonce and GCM tag for the GCM cipher type.

Also enforces that the input ciphertext length parameter is provided (is > 0) and verifies that the ciphertext size read from the file isn't going to cause reads beyond the end of the ciphertext buffer.

Are these changes tested?

Yes I've added new unit tests for this.

Are there any user-facing changes?

No

GitHub Issue: #43070

Jun 26 '24 23:06 adamreeve

:warning: GitHub issue #43070 has been automatically assigned in GitHub to PR creator.

Jun 26 '24 23:06 github-actions[bot]

Ah it looks like all the Windows builds are failing as I'm using non-exported classes in the new tests. Would it make sense to add PARQUET_EXPORT to these, or should I not be testing these internal classes? I think it would be quite difficult to add tests for this change at a higher level.

Jun 27 '24 00:06 adamreeve

The code LGTM, but I'm not familiar with decrypt module. So @wgtmac @pitrou for help

Jun 27 '24 02:06 mapleFU

Another potential weak point is this: https://github.com/apache/arrow/blob/1da71ba41ce8844eb89935bce44e14d966a262e6/cpp/src/parquet/thrift_internal.h#L415-L417

and of course this part where we totally ignore the physical buffer length, letting the Decrypt function happily read past the end of the buffer: https://github.com/apache/arrow/blob/1da71ba41ce8844eb89935bce44e14d966a262e6/cpp/src/parquet/thrift_internal.h#L419-L420

All in all, this warrants a refactor for sanity and robustness.

Jun 27 '24 07:06 pitrou

@pitrou, I think I've addressed all of your comments now thank you

I've added extra validation of the buffer sizes in various places
I've switched from raw pointers to arrow::util::span (I think this is more appropriate than std::array which is for fixed length arrays, but let me know if I've misunderstood)
The ciphertext size is no longer optional, although it's no longer checked for an exact match, the actual ciphertext length might be less than the buffer size
I've pulled out the length reading and validation into a shared method
I've refactored how the plaintext/ciphertext length conversions are handled by adding methods for these rather than adding or subtracting the size delta in consumer code

I haven't touched the AesEncryptor API at all, but it would probably make sense to follow up after this to at least change that to use arrow::util::span too for consistency.

Jul 03 '24 04:07 adamreeve

@adamreeve I just want to let you know that I'm currently sick and may not be able to review this before the next week. Thanks for doing this!

Jul 03 '24 09:07 pitrou

OK no problem, thanks for letting me know, and I hope you're feeling better soon.

Jul 03 '24 09:07 adamreeve

@mapleFU @wgtmac is there any possibility you could take a look on this? Otherwise this fix will have to miss the 17.0.0 release unfortunately

Jul 03 '24 09:07 raulcd

I'll try to review this today.

Jul 03 '24 10:07 wgtmac

CI failures are unrelated. I will merge it shortly.

Jul 04 '24 05:07 wgtmac

@raulcd Please feel free to port it to maint-17.0.0

Jul 04 '24 05:07 wgtmac

(Gang says he has api error when run merge script so I merged this, lol)

Jul 04 '24 05:07 mapleFU

Thanks both for jumping in on the review and thanks @adamreeve for the PR

Jul 04 '24 08:07 raulcd

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit eadeb743151f5b1ded3d04d9a65185ff872d288e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them.

Jul 04 '24 13:07 conbench-apache-arrow[bot]