GH-39444: [C++][Parquet] Fix Segment Fault in Modular Encryption
Rationale for this change:
This pull request addresses GH-39444, a critical issue in the C++/Python components of Parquet: a segmentation fault when processing encrypted datasets with more than 2^15 rows. The fix modifies cpp/src/parquet/encryption/internal_file_decryptor.cc, particularly InternalFileDecryptor::GetColumnDecryptor. Caching of the Decryptor object was removed, since the shared cached instance was not safe to use from multiple threads and caused the segmentation fault and encryption failures.
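The shape of the change can be sketched as follows. This is a simplified, hypothetical illustration (the real code lives in internal_file_decryptor.cc and uses different types): the old path handed every caller the same cached Decryptor, so concurrent readers shared mutable cipher state; the new path constructs a fresh instance per call.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for a decryptor whose internal cipher state is not thread-safe.
struct Decryptor {
  std::string key;
};

class FileDecryptor {
 public:
  // Old behavior (simplified): cache and share one instance per column,
  // which is unsafe when multiple reader threads use it concurrently.
  std::shared_ptr<Decryptor> GetColumnDecryptorCached(const std::string& col) {
    auto it = cache_.find(col);
    if (it != cache_.end()) return it->second;
    auto d = std::make_shared<Decryptor>(Decryptor{"key:" + col});
    cache_.emplace(col, d);
    return d;
  }

  // New behavior (as in the PR): always build a fresh instance, so no
  // decryptor state is ever shared across threads.
  std::shared_ptr<Decryptor> GetColumnDecryptor(const std::string& col) {
    return std::make_shared<Decryptor>(Decryptor{"key:" + col});
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<Decryptor>> cache_;
};
```

The trade-off is that anything expensive done per construction (such as key retrieval) now happens on every call; that cost is discussed further down in the thread.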
What changes are included in this PR?
- Removal of Decryptor object caching in InternalFileDecryptor::GetColumnDecryptor.
- Addition of two unit tests: large_row_parquet_encrypt_test.cc for C++, and an update to test_dataset_encryption.py adding test_large_row_encryption_decryption for Python.
Are these changes tested?
Yes, the unit tests (large_row_parquet_encrypt_test.cc and test_large_row_encryption_decryption in test_dataset_encryption.py) have been added to ensure the reliability and effectiveness of these changes.
Are there any user-facing changes?
No significant user-facing changes, but the update improves the backend stability and reliability of encrypted Parquet file handling. Note that with the Decryptor cache removed, DecryptionKeyRetriever::GetKey may be called more often; this can be an expensive operation, potentially involving network calls to key management servers.
- Closes: #39444
- GitHub Issue: #39444
Thanks for opening a pull request!
If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename the pull request title in the following format?
GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
In the case of PARQUET issues on JIRA the title also supports:
PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
:warning: GitHub issue #39444 has been automatically assigned in GitHub to PR creator.
Looks like I am still randomly getting a segmentation fault in InternalFileDecryptor when WipeOut is called. I'll look into whether I can add a mutex around all_decryptors_.
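The mutex idea mentioned above could look roughly like this. This is a hypothetical sketch with illustrative names (the real all_decryptors_ member and WipeOut logic in Arrow differ): the same lock guards both registration of new decryptors and the wipe, so WipeOut() cannot race with a concurrent registration.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical sketch: serialize access to the shared decryptor list so
// WipeOut() cannot clear it while another thread is registering entries.
class FileDecryptorState {
 public:
  std::shared_ptr<std::string> MakeDecryptor(const std::string& key) {
    auto d = std::make_shared<std::string>(key);
    std::lock_guard<std::mutex> lock(mutex_);
    all_decryptors_.push_back(d);  // registration happens under the lock
    return d;
  }

  void WipeOut() {
    std::lock_guard<std::mutex> lock(mutex_);
    for (auto& weak : all_decryptors_) {
      if (auto d = weak.lock()) d->clear();  // erase key material
    }
    all_decryptors_.clear();
  }

  size_t size() {
    std::lock_guard<std::mutex> lock(mutex_);
    return all_decryptors_.size();
  }

 private:
  std::mutex mutex_;
  std::vector<std::weak_ptr<std::string>> all_decryptors_;
};
```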
It looks like all builds except C++ / AMD64 Windows 2019 C++17 AVX2 completed without issue. I am not sure what the problem with that build is, but it seems unrelated.
The C++ side looks good to me. I have also refactored the C++ tests.
Could you take a look as well? @jorisvandenbossche @westonpace @pitrou
@adamreeve @westonpace Do you want to take another pass?
@github-actions crossbow submit -g python -g cpp
Revision: abf8c40392d5639cb139d91bb5cc70660a8143d9
Submitted crossbow builds: ursacomputing/crossbow @ actions-e56b1c9519
Question: the caching was probably there for a reason; is it OK to remove it? Would it be beneficial to reimplement it in a thread-safe way? (The PR description mentions network requests.)
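If caching were reintroduced, one thread-safe shape would be a mutex-guarded map in front of the expensive key retrieval. This is only a sketch under assumed names (ThreadSafeKeyCache and the fetch callback are illustrative, not Arrow's API): it guarantees the expensive fetch, e.g. a KMS round trip, runs at most once per key, at the cost of serializing lookups on one lock.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical thread-safe key cache: the fetch callback stands in for an
// expensive DecryptionKeyRetriever::GetKey-style call.
class ThreadSafeKeyCache {
 public:
  explicit ThreadSafeKeyCache(std::function<std::string(const std::string&)> fetch)
      : fetch_(std::move(fetch)) {}

  std::string GetKey(const std::string& id) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(id);
    if (it != cache_.end()) return it->second;  // cache hit: no fetch
    std::string key = fetch_(id);               // expensive call, done once
    cache_.emplace(id, key);
    return key;
  }

 private:
  std::function<std::string(const std::string&)> fetch_;
  std::mutex mutex_;
  std::unordered_map<std::string, std::string> cache_;
};
```

A refinement would be to drop the lock while fetching (or use per-key futures) so slow network calls for different keys do not block each other.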
Hmm, I was assuming the Cython compatibility issues were fixed? @raulcd
That was my understanding, isn't this the fix? https://github.com/apache/arrow/issues/40386 I've added the above to 15.0.2.
Ah, that's because this PR needs rebasing. @tolleybot Can you please rebase/merge on git main?
@github-actions crossbow submit -g python -g cpp
Revision: c8f2cc461eeb3f4172aee0e7990bbe55b338001f
Submitted crossbow builds: ursacomputing/crossbow @ actions-c569d19f76
@raulcd Not a blocker, but if you ever need to do a RC3, this would be a good candidate for inclusion :-)
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit dd6d7288e41d5d5c8bc73dbcf96ddc601db009cc.
There were 8 benchmark results indicating a performance regression:
- Commit Run on ursa-i9-9960x at 2024-03-13 20:30:16Z
- and 6 more (see the report linked below)
The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.