arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-39444: [C++][Parquet] Fix Segment Fault in Modular Encryption

Open tolleybot opened this issue 1 year ago • 14 comments

Rationale for this change:

This pull request addresses a critical issue (GH-39444) in the C++/Python components of Parquet, specifically a segmentation fault occurring when processing encrypted datasets over 2^15 rows. The fix involves modifications in cpp/src/parquet/encryption/internal_file_decryptor.cc, particularly in InternalFileDecryptor::GetColumnDecryptor. The caching of the Decryptor object was removed to resolve the multithreading issue causing the segmentation fault and encryption failures.

What changes are included in this PR?

  • Removal of Decryptor object caching in InternalFileDecryptor::GetColumnDecryptor.
  • Addition of two unit tests: large_row_parquet_encrypt_test.cc for C++ and an update to test_dataset_encryption.py with test_large_row_encryption_decryption for Python.

Are these changes tested?

Yes, the unit tests (large_row_parquet_encrypt_test.cc and test_large_row_encryption_decryption in test_dataset_encryption.py) have been added to ensure the reliability and effectiveness of these changes.

Are there any user-facing changes?

No significant user-facing changes, but the update significantly improves the backend stability and reliability of Parquet file handling. Calling DecryptionKeyRetriever::GetKey could be an expensive operation potentially involving network calls to key management servers.

  • Closes: #39444
  • GitHub Issue: #39444

tolleybot avatar Jan 15 '24 21:01 tolleybot

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Jan 15 '24 21:01 github-actions[bot]

:warning: GitHub issue #39444 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Jan 16 '24 00:01 github-actions[bot]

Looks like I am still getting a segmentation fault randomly in InternalFileDecryptor when WipeOut is called. Ill look into seeing if I can add a mutex around all_decryptors_

tolleybot avatar Jan 16 '24 14:01 tolleybot

It looks like all but the C++ / AMD64 Windows 2019 C++17 AVX2 build was completed without issue. I am not sure what the issue is with the C++ / AMD64 Windows 2019 C++17 AVX2, build but it seems unrelated.

tolleybot avatar Jan 16 '24 18:01 tolleybot

The C++ side looks good to me. I have also refactored the C++ tests.

Could you take a look as well? @jorisvandenbossche @westonpace @pitrou

wgtmac avatar Jan 29 '24 15:01 wgtmac

@adamreeve @westonpace Do you want to take another pass?

wgtmac avatar Mar 12 '24 04:03 wgtmac

@github-actions crossbow submit -g python -g cpp

pitrou avatar Mar 12 '24 15:03 pitrou

Revision: abf8c40392d5639cb139d91bb5cc70660a8143d9

Submitted crossbow builds: ursacomputing/crossbow @ actions-e56b1c9519

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-cuda-cpp GitHub Actions
test-cuda-python GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-debian-11-python-3 Azure
test-fedora-38-cpp GitHub Actions
test-fedora-38-python-3 Azure
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions

github-actions[bot] avatar Mar 12 '24 15:03 github-actions[bot]

Question: the caching was probably there for a reason, is it ok to remove it? Would it be beneficial to reimplement it in a thread-safe way? (the PR description mentions network requests)

pitrou avatar Mar 12 '24 15:03 pitrou

Hmm, I was assuming the Cython compatibility issues were fixed? @raulcd

pitrou avatar Mar 12 '24 15:03 pitrou

Hmm, I was assuming the Cython compatibility issues were fixed? @raulcd

That was my understanding, isn't this the fix? https://github.com/apache/arrow/issues/40386 I've added the above to 15.0.2.

raulcd avatar Mar 12 '24 15:03 raulcd

Ah, that's because this PR needs rebasing. @tolleybot Can you please rebase/merge on git main?

pitrou avatar Mar 12 '24 16:03 pitrou

@github-actions crossbow submit -g python -g cpp

pitrou avatar Mar 12 '24 18:03 pitrou

Revision: c8f2cc461eeb3f4172aee0e7990bbe55b338001f

Submitted crossbow builds: ursacomputing/crossbow @ actions-c569d19f76

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-cuda-cpp GitHub Actions
test-cuda-python GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-debian-11-python-3-amd64 Azure
test-debian-11-python-3-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-fedora-39-python-3 Azure
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

github-actions[bot] avatar Mar 12 '24 18:03 github-actions[bot]

@raulcd Not a blocker, but if you ever need to do a RC3, this would be a good candidate for inclusion :-)

pitrou avatar Mar 13 '24 14:03 pitrou

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit dd6d7288e41d5d5c8bc73dbcf96ddc601db009cc.

There were 8 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.