datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

feat: Parquet modular encryption

Open corwinjoy opened this issue 5 months ago • 4 comments

Which issue does this PR close?

  • Closes #15216.

What changes are included in this PR?

This PR adds support for encryption in DataFusion’s Parquet implementation. The changes introduce new configuration options for file encryption and decryption properties, update various components (including proto conversion, file reading/writing, and tests), and add an end-to-end encrypted Parquet example.

Are these changes tested?

Tests and examples have been added to demonstrate and test functionality versus Parquet modular encryption. These could use feedback since there may be additional DataFusion usage cases that should be covered.

Are there any user-facing changes?

Additional options have been added to allow encryption/decryption configuration. We are soliciting additional feedback on how to handle key columns in a way that best fits the existing API.

Catalog of changes via copilot

Show a summary per file
File Description
docs/source/user-guide/configs.md Added documentation for new encryption and decryption properties.
datafusion/sqllogictest/test_files/information_schema.slt Updated SQL logic tests to include encryption/decryption settings.
datafusion/proto/src/logical_plan/file_formats.rs Added default None values for encryption properties in proto.
datafusion/proto-common/src/from_proto/mod.rs Updated conversion to set encryption properties to None.
datafusion/datasource-parquet/src/source.rs Passed file decryption properties from table options.
datafusion/datasource-parquet/src/opener.rs Disabled page index when encryption is enabled.
datafusion/datasource-parquet/src/file_format.rs Propagated decryption properties in schema and statistics fetching.
datafusion/core/tests/parquet/mod.rs Added the encryption test module reference.
datafusion/core/tests/parquet/encryption.rs Added tests for round-trip encryption.
datafusion/core/tests/parquet/custom_reader.rs Updated custom reader to pass None for decryption properties.
datafusion/core/src/datasource/file_format/parquet.rs Added extra parameters for decryption in metadata/statistics fetching.
datafusion/core/src/dataframe/parquet.rs Added roundtrip test for encrypted Parquet files.
datafusion/common/src/file_options/parquet_writer.rs Updated writer options to support encryption properties.
datafusion/common/Cargo.toml Added the hex dependency.
datafusion-examples/examples/parquet_encrypted.rs Introduced an example that reads/writes encrypted Parquet files.
datafusion-examples/README.md Updated examples list to include the encrypted Parquet demo.
Cargo.toml Enabled encryption feature for the parquet crate.

corwinjoy avatar Jun 10 '25 00:06 corwinjoy

@adamreeve @rok

corwinjoy avatar Jun 10 '25 00:06 corwinjoy

@alamb One piece I would like to solicit feedback on is if there is a way to leverage the existing tests to more thoroughly vet encryption. What I mean by that, is that we uncovered a read bug when using filters in a query, and I worry that there could be other edge cases that might not be covered. What I would like to do is take an encrypted parquet file and then run the datafusion SQL tests over it (and maybe other operation tests). This would help to make sure that all the SQL operations are really covered. And maybe in addition, somehow double-check things like statistics and bloom filters? Anyway, I'm hoping there is a way to leverage the existing test suite to cover these cases. Any suggestions?

corwinjoy avatar Jun 12 '25 22:06 corwinjoy

Thank you and @adamreeve for driving so much of the modular encryption work! I'll take a look at this branch this week and see how this might get Comet supporting modular encryption within Spark, or if any obvious gaps jump out at me.

mbutrovich avatar Jun 16 '25 19:06 mbutrovich

I am sorry I haven't had a chance to review this yet. It would be great if @mbutrovich could also take a look. I have this on my list to review but I haven't been able to find the time yet

alamb avatar Jun 16 '25 21:06 alamb

I've been experimenting with how this work could be extended to support more ways of configuring encryption beyond having fixed and known AES keys for all files. For example, data encryption keys are often randomly generated per file in multi-file datasets, and the keys are stored encrypted in the Parquet file's encryption metadata. I've got an example of how this could work that integrates with the parquet-key-management crate in a draft PR here if anyone is interested.

I've added a new EncryptionFactory trait for dynamically generating file encryption and decryption properties, and used a registry of these in the runtime environment to allow identifying the encryption factory with a string identifier for compatibility with string based configuration.

This should be a follow up PR rather than part of this PR, but I think it's worth mentioning here as this will require adding a separate way to configure encryption rather than using the new ConfigFileDecryptionProperties and ConfigFileEncryptionProperties types in this PR. In theory, using fixed AES keys could be implemented with an EncryptionFactory implementation, but the configuration for this is a bit clunky and opaque, so I think it makes sense to have more direct support for this simple scenario.

adamreeve avatar Jun 24 '25 03:06 adamreeve

I am sorry I haven't had a chance to review this yet. It would be great if @mbutrovich could also take a look. I have this on my list to review but I haven't been able to find the time yet

I still owe this a look. I am traveling until July 7 unfortunately and likely won't get a chance to put it through its paces with Comet until after then (need to do some Comet work to get it working with this branch).

mbutrovich avatar Jun 24 '25 12:06 mbutrovich

Thank you @corwinjoy and @adamreeve -- this PR was a joy to read and review. The code is clear, well commented, and well tested ❤️ 🏆

I think we should follow up with:

  1. Improve the documentation to include the format required for encryption/decryption properties
  2. Consider adding a encyrption or similar feature flag so people who don't want support for parquet encryption can avoid bringing along the dependencies

Thanks @alamb !! I have added a commit to improve the documentation and add a CLI example. (I could use some feedback on where to put the CLI example.) I will start looking into adding an encryption feature tomorrow.

corwinjoy avatar Jun 25 '25 04:06 corwinjoy

I merged up to resolve a conflict

alamb avatar Jun 27 '25 20:06 alamb

https://github.com/apache/datafusion/actions/runs/15936215612/job/44956521967?pr=16351

That is a new failure for me

     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-19519efa72c14141)
External error: 1 errors in file /Users/runner/work/datafusion/datafusion/datafusion/sqllogictest/test_files/union_by_name.slt

1. query is expected to fail, but actually succeed:
[SQL] select x, y, z from t3 union all by name select z, y, x, 'd' as zz from t3;
at /Users/runner/work/datafusion/datafusion/datafusion/sqllogictest/test_files/union_by_name.slt:343

I'll trigger it again and see if I can reproduce

alamb avatar Jun 27 '25 21:06 alamb

Thanks again @corwinjoy / @adamreeve and everyone else. This is great

alamb avatar Jun 28 '25 13:06 alamb

Thanks @alamb much appreciated for the review and helpful feedback! We hope to have a followup PR soon with a config to make encryption optional.

corwinjoy avatar Jun 30 '25 22:06 corwinjoy

Thank you @corwinjoy

alamb avatar Jul 01 '25 18:07 alamb