datafusion
datafusion copied to clipboard
feat: Parquet modular encryption
Which issue does this PR close?
- Closes #15216.
What changes are included in this PR?
This PR adds support for encryption in DataFusion’s Parquet implementation. The changes introduce new configuration options for file encryption and decryption properties, update various components (including proto conversion, file reading/writing, and tests), and add an end-to-end encrypted Parquet example.
Are these changes tested?
Tests and examples have been added to demonstrate and test functionality versus Parquet modular encryption. These could use feedback since there may be additional DataFusion usage cases that should be covered.
Are there any user-facing changes?
Additional options have been added to allow encryption/decryption configuration. We are soliciting additional feedback on how to handle key columns in a way that best fits the existing API.
Catalog of changes via copilot
Show a summary per file
| File | Description |
|---|---|
| docs/source/user-guide/configs.md | Added documentation for new encryption and decryption properties. |
| datafusion/sqllogictest/test_files/information_schema.slt | Updated SQL logic tests to include encryption/decryption settings. |
| datafusion/proto/src/logical_plan/file_formats.rs | Added default None values for encryption properties in proto. |
| datafusion/proto-common/src/from_proto/mod.rs | Updated conversion to set encryption properties to None. |
| datafusion/datasource-parquet/src/source.rs | Passed file decryption properties from table options. |
| datafusion/datasource-parquet/src/opener.rs | Disabled page index when encryption is enabled. |
| datafusion/datasource-parquet/src/file_format.rs | Propagated decryption properties in schema and statistics fetching. |
| datafusion/core/tests/parquet/mod.rs | Added the encryption test module reference. |
| datafusion/core/tests/parquet/encryption.rs | Added tests for round-trip encryption. |
| datafusion/core/tests/parquet/custom_reader.rs | Updated custom reader to pass None for decryption properties. |
| datafusion/core/src/datasource/file_format/parquet.rs | Added extra parameters for decryption in metadata/statistics fetching. |
| datafusion/core/src/dataframe/parquet.rs | Added roundtrip test for encrypted Parquet files. |
| datafusion/common/src/file_options/parquet_writer.rs | Updated writer options to support encryption properties. |
| datafusion/common/Cargo.toml | Added the hex dependency. |
| datafusion-examples/examples/parquet_encrypted.rs | Introduced an example that reads/writes encrypted Parquet files. |
| datafusion-examples/README.md | Updated examples list to include the encrypted Parquet demo. |
| Cargo.toml | Enabled encryption feature for the parquet crate. |
@adamreeve @rok
@alamb One piece I would like to solicit feedback on is if there is a way to leverage the existing tests to more thoroughly vet encryption. What I mean by that, is that we uncovered a read bug when using filters in a query, and I worry that there could be other edge cases that might not be covered. What I would like to do is take an encrypted parquet file and then run the datafusion SQL tests over it (and maybe other operation tests). This would help to make sure that all the SQL operations are really covered. And maybe in addition, somehow double-check things like statistics and bloom filters? Anyway, I'm hoping there is a way to leverage the existing test suite to cover these cases. Any suggestions?
Thank you and @adamreeve for driving so much of the modular encryption work! I'll take a look at this branch this week and see how this might get Comet supporting modular encryption within Spark, or if any obvious gaps jump out at me.
I am sorry I haven't had a chance to review this yet. It would be great if @mbutrovich could also take a look. I have this on my list to review but I haven't been able to find the time yet
I've been experimenting with how this work could be extended to support more ways of configuring encryption beyond having fixed and known AES keys for all files. For example, data encryption keys are often randomly generated per file in multi-file datasets, and the keys are stored encrypted in the Parquet file's encryption metadata. I've got an example of how this could work that integrates with the parquet-key-management crate in a draft PR here if anyone is interested.
I've added a new EncryptionFactory trait for dynamically generating file encryption and decryption properties, and used a registry of these in the runtime environment to allow identifying the encryption factory with a string identifier for compatibility with string based configuration.
This should be a follow up PR rather than part of this PR, but I think it's worth mentioning here as this will require adding a separate way to configure encryption rather than using the new ConfigFileDecryptionProperties and ConfigFileEncryptionProperties types in this PR. In theory, using fixed AES keys could be implemented with an EncryptionFactory implementation, but the configuration for this is a bit clunky and opaque, so I think it makes sense to have more direct support for this simple scenario.
I am sorry I haven't had a chance to review this yet. It would be great if @mbutrovich could also take a look. I have this on my list to review but I haven't been able to find the time yet
I still owe this a look. I am traveling until July 7 unfortunately and likely won't get a chance to put it through its paces with Comet until after then (need to do some Comet work to get it working with this branch).
Thank you @corwinjoy and @adamreeve -- this PR was a joy to read and review. The code is clear, well commented, and well tested ❤️ 🏆
I think we should follow up with:
- Improve the documentation to include the format required for encryption/decryption properties
- Consider adding a
encyrptionor similar feature flag so people who don't want support for parquet encryption can avoid bringing along the dependencies
Thanks @alamb !! I have added a commit to improve the documentation and add a CLI example. (I could use some feedback on where to put the CLI example.) I will start looking into adding an encryption feature tomorrow.
I merged up to resolve a conflict
https://github.com/apache/datafusion/actions/runs/15936215612/job/44956521967?pr=16351
That is a new failure for me
Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-19519efa72c14141)
External error: 1 errors in file /Users/runner/work/datafusion/datafusion/datafusion/sqllogictest/test_files/union_by_name.slt
1. query is expected to fail, but actually succeed:
[SQL] select x, y, z from t3 union all by name select z, y, x, 'd' as zz from t3;
at /Users/runner/work/datafusion/datafusion/datafusion/sqllogictest/test_files/union_by_name.slt:343
I'll trigger it again and see if I can reproduce
Thanks again @corwinjoy / @adamreeve and everyone else. This is great
Thanks @alamb much appreciated for the review and helpful feedback! We hope to have a followup PR soon with a config to make encryption optional.
Thank you @corwinjoy