GH-41719: [C++][Parquet] Cannot read encrypted parquet datasets via _metadata file
Rationale for this change
Metadata written into the _metadata file does not appear to be encrypted.
What changes are included in this PR?
This adds a code path to encrypt the _metadata file, along with a test.
Are these changes tested?
Yes
Are there any user-facing changes?
This adds a user-facing encryption_properties parameter to pyarrow.parquet.write_metadata.
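As a minimal sketch of the new parameter in use (the surrounding setup - cf, ecfg, schema, location, metadata_collector - follows the reproduction script further down in this thread):

```python
import os
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe

# Assumes cf (a pe.CryptoFactory), ecfg (a pe.EncryptionConfiguration), schema,
# location and metadata_collector were set up as in the reproduction script below.
encryption_properties = cf.file_encryption_properties(pe.KmsConnectionConfig(), ecfg)
pq.write_metadata(
    schema,
    os.path.join(location, "_metadata"),
    metadata_collector,
    encryption_properties=encryption_properties,  # new parameter added by this PR
)
```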
- GitHub Issue: #41719
@AudriusButkevicius could you check if the read metadata contains expected properties?
Hey, thanks for working on this, it looks great; however, it seems something is still missing.
You can read the metadata file, but it seems no files are actually read when reading the dataset via the metadata file, so the table ends up empty (but with the right schema).
```
(arrow) root@base ~/code/arrow # python testx.py
/tmp/tmp8qr3xv_t
Writing
pyarrow.Table
col1: int64
col2: int64
year: int64
----
col1: [[1,2,3]]
col2: [[1,2,3]]
year: [[2020,2020,2021]]
write done
Reading
pyarrow.Table
col1: int64
col2: int64
year: int16
----
col1: []
col2: []
year: []
```
Test code that reproduces the issue:
```python
import os
import tempfile
import pyarrow.parquet.encryption as pe
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow as pa
import base64

import polars as pl


class KmsClient(pe.KmsClient):
    def unwrap_key(self, wrapped_key, master_key_identifier):
        return base64.b64decode(wrapped_key)

    def wrap_key(self, key_bytes, master_key_identifier):
        return base64.b64encode(key_bytes)


def write(location):
    cf = pe.CryptoFactory(lambda *a, **k: KmsClient())

    df = pl.DataFrame({
        "col1": [1, 2, 3],
        "col2": [1, 2, 3],
        "year": [2020, 2020, 2021]
    })

    ecfg = pe.EncryptionConfiguration(
        footer_key="TEST",
        column_keys={
            "TEST": ["col2"]
        },
        double_wrapping=False,
        plaintext_footer=False,
    )

    table = df.to_arrow()
    print("Writing")
    print(table)

    parquet_encryption_cfg = ds.ParquetEncryptionConfig(
        cf, pe.KmsConnectionConfig(), ecfg
    )

    metadata_collector = []
    pq.write_to_dataset(
        table,
        location,
        partitioning=ds.partitioning(
            schema=pa.schema([
                pa.field("year", pa.int16())
            ]),
            flavor="hive"
        ),
        encryption_config=parquet_encryption_cfg,
        metadata_collector=metadata_collector
    )

    encryption_properties = cf.file_encryption_properties(pe.KmsConnectionConfig(), ecfg)
    pq.write_metadata(
        pa.schema(
            field
            for field in table.schema
            if field.name != "year"
        ),
        os.path.join(location, "_metadata"),
        metadata_collector,
        encryption_properties=encryption_properties,
    )
    print("write done")


def read(location):
    decryption_config = pe.DecryptionConfiguration(cache_lifetime=300)
    kms_connection_config = pe.KmsConnectionConfig()
    cf = pe.CryptoFactory(lambda *a, **k: KmsClient())
    parquet_decryption_cfg = ds.ParquetDecryptionConfig(
        cf, kms_connection_config, decryption_config
    )
    decryption_properties = cf.file_decryption_properties(
        kms_connection_config, decryption_config)

    pq_scan_opts = ds.ParquetFragmentScanOptions(
        decryption_config=parquet_decryption_cfg,
        # If using build from master
        decryption_properties=decryption_properties
    )
    pformat = pa.dataset.ParquetFileFormat(default_fragment_scan_options=pq_scan_opts)

    dataset = ds.parquet_dataset(
        os.path.join(location, "_metadata"),
        format=pformat,
        partitioning=ds.partitioning(
            schema=pa.schema([
                pa.field("year", pa.int16())
            ]),
            flavor="hive"
        )
    )
    print("Reading")
    print(dataset.to_table())


if __name__ == '__main__':
    location = tempfile.mkdtemp(suffix=None, prefix=None, dir=None)
    print(location)
    os.makedirs(location, exist_ok=True)

    write(location)
    print("\n")
    read(location)
```
Seems that dataset.get_fragments() doesn't return anything.
> Seems that dataset.get_fragments() doesn't return anything.

I think the C++ logic as-is doesn't store row groups; let me take a look and get back.
I think the whole premise of _metadata files (not _common_metadata) is to store row group details as well as paths, so when you perform a read via the _metadata file, it knows exactly which files and row groups to read without having to open every file in the dataset. At least this is what happens when encryption is disabled.
Although I understand the intention of this issue and the corresponding fix, I don't think the design of parquet encryption has included the _metadata summary file, because it may point to several different encrypted parquet files. It would be great if @ggershinsky could advise whether there is any defect in this use case.
BTW, the _metadata summary file is something we strive to deprecate in the new parquet v3 design. If you have any concrete use case, please let us know so we can see whether we can improve things. @AudriusButkevicius
> I don't think the design of parquet encryption has included the _metadata summary file, because it may point to several different encrypted parquet files.

Why would this be an issue? Because these files might be encrypted with different keys?
Yes, these files may have different master keys, which cannot be referenced by a single footer IMHO.
But the files can have the same key, and decryption would error out if they don't? That sounds OK to me.
A Python error is thrown here, but I'm not sure why.
My intended use of this is to reduce strain on the filesystem when reading large (many-file) datasets from a network-attached filesystem, by reading the metadata file instead of many separate files.
I also have a hard requirement for encryption, sadly, as the data is sensitive.
It would be amazing if this worked with encrypted datasets, assuming the key is the same.
I would also be OK with storing the metadata in plaintext, performing fragment filtering based on row-group stats, and then re-reading and decrypting the footers of the chosen files. Obviously this is fine for my use case but might not be in general.
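For what it's worth, a rough sketch of that plaintext-_metadata workaround, assuming the summary file is written without encryption_properties and reusing names (location, decryption_properties) from the test code above; the filter value is just an example:

```python
# Rough sketch only (not tested): assumes the _metadata summary file was written
# WITHOUT encryption_properties, while the data files themselves stay encrypted.
plain_dataset = ds.parquet_dataset(
    os.path.join(location, "_metadata"),
    partitioning=ds.partitioning(
        schema=pa.schema([pa.field("year", pa.int16())]), flavor="hive"
    ),
)

# Pick fragments using the plaintext summary (partition values / row-group stats).
selected_paths = [
    frag.path
    for frag in plain_dataset.get_fragments(filter=ds.field("year") == 2020)
]

# Re-read only the chosen files, decrypting their footers and data.
# (Partition columns from the directory layout are not reattached here.)
table = pa.concat_tables(
    [pq.read_table(path, decryption_properties=decryption_properties)
     for path in selected_paths]
)
```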
I didn't get any error when reading; it seems it just returns no data.
For the record: the community is inclined to deprecate ColumnChunk.file_path: https://lists.apache.org/thread/qprmj710yq92ybyt89m5xgtqyz3o3st2 and https://github.com/apache/parquet-format/pull/242/files#r1603234502
I'm not sure if we want to support and maintain this. cc @emkornfield @pitrou @mapleFU
Left my 2c there. Explained why I would be sad if that happened, and would probably have to re-implement the same feature.
@wgtmac I'm not sure I follow. We already have WriteMetaDataFile to produce the _metadata file, and if we just add WriteEncryptedMetadataFile to produce encrypted _metadata files, we're not really adding additional complexity (outside of WriteEncryptedMetadataFile), or am I missing something? :)
> Although I understand the intention of this issue and the corresponding fix, I don't think the design of parquet encryption has included the _metadata summary file, because it may point to several different encrypted parquet files. It would be great if @ggershinsky could advise whether there is any defect in this use case.

Yep, we haven't worked on supporting this (basically, there was no requirement; seemed heading towards deprecation).
In general, using different encryption keys for different data files is considered good security practice (mainly because there is a limit on the number of crypto operations with one key; also, the key leak scope is smaller) - that's why we generate a fresh key for each parquet file in most of the APIs (Parquet, Arrow, Spark, Iceberg, etc). However, there are obviously some low-level parquet APIs that allow passing the same key to many files - if used carefully (making sure, somehow, not to exceed the limit), this might be ok in some cases. The limit is high (~1 billion pages, something like 10TB-1PB of data), but if exceeded, the cipher breaks and the data can be decrypted.
Another option could be to create a separate key for the _metadata summary file, and manage it separately from the data file keys.
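To illustrate, a hypothetical sketch of that option on top of this PR, reusing names from the test code above (cf, table, location, metadata_collector) and assuming a second master key id, "SUMMARY", is registered in the KMS:

```python
# Hypothetical: a dedicated key for the _metadata summary file, managed separately
# from the per-data-file keys. "SUMMARY" is an assumed master key identifier.
summary_ecfg = pe.EncryptionConfiguration(
    footer_key="SUMMARY",
    double_wrapping=False,
)
summary_props = cf.file_encryption_properties(pe.KmsConnectionConfig(), summary_ecfg)

pq.write_metadata(
    pa.schema(field for field in table.schema if field.name != "year"),
    os.path.join(location, "_metadata"),
    metadata_collector,
    encryption_properties=summary_props,  # parameter added by this PR
)
```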
> mainly because there is a limit on number of crypto operations with one key

What is the theoretical limit, assuming a 256-bit AES key?
Also, if column key encryption is used, wouldn't the limit basically become irrelevant?
> mainly because there is a limit on number of crypto operations with one key
> What is the theoretical limit, assuming a 256-bit AES key?

~ a billion crypto operations (meaning a billion parquet pages)
> Also, if column key encryption is used, wouldn't the limit basically become irrelevant?

Yes, if columns are encrypted with column-specific keys. If a column is encrypted in a "footer key" mode, the counter applies. The cleanest general solution would be to create and manage a separate "summary file" key (e.g. keeping it - wrapped by a master key in KMS - in the summary file itself). Lacking that, a workaround that somehow applies the same footer key across all nodes / workers would be safe enough under certain conditions. This workaround won't be compatible with the higher-level Parquet/Arrow/etc APIs that generate a new footer key for each data file.
> ~ a billion crypto operations (meaning a billion parquet pages)

Out of curiosity, what is the underlying cause?
Perhaps I am wrong, but I assumed that the metadata collector collects metadata that is already encrypted, and that the issue mostly stems from the fact that when we write out the metadata file, we don't set the file-level encryption algorithm in it, which then causes reads of the metadata to fail because the reader doesn't expect encrypted metadata.
We could require passing decryption and encryption properties to write_metadata, which would decrypt and re-encrypt, or just set the right properties in the file footer to indicate that it's encrypted (this assumes that the collected metadata encryption keys are self-contained and do not need per-file footer keys).
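For concreteness, a purely hypothetical sketch of the first option (a decryption_properties argument to write_metadata does not exist today; this only illustrates the decrypt-and-re-encrypt idea):

```python
# NOT a real API - only illustrating the "decrypt and re-encrypt" proposal.
pq.write_metadata(
    schema,
    os.path.join(location, "_metadata"),
    metadata_collector,
    decryption_properties=decryption_properties,  # hypothetical: decrypt the collected footers
    encryption_properties=encryption_properties,  # re-encrypt them under a single footer key
)
```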
> ~ a billion crypto operations (meaning a billion parquet pages)
> Out of curiosity, what is the underlying cause?

The AES GCM math - when an attacker has 2^32 blocks encrypted with the same key (and random nonces), the key can be derived from this information. Since the vast majority of blocks in a parquet file are the page headers and the pages, the limit is a couple of billion data pages.
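For a back-of-the-envelope sense of that limit (assuming the NIST random-IV bound of 2^32 GCM invocations per key and roughly one invocation each for a page and its header):

```python
# Rough arithmetic only; the real numbers depend on page size and on how many
# non-page structures are encrypted.
invocations = 2 ** 32                       # NIST SP 800-38D limit per key with random IVs
pages = invocations // 2                    # page + page header ~= 2 invocations per page
for page_size in (8 * 1024, 1024 * 1024):   # typical parquet page sizes, ~8 KB to ~1 MB
    print(f"{pages * page_size / 1e12:.0f} TB")  # prints roughly 18 TB and ~2250 TB (~2 PB)
```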
> Perhaps I am wrong, but I assumed that the metadata collector collects metadata that is already encrypted, and that the issue mostly stems from the fact that when we write out the metadata file, we don't set the file-level encryption algorithm in it, which then causes reads of the metadata to fail because the reader doesn't expect encrypted metadata.
> We could require passing decryption and encryption properties to write_metadata, which would decrypt and re-encrypt, or just set the right properties in the file footer to indicate that it's encrypted (this assumes that the collected metadata encryption keys are self-contained and do not need per-file footer keys).

I'm not familiar with how the collector works and how the metadata files are written/read (is there a writeup on this?). Regarding parquet file encryption - the footer can be encrypted or unencrypted; inside the footer, the column metadata (stats etc) is encrypted with e.g. a column-specific key as required for its protection. The details are here.
> The AES GCM math - when an attacker has 2^32 blocks encrypted with the same key (and random nonces), the key can be derived from this information. Since the vast majority of blocks in a parquet file are the page headers and the pages, the limit is a couple of billion data pages.

Ok, I'm reading through the NIST 800-38D document (section 8), and this is apparently a statistical limit to ensure the nonce reuse probability stays below 2^-32. Am I right?
In any case, it would be nice if such limits were spelled out in https://github.com/apache/parquet-format/blob/master/Encryption.md
> The AES GCM math - when an attacker has 2^32 blocks encrypted with the same key (and random nonces), the key can be derived from this information. Since the vast majority of blocks in a parquet file are the page headers and the pages, the limit is a couple of billion data pages.
> Ok, I'm reading through the NIST 800-38D document (section 8), and this is apparently a statistical limit to ensure the nonce reuse probability stays below 2^-32. Am I right?

Yep. This is the official reference.
> In any case, it would be nice if such limits were spelled out in https://github.com/apache/parquet-format/blob/master/Encryption.md

Agreed. I'll send a patch.
Can we agree this proposal makes sense in principle?
Assuming the collector can decrypt all files and encrypt them with a new key into a _metadata file of course.
> Can we agree this proposal makes sense in principle? Assuming the collector can decrypt all files and encrypt them with a new key into a _metadata file of course.

My 2c. If I understand the proposal correctly, it does not require all parquet files to be encrypted with the same data keys. Instead, the collector process can decrypt the footer (and column metadata) of any parquet file, regardless of its data keys, because for example the collector is authorized for the footer and column master keys. Technically, this is done by getting a decryption properties object from the relevant crypto factory. Then, the collector uses the same crypto factory to create a new encryption properties object (with footer and column data keys, as required) and applies this object to all collected footers when writing them to the metadata file. Therefore, future readers can (or cannot) read those encrypted footers and column metadata/stats according to their authorization (checked automatically when calling the crypto factory, as usual).

If what I wrote is accurate, then the proposal sounds very good to me. A couple of technical points:
- On the reader side - if a reader is not authorized for a certain column, then the column_metadata/stats for this column should not be extracted from the metadata file. In other words, column metadata decryption should be reactive - done only if the column is projected in the reader query.
- If the parquet file encryption is configured with the "external key material" mode - then we need to make sure this mode works ok for the metadata file writing/reading. Maybe this mode can be simply turned off for the metadata files.
Thanks @ggershinsky, that's useful and encouraging!