GH-43535: [C++] Support AWS S3 SSE-C encryption
Rationale for this change
Server-side encryption with customer-provided keys (SSE-C) is an important security feature for AWS S3. It is useful when users want to manage the encryption keys themselves, for example when they do not want the data exposed to AWS system administrators, and it keeps objects safe even if the ACCESS_KEY and SECRET_KEY are somehow leaked. A comparison of S3 encryption options: https://www.linkedin.com/pulse/delusion-s3-encryption-benefits-ravi-ivaturi/
What changes are included in this PR?
- Add the sse_customer_key member to S3Options to support server-side encryption with customer-provided keys (SSE-C); see the usage sketch after this list.
  - sse_customer_key is expected to be 256 bits (32 bytes), per the AWS documentation.
  - sse_customer_key is expected to be the raw key rather than a base64-encoded value; Arrow computes the base64 encoding and MD5 digest on the fly.
  - By default sse_customer_key is empty, and an empty key has no impact on the existing workflow. When sse_customer_key is configured, it requires an AWS SDK version newer than 1.9.201.
- Add the tls_ca_file_path, tls_ca_dir_path, and tls_verify_certificates members to S3Options.
  - The tls_ca_file_path and tls_ca_dir_path members of S3Options override the values configured by arrow::fs::FileSystemGlobalOptions.
  - For S3, according to the AWS SDK documentation, tls_ca_file_path and tls_ca_dir_path only take effect on Linux. To support connecting to storage servers such as MinIO with self-signed certificates on non-Linux platforms, we also expose tls_verify_certificates.
- Refine the unit tests to start the MinIO server with a self-signed certificate on Linux, so the tests cover the HTTPS case on Linux and the HTTP case on non-Linux platforms.
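For illustration, here is a minimal usage sketch of the new options together. The member names follow the description above, while MakeSSECFileSystem, the endpoint, and the key value are placeholders rather than code from this PR:

```cpp
#include <memory>
#include <string>

#include "arrow/filesystem/s3fs.h"
#include "arrow/result.h"

// Assumes arrow::fs::EnsureS3Initialized() has already been called.
arrow::Result<std::shared_ptr<arrow::fs::S3FileSystem>> MakeSSECFileSystem() {
  arrow::fs::S3Options options = arrow::fs::S3Options::Defaults();
  options.region = "us-east-1";
  // Raw 256-bit (32-byte) key; Arrow derives the base64 encoding and the MD5
  // digest required by the SSE-C headers internally.
  options.sse_customer_key = std::string(32, 'k');  // placeholder key
  // New TLS knobs, e.g. for a local MinIO endpoint with a self-signed certificate.
  options.scheme = "https";
  options.endpoint_override = "localhost:9000";
  options.tls_ca_file_path = "/path/to/ca.pem";  // Linux only, per the AWS SDK docs
  options.tls_verify_certificates = true;        // set to false to skip verification
  return arrow::fs::S3FileSystem::Make(options);
}
```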
Are these changes tested?
Yes
Are there any user-facing changes?
Yes. The new members sse_customer_key, tls_ca_file_path, tls_ca_dir_path, and tls_verify_certificates are added to S3Options.
- GitHub Issue: #43535
:warning: GitHub issue #43535 has been automatically assigned in GitHub to PR creator.
@pitrou @mapleFU @felipecrv Can you help review the PR? Thanks
I took a quick look and here are some assorted thoughts:
- there are no unit tests
- only the encryption key needs to be on S3Options, not the MD5 and algorithm string (which can be computed on the fly)
Also, for the other readers wondering, I've found a useful (though possibly outdated?) comparison of S3 encryption options here: https://www.linkedin.com/pulse/delusion-s3-encryption-benefits-ravi-ivaturi/
@pitrou Thanks for your review. For #1, I will try to add unit tests for the change. For #2, yes, actually ONLY the key is needed; that's why I expose only one SetSSECKey function and three Get functions, and keep the newly added data members private. Do you suggest adding only one data member to S3Options? But if there is only one member, we may need to calculate the derived values everywhere.
@pitrou I've submitted another commit that only exposes the SSE-C key in S3Options and calculates the MD5 on the fly each time. Can you help review whether there are other comments? I will try to add the unit tests if no interface change is required. Thanks
@pitrou Can you help review the PR again? Thanks
@kou @mapleFU Can you help review the change? Thanks
In your PR, it appears that S3Options.sse_customer_key expects a base64-encoded private key. Instead, why not expect the raw 256-bit key?
It appears that boto3 (the official Python bindings for AWS) expects the 32-byte raw key; example: https://github.com/boto/boto3/issues/1941#issuecomment-487211835
@pitrou Though boto3 exposes the 32-byte key, the official C++ AWS SDK expects the base64-encoded key, and I would prefer to just pass the input through to the AWS SDK. Meanwhile, during the integration between Arrow and my own project, using the base64-encoded key may be a little easier for integration/debugging/troubleshooting purposes. Anyway, if you insist on exposing the 32-byte raw key, I'm also fine with that. @kou @mapleFU any insights?
If we accept a raw key instead of a base64-encoded key, we don't need the base64-encoded key validation in CalculateSSECKeyMD5(), right?
If so, I like the raw key approach.
Meanwhile, during the integration between Arrow and my own project, using the base64-encoded key may be a little easier for integration/debugging/troubleshooting purposes.
BTW, why is this? Is it simply because your project already has the base64-encoded key, not the raw key?
@kou A visible ASCII string may have some benefits: some systems may need to use an environment variable or even the registry / a configuration file to configure the key. Although storing such a key in a configuration file is NOT good security practice, there is still the possibility that an environment variable is used to store this information. cc @pitrou
@pitrou @kou @mapleFU any more comments for the PR? Thanks
I'll be out until 9.21; sorry for the delay. I think kou and pitrou can review this well.
@kou A visible ASCII string may have some benefits: some systems may need to use an environment variable or even the registry / a configuration file to configure the key. Although storing such a key in a configuration file is NOT good security practice, there is still the possibility that an environment variable is used to store this information.
OK. I feel that decoding base64 data is the responsibility of the application side, not Apache Arrow C++, because how to manage the key is also the application's responsibility. Could you accept the raw key instead of the base64-encoded key?
@kou @pitrou I've changed the code to expect the 32-byte raw key as the input. Please review again. Thanks
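For context, a rough sketch of what "computing the base64 and MD5 on the fly" from the raw key can look like with the AWS SDK's hashing utilities. This is illustrative only; ApplySSEC is a made-up helper name, not the code in this PR (which uses the CalculateSSECKeyMD5 helper mentioned above):

```cpp
#include <string>

#include <aws/core/utils/Array.h>
#include <aws/core/utils/HashingUtils.h>
#include <aws/s3/model/PutObjectRequest.h>

// Derive the SSE-C request headers from the raw 32-byte key.
void ApplySSEC(Aws::S3::Model::PutObjectRequest& req, const std::string& raw_key) {
  Aws::Utils::ByteBuffer key_bytes(
      reinterpret_cast<const unsigned char*>(raw_key.data()), raw_key.size());
  // The key is sent base64-encoded...
  Aws::String key_b64 = Aws::Utils::HashingUtils::Base64Encode(key_bytes);
  // ...together with the base64-encoded MD5 digest of the raw key.
  Aws::String key_md5_b64 = Aws::Utils::HashingUtils::Base64Encode(
      Aws::Utils::HashingUtils::CalculateMD5(Aws::String(raw_key.begin(), raw_key.end())));
  req.SetSSECustomerAlgorithm("AES256");
  req.SetSSECustomerKey(key_b64);
  req.SetSSECustomerKeyMD5(key_md5_b64);
}
```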
@kou @pitrou May I know if there are any other comments on the change?
@kou Can you help review and trigger the CI again? Meanwhile, I've rebased onto the main branch. Thanks
Could you rebase on main?
Could you check the S3 filesystem-related CI failures?
@kou The reason for the failure is that the MinIO test server is started with HTTP, while the SSE-C key requires HTTPS. Do you know why the MinIO server is started with HTTP rather than HTTPS? Is there any concern about self-signed certificates? I tend to start the MinIO server with a self-signed certificate for the new test case only and keep the current HTTP cases unchanged; do you think that's acceptable? Thanks
I think it's ok to switch the tests to access Minio using HTTPS with a self-signed certificate.
Do you know why the MinIO server is started with HTTP rather than HTTPS?
Just because it was easier :-) No particular reason otherwise.
We can always use HTTPS Minio.
Can we prepare a self-signed certificate in a portable way? (Do we need the openssl command for it?)
FYI: I created https://github.com/apache/arrow-flight-sql-postgresql/blob/main/dev/prepare-tls.sh for another Arrow-related project, but it doesn't work on Windows...
There is no need to generate a self-signed cert every time; it can just be stored in the repository.
In general, it's not a good idea. If we add it to this repository, someone may report it as a security vulnerability, and we would need to explain that it's not dangerous. FYI: GH-6399
(If we don't have another approach, we can use this one.)
If someone is silly enough to report a self-signed certificate used for testing purposes as a security vulnerability, then they deserve to be ignored.
CPython has stored a ton of fake certificates in its repository for decades and we don't get any complaints: https://github.com/python/cpython/tree/main/Lib/test/certdata
@pitrou Regarding using a self-signed certificate file, I have several questions that may need your further confirmation: where should the certificate file be stored? I didn't see a proper resource folder for such information; could it be put in the same folder as the test code, say arrow\cpp\src\arrow\filesystem\certs? Or do you think it's acceptable to hard-code the certificate's content as a static string in the test code and then write the PEM file into a temporary directory (this way it avoids handling relative paths on different platforms)? Thanks
Or do you think it's acceptable to hard-code the certificate's content as a static string in the test code and then write the PEM file into a temporary directory
Ah, yes, that's a good idea. You will need two strings: one for the private key, one for the cert.
Yes; according to https://github.com/minio/minio/tree/0c53d860173e98f154ce9fc117725652bd90f840/docs/tls, the file names also need to be fixed. Since you have no concerns about this approach, I will do it this way. Thanks
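For illustration, a minimal sketch of that approach (the PEM contents and the WriteMinioTlsFiles helper are placeholders; the real test would embed actual certificate and key strings):

```cpp
#include <filesystem>
#include <fstream>

// Placeholder PEM contents; the real test embeds a full self-signed
// certificate and its private key as string constants.
constexpr const char* kTestCertPem =
    "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n";
constexpr const char* kTestKeyPem =
    "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n";

// Write the embedded certificate and key into a temporary directory using the
// fixed file names MinIO expects in its certs directory.
std::filesystem::path WriteMinioTlsFiles(const std::filesystem::path& temp_dir) {
  auto certs_dir = temp_dir / "certs";
  std::filesystem::create_directories(certs_dir);
  std::ofstream(certs_dir / "public.crt") << kTestCertPem;
  std::ofstream(certs_dir / "private.key") << kTestKeyPem;
  return certs_dir;  // pass to `minio server --certs-dir <certs_dir> ...`
}
```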