arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-43535: [C++] Support the AWS S3 SSE-C encryption

Open ripplehang opened this issue 1 year ago • 54 comments
trafficstars

Rationale for this change

server-side encryption with customer-provided keys is an important security feature for aws s3, it's useful when user want to manager the encryption key themselves, say, they don't want the data to be exposed to the aws system admin, and ensure the object is safe even the ACCESS_KEY and SECRET_KEY is somehow leaked. Some comparison of S3 encryption options : https://www.linkedin.com/pulse/delusion-s3-encryption-benefits-ravi-ivaturi/

What changes are included in this PR?

  1. Add the sse_customer_key member for S3Options to support server-side encryption with customer-provided keys (SSE-C keys).

    • The sse_customer_key was expected to be 256 bits (32 bytes) according to aws doc
    • The sse_customer_key was expected to be the raw key rather than base64 encoded value, arrow would calculate the base64 and MD5 on the fly.
    • By default the sse_customer_key is empty, and when the sse_customer_key is empty, there is no impact on the existing workflow. When the sse_customer_key is configured, it would require the aws sdk version to newer than 1.9.201.
  2. Add the tls_ca_file_path, tls_ca_dir_path and tls_verify_certificates members for S3Options.

    • the tls_ca_file_path, tls_ca_dir_path member for S3Options would override the value configured by arrow::fs::FileSystemGlobalOptions.
    • for s3, according to aws sdk doc, the tls_ca_file_path and tls_ca_dir_path only take effect in Linux, in order to support connect to the the storage server like minio with self-signed certificates on non-linux platform, we expose the tls_verify_certificates.
  3. Refine the unit test to start the minio server with self-signed certificate on linux platform, so the unit test could cover the https case on linux, and http case on non-linux platform.

Are these changes tested?

Yes

Are there any user-facing changes?

yes, the new member sse_customer_key tls_ca_file_path, tls_ca_dir_path and tls_verify_certificates added for S3Options

  • GitHub Issue: #43535

ripplehang avatar Aug 07 '24 08:08 ripplehang

:warning: GitHub issue #43535 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Aug 07 '24 08:08 github-actions[bot]

@pitrou @mapleFU @felipecrv can you help to review the PR, Thanks

ripplehang avatar Aug 22 '24 08:08 ripplehang

I took a quick look and here are some assorted thoughts:

  • there are no unit tests
  • only the encryption key needs to be on S3Options, not the MD5 and algorithm string (which can be computed on the fly)

pitrou avatar Aug 22 '24 08:08 pitrou

Also, for the other readers wondering, I've found a useful (though possibly outdated?) comparison of S3 encryption options here: https://www.linkedin.com/pulse/delusion-s3-encryption-benefits-ravi-ivaturi/

pitrou avatar Aug 22 '24 08:08 pitrou

I took a quick look and here are some assorted thoughts:

@pitrou Thanks for your review, for #1, i would try to add ut for the change, for #2, yes, actually ONLY the Key is needed, that's why i only expose one SetSSECKey function, and three Get function, and private the new added datamember. do you suggest to only add one datamember for S3Options? but if there is only one memeber, we may needs to calculate everywhere?

ripplehang avatar Aug 23 '24 08:08 ripplehang

@pitrou I've submitted another commit to only expose the ssec-key in S3Options, and then calculate the MD5 every time on the fly, can you help to review whether there are other comments? I would try to add the ut if there is no interface change required, Thanks

ripplehang avatar Aug 23 '24 11:08 ripplehang

@pitrou Can you help to review the PR again?Thanks

ripplehang avatar Aug 27 '24 07:08 ripplehang

@kou @mapleFU can you help to review the change, Thanks

ripplehang avatar Aug 30 '24 09:08 ripplehang

In your PR, it appears that S3Options.sse_customer_key expects a base64-encoded private key. Instead, why not expect the raw 256-bit key?

pitrou avatar Sep 02 '24 12:09 pitrou

It appears that boto3 (the officiel Python bindings for AWS) expects the 32-byte raw key, example: https://github.com/boto/boto3/issues/1941#issuecomment-487211835

pitrou avatar Sep 02 '24 13:09 pitrou

In your PR, it appears that S3Options.sse_customer_key expects a base64-encoded private key. Instead, why not expect the raw 256-bit key?

@pitrou though the boto3 exposed the 32-bytes key, the offcial c++ aws sdk expected the base64 encoded key, I would prefer to just passthrough the input to the aws sdk. Meanwhile, during the integration between the arrow and my own project, using the base64encoded key may be a little more easy for integration/debug/troubleshooting purpose. Anyway, if you insist on exposing the 32 bit raw key, I'm also fine with that. @kou @mapleFU any insights?

ripplehang avatar Sep 03 '24 07:09 ripplehang

If we accept raw key instead of base64 encoded key, we don't need base64 encoded key validation in CalculateSSECKeyMD5(), right? If so, I like raw key approach.

Meanwhile, during the integration between the arrow and my own project, using the base64encoded key may be a little more easy for integration/debug/troubleshooting purpose.

BTW, why is this? Is it simply because your project already has base64 encoded key not raw key?

kou avatar Sep 03 '24 23:09 kou

If we accept raw key instead of base64 encoded key, we don't need base64 encoded key validation in CalculateSSECKeyMD5(), right? If so, I like raw key approach.

Meanwhile, during the integration between the arrow and my own project, using the base64encoded key may be a little more easy for integration/debug/troubleshooting purpose.

BTW, why is this? Is it simply because your project already has base64 encoded key not raw key?

@kou visible ascii string may has some benefits, some system may needs to use the environment variable or even the Registry/Configure fille to configure the key, though the usage for the configure fie to store such a key is NOT the security practice, there do has the possibility to use the environment variable to store such information. cc @pitrou

ripplehang avatar Sep 04 '24 01:09 ripplehang

@pitrou @kou @mapleFU any more comments for the PR? Thanks

ripplehang avatar Sep 09 '24 12:09 ripplehang

I'll be out until 9.21, sorry for delaying. I think kou and pitrou can well review this

mapleFU avatar Sep 09 '24 13:09 mapleFU

@kou visible ascii string may has some benefits, some system may needs to use the environment variable or even the Registry/Configure file to configure the key, though the usage for the configure fie to store such a key is NOT the security practice, there do has the possibility to use the environment variable to store such information.

OK. I feel that decoding base64 data is a responsibility of application side not Apache Arrow C++. Because how to manage key is also a responsibility of application side. Could you accept raw key not base64 encoded key?

kou avatar Sep 10 '24 02:09 kou

@kou @pitrou I've changed the code to expect the 32 bytes raw key as the input. please review again. Thanks

ripplehang avatar Sep 13 '24 06:09 ripplehang

@kou @pitrou May i know is there any other comments for the change?

ripplehang avatar Sep 20 '24 01:09 ripplehang

@kou Can you help to review and help to trigger the CI again? Meanwhile, I've rebased the main branch,thanks

ripplehang avatar Sep 20 '24 09:09 ripplehang

Could you rebase on main?

Could you check S3 filesystem related CI failures?

@kou The reason for the failure is, the minio test server is start with http, while the SSEC-Key would require https. do you know why the minio server is started with http rather than https? is there any concern for the self-signed certificate? I tend to start the minio server with self-signed certificate for the new test case only, and keep the current http case unchanged, do you think it's acceptable?Thanks

ripplehang avatar Sep 20 '24 10:09 ripplehang

I think it's ok to switch the tests to access Minio using HTTPS with a self-signed certificate.

pitrou avatar Sep 20 '24 11:09 pitrou

do you know why the minio server is started with http rather than https?

Just because it was easier :-) No particular reason otherwise.

pitrou avatar Sep 20 '24 11:09 pitrou

We can always use HTTPS Minio. Can we prepare a self-signed certificate with portable way? (Do we need openssl command for it?)

kou avatar Sep 20 '24 20:09 kou

FYI: I created https://github.com/apache/arrow-flight-sql-postgresql/blob/main/dev/prepare-tls.sh for other Arrow related project but this is not for Windows...

kou avatar Sep 20 '24 20:09 kou

There is no need to generate a self-signed cert everytime, it can just be stored in the repository.

pitrou avatar Sep 20 '24 21:09 pitrou

In general, it's not a good idea. If we add it to this repository, someone may report it a security vulnerability. We need to explain that it's not danger. FYI: GH-6399

(If we don't have another approach, we can use the approach.)

kou avatar Sep 21 '24 05:09 kou

If someone is silly enough to report a self-signed certificate used for testing purposes as a security vulnerability, then they deserve to be ignored.

CPython has stored a ton of fake certificates in its repository for decades and we don't get any complaints: https://github.com/python/cpython/tree/main/Lib/test/certdata

pitrou avatar Sep 21 '24 09:09 pitrou

@pitrou regarding to use the self-signed certificate file, i've serval questions that may needs your further confirm: where should the certificate file be stored? I didn't see a proper resource folder to store such information, could it be put it in the same folder as the test code, say arrow\cpp\src\arrow\filesystem\certs ? Or, do you think it's acceptable to hard code the static string(certificate's content) in the test code, and then write the pem file into the tempory directly(in this way, it could avoid the relative path handling in different platform) Thanks

ripplehang avatar Sep 24 '24 09:09 ripplehang

Or, do you think it's acceptable to hard code the static string(certificate's content) in the test code, and then write the pem file into the tempory directly

Ah, yes, that's a good idea. You will need two strings: one for the private key, one for the cert.

pitrou avatar Sep 24 '24 12:09 pitrou

Or, do you think it's acceptable to hard code the static string(certificate's content) in the test code, and then write the pem file into the tempory directly

Ah, yes, that's a good idea. You will need two strings: one for the private key, one for the cert.

yes, according to https://github.com/minio/minio/tree/0c53d860173e98f154ce9fc117725652bd90f840/docs/tls, the file name also need to be fixed. Since you have no concern on this approach, i would try to do it in this way, Thanks

ripplehang avatar Sep 25 '24 01:09 ripplehang