hadoop icon indicating copy to clipboard operation
hadoop copied to clipboard

HADOOP-18708: Support S3 Client Side Encryption(CSE) With AWS SDK V2

Open shameersss1 opened this issue 1 year ago • 44 comments

Description of PR

This commit adds support for S3 client side encryption (CSE). CSE can configured in two modes CSE-KMS where keys are provided by AWS KMS and CSE-CUSTOM where custom keys are provided by implementing custom keyring

CSE is implemented using S3EncryptionClient (V3 client) and additional configurations (mentioned below) were added to make it compatible with the older encryption client V1 and V2 which is turned OFF by default.

Inorder to have compatibility with V1 client the following operations are done.

  1. V1 client pads extra bytes in multiple of 16 i.e if the file size is 12 bytes, 4bytes are padded to make it multiple of 16. Inorder to get the unencrypted file size of such S3 object ranged S3 GET call is made
  2. V1/V2 client supports storing encrypted metadata in instruction file (.instruction) and hence those files are skipped during listing.
  3. Unlike V1/V2 client V3 client does not support reading unencrypted object, Additional s3 client (base client) is introduced to read mix of encrypted and unencrypted s3 objects.

Default Behavior

The configurations to make it backward compatible is turned OFF by default considering the performance implications. The default behavior is as follows

  1. The unencrypted file size is computed by simply subtracting 16 bytes from the file size.
  2. When there is a mix of unencrypted and encrypted s3 objects, The client fails.

This PR is based on the initial work done by @ahmarsuhail as part of https://github.com/apache/hadoop/pull/6164

How was this patch tested?

  1. Tested in us-east-1 with mvn -Dparallel-tests -DtestsThreadCount=16 clean verify.
  2. Added integration test for CSE-KMS and CSE-CUSTOM

shameersss1 avatar Jun 12 '24 10:06 shameersss1

@steveloughran @ahmarsuhail Could you please review the changes ?

shameersss1 avatar Jun 13 '24 05:06 shameersss1

FYI, Jira on create performance. I'm worrying about how to remove/delay client configuration, so adding a new client makes things way worse: HADOOP-19205

steveloughran avatar Jun 14 '24 16:06 steveloughran

FYI, Jira on create performance. I'm worrying about how to remove/delay client configuration, so adding a new client makes things way worse: HADOOP-19205

by default we are not initializing new s3 client. It is done under a config to provide backward compatibility only when required.

shameersss1 avatar Jun 14 '24 17:06 shameersss1

It's still happening in FileSystem.initialize().

If I get time this week I'll do my PoC of lazy creation of transfer manager and async client, as that's doubled startup time already. all of that will be moved behind S3Store with only the copy() methods exposed.

Ultimately I want the S3Client itself hidden behind that API, so here

  • new unencryptedS3 operations getLengthOfEncryptedObject(path), ..., would trigger demand creation of the method
  • accessor to this passed down.

Actually, createFileStatusFromListingEntry() could be part of S3AStore took, somehow. all the config information for the creation, especially cSE flags, would be in the store config, so listing wouldn't need it

steveloughran avatar Jun 17 '24 09:06 steveloughran

#6892 is the draft pr with lazy creation; a new s3 client would go in there, but ideally we'd actually push the listing/transformation code into S3AStore so the extra client would never be visible outside it.

steveloughran avatar Jun 18 '24 16:06 steveloughran

#6892 is the draft pr with lazy creation; a new s3 client would go in there, but ideally we'd actually push the listing/transformation code into S3AStore so the extra client would never be visible outside it.

@steveloughran - I really like the idea of lazy initializing async client and new interface for creating s3 client. I will wait for your changes to get in and refactor my code based on it. By this way i could lazy initialize unencryted s3 client as well when the config is enabled.

shameersss1 avatar Jun 19 '24 05:06 shameersss1

exactly. when its not needed: no penalty.

steveloughran avatar Jun 19 '24 11:06 steveloughran

@steveloughran @ahmarsuhail I have rebased the changes on top of https://github.com/apache/hadoop/pull/6892 . Please review the changes

Thanks

shameersss1 avatar Jul 08 '24 07:07 shameersss1

@steveloughran I really appreciate your time and effort to review this changes. Indeed it was a detailed review. I have tried to address all your comments and raised a new revision (I hope i have covered all your comments).

Please take a look

Thanks

shameersss1 avatar Jul 13 '24 06:07 shameersss1

@shameersss1 unless you have no choice, please try not to rebase and force push prs once reviews have started. makes it a lot harder to see what has changed

steveloughran avatar Jul 15 '24 13:07 steveloughran

@shameersss1 unless you have no choice, please try not to rebase and force push prs once reviews have started. makes it a lot harder to see what has changed

Yeah, I had to rebase for merge conflicts in hadoop-common module. This commit (https://github.com/apache/hadoop/pull/6884/commits/838dc24c88cb1d20503c19f99c2886bcbb2c1e25) contains the addressing of review comments.

shameersss1 avatar Jul 22 '24 04:07 shameersss1

@steveloughran - Gentle reminder for the review Thanks

shameersss1 avatar Jul 28 '24 06:07 shameersss1

@shameersss1 what do you think of @raphaelazzolini 's comment

steveloughran avatar Aug 01 '24 10:08 steveloughran

@shameersss1 what do you think of @raphaelazzolini 's comment

I understand @raphaelazzolini 's concern of having many isCSEEnabled conditions especially in the listing methods. If i understand it correctly the ask is to introduce to handler interface pattern as follows

`interface ListingHandler { Listing getListing(Path path, boolean includeSelf); }

class CSEEnabledListingHandler implements ListingHandler { @Override public Listing getListing(Path path, boolean includeSelf) { return includeSelf ? new Listing.AcceptAllButS3nDirsAndCSEInstructionFile() : new Listing.AcceptAllButSelfAndS3nDirsAndCSEInstructionFile(path); } }

class CSEDisabledListingHandler implements ListingHandler { @Override public Listing getListing(Path path, boolean includeSelf) { return includeSelf ? new Listing.AcceptAllButS3nDirs() : new Listing.AcceptAllButSelfAndS3nDirs(path); } }

@Retries.RetryTranslated public RemoteIterator<S3ALocatedFileStatus> listFilesAndDirectoryMarkers( final Path path, final S3AFileStatus status, final boolean includeSelf) throws IOException { auditSpan.activate(); ListingHandler handler = isCSEEnabled ? new CSEEnabledListingHandler() : new CSEDisabledListingHandler(); return innerListFiles(path, true, handler.getListing(path, includeSelf), status); } `

I agree handler interface strategy will be more elegant. I will make the required code changes. I see there are conflicting changes which requires rebase and force push

shameersss1 avatar Aug 02 '24 06:08 shameersss1

@steveloughran - I have raised a new revision. Please do review the same. Thanks

shameersss1 avatar Aug 02 '24 10:08 shameersss1

:confetti_ball: +1 overall

Vote Subsystem Runtime Logfile Comment
+0 :ok: reexec 17m 3s Docker mode activated.
_ Prechecks _
+1 :green_heart: dupname 0m 1s No case conflicting files found.
+0 :ok: codespell 0m 0s codespell was not available.
+0 :ok: detsecrets 0m 0s detect-secrets was not available.
+0 :ok: xmllint 0m 0s xmllint was not available.
+0 :ok: markdownlint 0m 0s markdownlint was not available.
+1 :green_heart: @author 0m 0s The patch does not contain any @author tags.
+1 :green_heart: test4tests 0m 0s The patch appears to include 10 new or modified test files.
_ trunk Compile Tests _
+0 :ok: mvndep 14m 21s Maven dependency ordering for branch
+1 :green_heart: mvninstall 32m 35s trunk passed
+1 :green_heart: compile 17m 32s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 :green_heart: compile 16m 1s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 :green_heart: checkstyle 4m 30s trunk passed
+1 :green_heart: mvnsite 3m 23s trunk passed
+1 :green_heart: javadoc 2m 35s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 :green_heart: javadoc 2m 22s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+0 :ok: spotbugs 0m 44s branch/hadoop-project no spotbugs output file (spotbugsXml.xml)
+1 :green_heart: shadedclient 34m 46s branch has no errors when building and testing our client artifacts.
-0 :warning: patch 35m 12s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+0 :ok: mvndep 1m 7s Maven dependency ordering for patch
+1 :green_heart: mvninstall 1m 45s the patch passed
+1 :green_heart: compile 16m 48s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 :green_heart: javac 16m 48s the patch passed
+1 :green_heart: compile 16m 21s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 :green_heart: javac 16m 21s the patch passed
+1 :green_heart: blanks 0m 0s The patch has no blanks issues.
-0 :warning: checkstyle 4m 21s /results-checkstyle-root.txt root: The patch generated 2 new + 26 unchanged - 0 fixed = 28 total (was 26)
+1 :green_heart: mvnsite 3m 23s the patch passed
+1 :green_heart: javadoc 2m 31s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 :green_heart: javadoc 2m 21s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+0 :ok: spotbugs 0m 38s hadoop-project has no data from spotbugs
+1 :green_heart: shadedclient 35m 4s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 :green_heart: unit 0m 38s hadoop-project in the patch passed.
+1 :green_heart: unit 19m 37s hadoop-common in the patch passed.
+1 :green_heart: unit 2m 58s hadoop-aws in the patch passed.
+1 :green_heart: asflicense 1m 6s The patch does not generate ASF License warnings.
267m 6s
Subsystem Report/Notes
Docker ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6884/8/artifact/out/Dockerfile
GITHUB PR https://github.com/apache/hadoop/pull/6884
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint
uname Linux c24e8baf0f29 5.15.0-106-generic #116-Ubuntu SMP Wed Apr 17 09:17:56 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 069ec479c1aaa76f99acc99eeca572990d1da3ba
Default Java Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6884/8/testReport/
Max. process+thread count 2742 (vs. ulimit of 5500)
modules C: hadoop-project hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6884/8/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

hadoop-yetus avatar Aug 02 '24 14:08 hadoop-yetus

@steveloughran and @ahmarsuhail - I have addressed your comments. Could you please re-review ?

Had to force push

hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

shameersss1 avatar Aug 08 '24 10:08 shameersss1

@steveloughran @ahmarsuhail request for rereview Thanks

shameersss1 avatar Aug 12 '24 07:08 shameersss1

@ahmarsuhail Thanks a lot for your review. I have addressed the additional/minor comments. @steveloughran - Can you also please take a look ?

shameersss1 avatar Aug 16 '24 12:08 shameersss1

@steveloughran - Are we good to take this forward ? Please let me know if there is anything else to address

shameersss1 avatar Aug 25 '24 08:08 shameersss1

@steveloughran - Gentle reminder for review

shameersss1 avatar Sep 02 '24 05:09 shameersss1

@steveloughran - Gentle reminder for review

shameersss1 avatar Sep 11 '24 10:09 shameersss1

@shameersss1 you have to fix the conficts first

i am distracted right now by SDK regressions; not looking at anything else

steveloughran avatar Sep 16 '24 11:09 steveloughran

@steveloughran - I have resolved the merge conflict and force pushed.

shameersss1 avatar Sep 24 '24 06:09 shameersss1

@steveloughran - Due to changes spanning accross 50 files, This PR is prone to merge conflicts very often if left opened for a while. Could you please review this ?

shameersss1 avatar Oct 08 '24 06:10 shameersss1

:broken_heart: -1 overall

Vote Subsystem Runtime Logfile Comment
+0 :ok: reexec 1m 10s Docker mode activated.
_ Prechecks _
+1 :green_heart: dupname 0m 1s No case conflicting files found.
+0 :ok: codespell 0m 0s codespell was not available.
+0 :ok: detsecrets 0m 0s detect-secrets was not available.
+0 :ok: xmllint 0m 0s xmllint was not available.
+0 :ok: markdownlint 0m 0s markdownlint was not available.
+1 :green_heart: @author 0m 0s The patch does not contain any @author tags.
+1 :green_heart: test4tests 0m 0s The patch appears to include 17 new or modified test files.
_ trunk Compile Tests _
+0 :ok: mvndep 0m 20s Maven dependency ordering for branch
-1 :x: mvninstall 0m 23s /branch-mvninstall-root.txt root in trunk failed.
-1 :x: compile 0m 23s /branch-compile-root-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt root in trunk failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
-1 :x: compile 0m 40s /branch-compile-root-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt root in trunk failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05.
-0 :warning: checkstyle 0m 24s /buildtool-branch-checkstyle-root.txt The patch fails to run checkstyle in root
-1 :x: mvnsite 3m 16s /branch-mvnsite-hadoop-common-project_hadoop-common.txt hadoop-common in trunk failed.
+1 :green_heart: javadoc 1m 53s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 :green_heart: javadoc 1m 16s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
-1 :x: spotbugs 0m 13s /branch-spotbugs-hadoop-common-project_hadoop-common.txt hadoop-common in trunk failed.
+0 :ok: spotbugs 0m 24s branch/hadoop-project no spotbugs output file (spotbugsXml.xml)
+1 :green_heart: shadedclient 38m 44s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 :ok: mvndep 6m 25s Maven dependency ordering for patch
-1 :x: mvninstall 0m 14s /patch-mvninstall-hadoop-common-project_hadoop-common.txt hadoop-common in the patch failed.
-1 :x: mvninstall 0m 20s /patch-mvninstall-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch failed.
+1 :green_heart: compile 17m 38s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
-1 :x: javac 17m 38s /results-compile-javac-root-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt root-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 generated 35 new + 0 unchanged - 0 fixed = 35 total (was 0)
+1 :green_heart: compile 16m 29s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
-1 :x: javac 16m 29s /results-compile-javac-root-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt root-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05 with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 generated 44 new + 0 unchanged - 0 fixed = 44 total (was 0)
+1 :green_heart: blanks 0m 0s The patch has no blanks issues.
-0 :warning: checkstyle 4m 19s /results-checkstyle-root.txt root: The patch generated 34 new + 0 unchanged - 0 fixed = 34 total (was 0)
-1 :x: mvnsite 0m 43s /patch-mvnsite-hadoop-common-project_hadoop-common.txt hadoop-common in the patch failed.
-1 :x: mvnsite 0m 47s /patch-mvnsite-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch failed.
+1 :green_heart: javadoc 2m 44s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 :green_heart: javadoc 2m 23s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+0 :ok: spotbugs 0m 38s hadoop-project has no data from spotbugs
-1 :x: spotbugs 0m 34s /patch-spotbugs-hadoop-common-project_hadoop-common.txt hadoop-common in the patch failed.
-1 :x: spotbugs 0m 43s /patch-spotbugs-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch failed.
+1 :green_heart: shadedclient 34m 43s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 :green_heart: unit 0m 39s hadoop-project in the patch passed.
-1 :x: unit 0m 37s /patch-unit-hadoop-common-project_hadoop-common.txt hadoop-common in the patch failed.
-1 :x: unit 0m 37s /patch-unit-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch failed.
+1 :green_heart: asflicense 1m 6s The patch does not generate ASF License warnings.
150m 26s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6884/17/artifact/out/Dockerfile
GITHUB PR https://github.com/apache/hadoop/pull/6884
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint
uname Linux ade6b4e278fa 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 37eead58f3d4350a560b3ef01c1e629ecd301bb7
Default Java Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6884/17/testReport/
Max. process+thread count 551 (vs. ulimit of 5500)
modules C: hadoop-project hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6884/17/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

hadoop-yetus avatar Oct 08 '24 11:10 hadoop-yetus

I will do this and other pending stuff once we've got 3.4.1 out the door -that is the only thing I'm focused on right now. The best thing you could do right now is help qualify that in your environments.

steveloughran avatar Oct 10 '24 15:10 steveloughran

@steveloughran Sure i would gladly be part of hadoop-3.4.1 release. Is there any specific thing you want me do? Should i run all the S3A integration tests for the 3.4.1 release candidate branch ? If you could you please point me to the same and the place where i should update the results.

shameersss1 avatar Oct 11 '24 04:10 shameersss1

@shameersss1 I'll send you the steps on how to do this

ahmarsuhail avatar Oct 11 '24 05:10 ahmarsuhail

@ahmarsuhail - Thanks Ahmar for the steps, I have signed off on ARM64 EC2 machine.

shameersss1 avatar Oct 18 '24 05:10 shameersss1