
Make GC work with EMR 7.0.0

Open idanovo opened this issue 7 months ago • 12 comments

idanovo avatar May 04 '25 10:05 idanovo

E2E Test Results - Quickstart

12 passed

github-actions[bot] avatar May 04 '25 10:05 github-actions[bot]

E2E Test Results - DynamoDB Local - Local Block Adapter

13 passed, 1 skipped

github-actions[bot] avatar May 04 '25 10:05 github-actions[bot]

@treeverse/product Please note this PR changes Hadoop version from 3.2.1 to 3.3.4

idanovo avatar May 07 '25 07:05 idanovo

> Missing some context - why do we need to provide a custom credentials chain instead of working with the default one the AWS SDK provides?

@nopcoder The code changes were necessary because EMR 7.x uses AWS SDK v2 for S3 operations, while our code was built against SDK v1. The credential providers from these SDKs are loaded by different classloaders in EMR, making them incompatible types at runtime despite their similar interfaces. The reflection-based solution extracts the raw credentials without direct type casting, creating a bridge between the two systems. Without this adapter code, we'd face ClassCastExceptions when trying to use assumed-role authentication. This approach lets us maintain compatibility while the AWS ecosystem transitions to SDK v2.
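
For illustration, a minimal sketch of what such a reflection-based bridge can look like (hypothetical code, not the actual PR change; it assumes the foreign provider exposes a zero-argument credentials accessor under its SDK's usual name):

import com.amazonaws.auth.{AWSCredentials, AWSCredentialsProvider, BasicAWSCredentials}

// Pull the raw key pair out of an unknown provider via reflection and wrap it
// in an SDK v1 provider, avoiding any direct cross-classloader cast.
def bridgeCredentials(provider: AnyRef): AWSCredentialsProvider = {
  // SDK v1 names the accessor getCredentials(); SDK v2 names it resolveCredentials().
  val credsMethod = provider.getClass.getMethods
    .find(m => m.getParameterCount == 0 &&
      (m.getName == "getCredentials" || m.getName == "resolveCredentials"))
    .getOrElse(throw new IllegalArgumentException(
      s"Unsupported credentials provider: ${provider.getClass.getName}"))
  val creds = credsMethod.invoke(provider)

  // SDK v1 credentials expose getAWSAccessKeyId/getAWSSecretKey;
  // SDK v2 credentials expose accessKeyId/secretAccessKey.
  def stringField(names: String*): String = names
    .flatMap(n => creds.getClass.getMethods.find(m => m.getName == n && m.getParameterCount == 0))
    .headOption
    .map(_.invoke(creds).asInstanceOf[String])
    .getOrElse(throw new IllegalArgumentException(
      s"No credentials accessor found on ${creds.getClass.getName}"))

  val basic = new BasicAWSCredentials(
    stringField("getAWSAccessKeyId", "accessKeyId"),
    stringField("getAWSSecretKey", "secretAccessKey"))
  new AWSCredentialsProvider {
    override def getCredentials: AWSCredentials = basic
    override def refresh(): Unit = ()
  }
}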

idanovo avatar May 07 '25 13:05 idanovo

So, do we plan to keep supporting the older version, and does the reflection work with both the old and the new code?

nopcoder avatar May 07 '25 20:05 nopcoder

> So, do we plan to keep supporting the older version, and does the reflection work with both the old and the new code?

@nopcoder By the old version, do you mean EMR?

idanovo avatar May 08 '25 06:05 idanovo

> So, do we plan to keep supporting the older version, and does the reflection work with both the old and the new code?

> @nopcoder By the old version, do you mean EMR?

Yes. I want to understand why we need to use reflection if we update the package we use.

nopcoder avatar May 08 '25 06:05 nopcoder

@nopcoder I've tried to avoid using reflection, but the problem is that EMR 7.0.0 uses a Hadoop-specific credential provider (org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider) that doesn't directly implement the AWS SDK's AWSCredentialsProvider interface.

idanovo avatar May 11 '25 14:05 idanovo

@Isan-Rivkin @arielshaqed @Jonathan-Rosenberg the PR is now ready for review

idanovo avatar May 11 '25 14:05 idanovo

@Isan-Rivkin Specifically about how it was tested: I released a new Spark client and tried running it on EMR 7.0.0 (validated that objects were actually deleted this time).

idanovo avatar May 11 '25 15:05 idanovo

@Isan-Rivkin @nopcoder After adding logs that print the packages we use, I better understood the issue and how to resolve it.

What Changed in EMR 7.0.0:

In EMR 7.0.0, Amazon upgraded to Hadoop 3.3.6, which included significant changes to how AWS credentials are handled. Specifically, EMR 7.0.0 uses Hadoop's own credential provider hierarchy in the S3A filesystem. When configuring role-based auth, it uses org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider, which does not implement AWS SDK v1's AWSCredentialsProvider interface.

This change caused the following error: ClassCastException: class org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider cannot be cast to class com.amazonaws.auth.AWSCredentialsProvider
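
One way to see the shape of the fix: instead of casting, detect at runtime whether the object actually is an SDK v1 provider (a hedged sketch, not the exact lakeFS code):

import com.amazonaws.auth.AWSCredentialsProvider

// Safe downcast: on older EMR releases the Hadoop-supplied object really is an
// SDK v1 provider and matches; on EMR 7.0.0 with assumed roles,
// AssumedRoleCredentialProvider falls through to None instead of throwing
// a ClassCastException.
def asV1Provider(p: AnyRef): Option[AWSCredentialsProvider] = p match {
  case v1: AWSCredentialsProvider => Some(v1)
  case _                          => None
}

This is essentially what the Option[Any] parameter and the isInstanceOf check below implement.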

Changed parameter type to support different credential providers

Before:

def createAndValidateS3Client(
    configuration: ClientConfiguration,
    credentialsProvider: Option[AWSCredentialsProvider],
    ...
)

After:

def createAndValidateS3Client(
    configuration: ClientConfiguration,
    credentialsProvider: Option[Any], // Changed to Any
    ...
)

Modified credential provider handling

Before:

val builderWithCredentials = credentialsProvider match {
  case Some(cp) => builderWithEndpoint.withCredentials(cp)
  case None     => builderWithEndpoint
}

After:

val builderWithCredentials = 
  if (roleArn != null && !roleArn.isEmpty) {
    // Use STS to assume role
    // ...
  } else if (credentialsProvider.isDefined && credentialsProvider.get.isInstanceOf[AWSCredentialsProvider]) {
    builderWithEndpoint.withCredentials(credentialsProvider.get.asInstanceOf[AWSCredentialsProvider])
  } else {
    builderWithEndpoint.withCredentials(new DefaultAWSCredentialsProviderChain())
  }

Added role-based auth support

After applying these changes, I faced another issue: I received "Access Denied" errors when trying to access S3 resources that require role permissions.

Before: No support for role assumption

After:

val roleArn = System.getProperty("spark.hadoop.fs.s3a.assumed.role.arn")
if (roleArn != null && !roleArn.isEmpty) {
  // If we have a role ARN configured, assume that role
  logger.info(s"Assuming role: $roleArn for S3 client")
  try {
    val sessionName = "lakefs-gc-" + UUID.randomUUID().toString
    val stsProvider = new STSAssumeRoleSessionCredentialsProvider.Builder(roleArn, sessionName)
      .withLongLivedCredentialsProvider(new DefaultAWSCredentialsProviderChain())
      .build()
    
    builderWithEndpoint.withCredentials(stsProvider)
  } catch {
    ...
  }
}
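
Usage sketch (the ARN below is a placeholder): since the code reads the role ARN as a JVM system property, a local test can set it directly before creating the client; on a cluster it would be supplied through the corresponding Spark configuration.

// Placeholder role ARN, for illustration only.
System.setProperty(
  "spark.hadoop.fs.s3a.assumed.role.arn",
  "arn:aws:iam::123456789012:role/example-gc-role")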

Support for assumed roles: explanation

I added explicit support for the spark.hadoop.fs.s3a.assumed.role.arn property because:

1. This configuration is what triggers EMR 7.0.0 to use the incompatible AssumedRoleCredentialProvider.
2. By detecting this property, we can use the same role ARN to create an equivalent AWS SDK v1 credential provider.
3. The implementation uses AWS SDK v1's STSAssumeRoleSessionCredentialsProvider, avoiding the need to cast or use reflection.

idanovo avatar May 13 '25 07:05 idanovo

1. EMR 7.0.0 Uses Hadoop 3.3.6

See the [AWS EMR Release Notes](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-700-release.html), which list Hadoop 3.3.6 among the included application versions.

2. Hadoop S3A credential provider structure

Looking at [Hadoop's S3A Documentation](https://hadoop.apache.org/docs/r3.3.6/hadoop-aws/tools/hadoop-aws/index.html#Assumed_Role_Credentials):

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider</value>
</property>

3. New interface incompatibility

From [AssumedRoleCredentialProvider.java](https://github.com/apache/hadoop/blob/rel/release-3.3.6/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/auth/AssumedRoleCredentialProvider.java):

/**
 * Support for role assumption in the AWS SDK.
 */
public class AssumedRoleCredentialProvider implements AWSCredentialProviderList {
  // Note: Implements Hadoop's AWSCredentialProviderList, NOT AWS SDK's AWSCredentialsProvider
  // ...
}

The key point is that AssumedRoleCredentialProvider implements Hadoop's own interface AWSCredentialProviderList, not AWS SDK's com.amazonaws.auth.AWSCredentialsProvider.

4. How credential handling changed

In [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080), Hadoop moved to a more abstract credential system:

"This patch moves to a more abstract credential provider mechanism which allows for cleaner implementation of more complex credential providers, supporting features such as session delegation tokens and token refreshing."

This explains why in EMR 7.0.0, the credential providers follow a different class hierarchy and don't implement AWS SDK interfaces directly anymore.

The ClassCastException in our logs confirms this change in implementation:

Exception in thread "main" java.lang.ClassCastException: class org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider cannot be cast to class com.amazonaws.auth.AWSCredentialsProvider

idanovo avatar May 13 '25 07:05 idanovo