
S3 source - allow cross region access

Open brianmaresca opened this issue 1 year ago • 1 comments

Is your feature request related to a problem? Please describe.

Currently, I do not see any way to have a single pipeline consume an S3 source (with SQS) for S3 buckets that are in different regions. It would be nice to have this ability.

Example scenario:

  • two S3 buckets, one in us-west-2 and one in us-east-1
  • each bucket has event notifications configured with an SNS topic in its own region
  • a single SQS queue in us-east-1 that is subscribed to both topics above (one topic in us-east-1, the other in us-west-2)
  • configuration yaml:
       source:
         s3:
           notification_type: "sqs"
           codec:
             newline:
           sqs:
             queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/asdf"
           bucket_owners:
             my-bucket-in-us-west-2: 210987654321
             my-bucket-in-us-east-1: 123456789012
           aws:
             sts_role_arn: "arn:aws:iam::123456789012:role/asdf"
             region: us-east-1

With the above configuration, everything goes smoothly for us-east-1. However, the pipeline fails to get objects from the us-west-2 bucket because the S3 client is configured for us-east-1. The (not very informative) error log is:

    [s3-source-sqs-1] ERROR org.opensearch.dataprepper.plugins.source.s3.SqsWorker - Error processing from S3: null (Service: S3, Status Code: 400, Request ID: xxxx, Extended Request ID: xxxx)

Describe the solution you'd like

Enable cross-region access on the S3 client (or add an option to enable it) so that it can download objects from buckets in regions other than the one defined in the YAML config. See https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/s3-cross-region.html.

Potential solution: add .crossRegionAccessEnabled(true) to createS3Client in S3ClientBuilderFactory:

    public S3Client createS3Client() {
        LOG.info("Creating S3 client");
        return S3Client.builder()
                .crossRegionAccessEnabled(true)
                .region(s3SourceConfig.getAwsAuthenticationOptions().getAwsRegion())
                .credentialsProvider(credentialsProvider)
                .overrideConfiguration(ClientOverrideConfiguration.builder()
                        .retryPolicy(retryPolicy -> retryPolicy.numRetries(5).build())
                        .build())
                .build();
    }

Describe alternatives you've considered (Optional)

Using a separate pipeline and SQS queue for each bucket that is in a different region. But this feels silly: an extra SQS queue, an extra pipeline, and duplicated configuration.

brianmaresca avatar Apr 26 '24 18:04 brianmaresca

@brianmaresca, thank you for creating this detailed issue. It seems you are familiar with the solution. Would you be interested in creating a PR contribution for it?

dlvenable avatar Apr 30 '24 19:04 dlvenable

I'm interested in solving this by using the region information from the SQS queue. This can allow us to avoid two calls to the S3 API to load the data.

Additionally, we can perform STS authentication for the desired region.

Currently we use the region defined in the region property, for example:

source:
  s3:
    aws:
      sts_role_arn: arn:aws:iam::123456789012:role/MyRole
      region: us-east-1

We can use the STS region for the target S3 bucket instead.
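
A rough sketch of how per-region clients could be managed, not an actual Data Prepper implementation. It assumes the bucket's region is known before the object is fetched (for instance from the awsRegion field that S3 event notification records carry) and that one client per region is cached and reused. RegionalClientCache and its methods are hypothetical names; the client type is left generic so the sketch stands alone without the AWS SDK.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of Data Prepper): lazily creates and caches
 * one client per AWS region so that each bucket is read with a client
 * configured for that bucket's region.
 */
final class RegionalClientCache<C> {
    private final Map<String, C> clientsByRegion = new ConcurrentHashMap<>();
    private final String defaultRegion;
    private final Function<String, C> clientFactory;

    RegionalClientCache(String defaultRegion, Function<String, C> clientFactory) {
        this.defaultRegion = defaultRegion;
        this.clientFactory = clientFactory;
    }

    /**
     * Returns the cached client for the given region, creating it on first use.
     * Falls back to the configured default region when none is supplied.
     */
    C forRegion(String region) {
        String effectiveRegion =
                (region == null || region.isEmpty()) ? defaultRegion : region;
        return clientsByRegion.computeIfAbsent(effectiveRegion, clientFactory);
    }

    /** Number of distinct regional clients created so far. */
    int size() {
        return clientsByRegion.size();
    }
}
```

With the AWS SDK for Java 2.x, the factory could plausibly be something like region -> S3Client.builder().region(Region.of(region)).credentialsProvider(credentialsProvider).build(), where credentialsProvider is whatever the pipeline already uses. Whether the credentials should also be obtained through a region-specific STS client is left open here.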

dlvenable avatar Sep 06 '24 18:09 dlvenable