Backups via S3 using Web Identity Tokens not working

Open endzyme opened this issue 3 years ago • 5 comments

Summary

I tried backing up to S3 using IAM Role Assumption via Web Identity Tokens on EKS, and am getting errors. I tried with static AWS IAM credentials with the same policy and it works. There is an ominous WARN which may indicate why web identity token role assumption is not functioning.


Details

I am experiencing some issues with the S3 Backup and Restore configuration in the Solr Operator. I am running the 8.11 Solr image and have configured the pods on our EKS cluster with the appropriate Kubernetes Service Account, and the Service Account is annotated with the IAM Role ARN in the proper way.

There is an interesting warning message when attempting to perform a backup. The shard leader will emit the message below before every attempt:

WARN  (OverseerThreadFactory-34-thread-2-processing-n:dev-8-blue-solrcloud-2.solr:80_solr) [c:test   ] s.a.a.a.c.i.WebIdentityCredentialsUtils To use web identity tokens, the 'sts' service module must be on the class path.

When I configure the SolrCloud resource with static IAM credentials I can perform the backup, but with the assumed role via web identity token I am receiving a 403 from S3 (see error message below).

2022-09-15 15:17:21.941 WARN  (OverseerThreadFactory-34-thread-1-processing-n:dev-8-blue-solrcloud-1.solr:80_solr) [c:test   ] s.a.a.a.c.i.WebIdentityCredentialsUtils To use web identity tokens, the 'sts' service module must be on the class path.
2022-09-15 15:17:24.447 ERROR (OverseerThreadFactory-34-thread-1-processing-n:dev-8-blue-solrcloud-1.solr:80_solr) [c:test   ] o.a.s.s.S3StorageClient An AmazonServiceException was thrown! [serviceName=S3] [awsRequestId=SNIP] [httpStatus=403] [s3ErrorCode=null] [message=null]
2022-09-15 15:17:24.449 ERROR (OverseerThreadFactory-34-thread-1-processing-n:dev-8-blue-solrcloud-1.solr:80_solr) [c:test   ] o.a.s.c.a.c.OverseerCollectionMessageHandler Collection: test operation: backup failed => org.apache.solr.s3.S3Exception: An AmazonServiceException was thrown! [serviceName=S3] [awsRequestId=SNIP] [httpStatus=403] [s3ErrorCode=null] [message=null]
        at org.apache.solr.s3.S3StorageClient.handleAmazonException(S3StorageClient.java:598)
org.apache.solr.s3.S3Exception: An AmazonServiceException was thrown! [serviceName=S3] [awsRequestId=SNIP] [httpStatus=403] [s3ErrorCode=null] [message=null]
        at org.apache.solr.s3.S3StorageClient.handleAmazonException(S3StorageClient.java:598) ~[?:?]
        at org.apache.solr.s3.S3StorageClient.pathExists(S3StorageClient.java:314) ~[?:?]
        at org.apache.solr.s3.S3BackupRepository.exists(S3BackupRepository.java:200) ~[?:?]
        at org.apache.solr.cloud.api.collections.BackupCmd.createAndValidateBackupPath(BackupCmd.java:154) ~[?:?]
        at org.apache.solr.cloud.api.collections.BackupCmd.call(BackupCmd.java:94) ~[?:?]
        at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:271) ~[?:?]
        at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:524) ~[?:?]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
        at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: software.amazon.awssdk.services.s3.model.S3Exception: null (Service: S3, Status Code: 403, Request ID: SNIP, Extended Request ID: SNIP)
        at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156) ~[?:?]
        at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:106) ~[?:?]
        at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:84) ~[?:?]
        at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:42) ~[?:?]
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:95) ~[?:?]
        at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$6(BaseClientHandler.java:232) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:30) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:80) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[?:?]
        at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56) ~[?:?]
        at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:48) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:31) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37) ~[?:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26) ~[?:?]
        at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:193) ~[?:?]
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103) ~[?:?]
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:167) ~[?:?]
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82) ~[?:?]
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:175) ~[?:?]
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76) ~[?:?]
        at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45) ~[?:?]
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56) ~[?:?]
        at software.amazon.awssdk.services.s3.DefaultS3Client.headObject(DefaultS3Client.java:5080) ~[?:?]
        at software.amazon.awssdk.services.s3.S3Client.headObject(S3Client.java:9886) ~[?:?]
        at org.apache.solr.s3.S3StorageClient.pathExists(S3StorageClient.java:309) ~[?:?]
        ... 9 more

Below are the things I've tried and observed:

  • Tested the IAM Role itself for access to the target bucket
  • Confirmed that the mutating webhook is in fact modifying the SolrCloud pods with the appropriate env vars and projected service account token volume mounts
  • Confirmed that I can use those tokens to assume the role and get to the S3 bucket
  • Tested with "static" IAM credentials via the solrcloud.spec.backupRepositories.s3.credentials.credentialsFileSecret configuration (field path as reported by kubectl explain). The IAM user has the same policy as the IAM role, and backups work with this setup.
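
For anyone repeating the webhook check from the list above, here is a rough sketch of how the injected settings could be verified from inside a pod. It assumes the EKS pod identity webhook sets the AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE environment variables, and that Python 3 is available (for example in a debug container using the same service account); the function name check_irsa_env is made up for illustration. It only checks the wiring, not whether the role can actually be assumed.

```python
import os
from pathlib import Path

# Environment variables the EKS pod identity webhook is assumed to inject.
IRSA_ENV_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def check_irsa_env(env=None):
    """Return a list of problems found; an empty list means the wiring looks OK.

    `env` is any mapping of environment variables (defaults to os.environ).
    """
    env = os.environ if env is None else env
    problems = []

    # Both variables must be present and non-empty.
    for name in IRSA_ENV_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    # The projected service account token must exist and be non-empty.
    token_path = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if token_path:
        path = Path(token_path)
        if not path.is_file():
            problems.append(f"token file {token_path} does not exist")
        elif path.stat().st_size == 0:
            problems.append(f"token file {token_path} is empty")

    return problems


if __name__ == "__main__":
    for line in check_irsa_env() or ["web identity env vars and token file look OK"]:
        print(line)
```

In my case a check like this would pass, which is what points the finger at the Solr side rather than the cluster configuration.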

The warning that the 'sts' service module must be on the class path makes me think that something else needs to be loaded into the Solr modules before this will work. I have looked through the documentation, and everything seems to indicate that using IAM Roles through Service Accounts is a supported use case on EKS. The documentation also appears to indicate that I do not need to specify anything extra in modules or plugins for the SolrCloud K8s resource, because they are autoloaded when an S3 backup configuration is provided.

Any help would be appreciated!

endzyme avatar Sep 15 '22 15:09 endzyme

As far as I can tell, the sts module isn't included in the s3-repository contrib module: https://repo1.maven.org/maven2/org/apache/solr/solr-s3-repository/8.11.2/solr-s3-repository-8.11.2.pom This lacks:

<dependency>
  <groupId>software.amazon.awssdk</groupId>
  <artifactId>sts</artifactId>
  <version>2.15.20</version>
</dependency>

So this may be something to fix in the Solr repo rather than the Solr Operator repo, but it is definitely related. I'm only so-so at loading Java stuff in, so I'm not sure whether the jar can be bundled up separately and added in, or whether we need a new version of that contrib library.
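
If anyone wants to repeat the POM check without reading the XML by eye, a small sketch along these lines should do it. The function name declares_dependency and the SAMPLE_POM content are made up for illustration (the sample is not the real solr-s3-repository POM); the only thing taken as given is the standard Maven POM namespace. Note that it also matches entries under dependencyManagement, which is fine for a quick look.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by Maven POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Illustrative POM fragment only, not the published solr-s3-repository POM.
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <dependencies>
    <dependency>
      <groupId>software.amazon.awssdk</groupId>
      <artifactId>s3</artifactId>
      <version>2.15.20</version>
    </dependency>
  </dependencies>
</project>
"""


def declares_dependency(pom_xml, group_id, artifact_id):
    """Return True if the POM text declares the given groupId/artifactId."""
    root = ET.fromstring(pom_xml)
    for dep in root.findall(".//m:dependencies/m:dependency", POM_NS):
        group = dep.findtext("m:groupId", default="", namespaces=POM_NS).strip()
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS).strip()
        if group == group_id and artifact == artifact_id:
            return True
    return False


if __name__ == "__main__":
    for artifact in ("s3", "sts"):
        found = declares_dependency(SAMPLE_POM, "software.amazon.awssdk", artifact)
        print(f"{artifact}: {'declared' if found else 'missing'}")
```

To run it against the real thing, download the POM from the URL above and pass its text in place of SAMPLE_POM.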

joshsouza avatar Sep 22 '22 22:09 joshsouza

I have confirmed this by adding the following to the helm values (which propagate to the SolrCloud CRD):


podOptions:
  initContainers:
    - name: "add-sts"
      image: "alpine/curl"
      command: 
        - "curl"
        - "https://repo1.maven.org/maven2/software/amazon/awssdk/sts/2.15.54/sts-2.15.54.jar"
        - "-o"
        - "/output/sts-2.15.54.jar"
      volumeMounts: 
        - name: "sts"
          mountPath: "/output"

  volumes:
    - name: "sts"
      source:
        emptyDir: {}
      defaultContainerMount:
        name: "sts"
        mountPath: "/opt/solr/dist/sts-2.15.54.jar"
        subPath: "sts-2.15.54.jar"

Clearly this isn't a long-term solution, but it puts the sts module in the /opt/solr/dist directory on the pods, which adds it to the classpath that the Operator puts in place. I have confirmed that S3 backups work properly after doing so.
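
To double check that the init container actually dropped the jar where expected, something like the sketch below could be run wherever Python 3 and the pod's filesystem are both available. The helper name find_artifact_jars is made up, and the default directory is just the /opt/solr/dist path from the workaround. A jar being present in the directory does not by itself prove Solr loaded it; the disappearance of the WARN is the real signal.

```python
import sys
from pathlib import Path


def find_artifact_jars(directory, artifact):
    """Return sorted file names matching '<artifact>-*.jar' directly in `directory`."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob(f"{artifact}-*.jar"))


if __name__ == "__main__":
    # Directory defaults to the path used in the workaround above.
    directory = sys.argv[1] if len(sys.argv) > 1 else "/opt/solr/dist"
    jars = find_artifact_jars(directory, "sts")
    print(jars if jars else f"no sts jar found in {directory}")
```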

Should we open an issue with the main Apache Solr project to address this?

joshsouza avatar Sep 22 '22 22:09 joshsouza

Thanks @joshsouza!!

endzyme avatar Sep 22 '22 23:09 endzyme

Hey y'all thanks for this discussion! If y'all make a Solr issue (and/or PR, we always love new contributors!) then I'll help shepherd it through. I'm out for the next two weeks but this probably wouldn't make the 9.1 release anyways, so there shouldn't be any rush.

HoustonPutman avatar Sep 23 '22 15:09 HoustonPutman

@joshsouza opened https://issues.apache.org/jira/browse/SOLR-16429

risdenk avatar Sep 23 '22 17:09 risdenk

This will be fixed in the next releases (8.11.3 and 9.1.0), so I think we can go ahead and close this.

HoustonPutman avatar Oct 21 '22 16:10 HoustonPutman