EKS IRSA with Elasticsearch uses incorrect role

Open pebrc opened this issue 1 year ago • 3 comments

@pebrc Sorry, I'm not sure whether I should open a new ticket, but this one might need to be reopened.

We are running Elasticsearch on EKS with IRSA. The ES version is 8.14.1 (Build: docker/93a57a1a76f556d8aee6a90d1a95b06187501310/2024-06-10T23:35:17.114581191Z, JVM: 22.0.1), which should contain the fix.

I believe this is the relevant part of the ES manifest (we're deploying the ES CRD with Helm, but that shouldn't matter here):

              env:
                - name: AWS_ROLE_SESSION_NAME
                  value: "{{ include "es.name" .}}-cluster-elasticsearch"
                - name: AWS_WEB_IDENTITY_TOKEN_FILE
                  value: "/usr/share/elasticsearch/config/repository-s3/aws-web-identity-token-file"
                - name: AWS_ROLE_ARN
                  value: {{ .Values.serviceAccount.roleArn }}
              volumeMounts:
                - name: aws-iam-token
                  mountPath: /usr/share/elasticsearch/config/repository-s3
          volumes:
            - name: aws-iam-token
              projected:
                defaultMode: 420
                sources:
                  - serviceAccountToken:
                      audience: sts.amazonaws.com
                      expirationSeconds: 86400
                      path: aws-web-identity-token-file

I can verify that the symlink is being created on the node:

elasticsearch@es-prd-es-data-2:~$ ls -al /usr/share/elasticsearch/config/repository-s3
total 4
drwxrwsrwt 3 root elasticsearch  100 Oct  7 14:08 .
drwxrwsrwx 9 root elasticsearch 4096 Oct  7 14:08 ..
drwxr-sr-x 2 root elasticsearch   60 Oct  7 14:08 ..2024_10_07_14_08_45.58930580
lrwxrwxrwx 1 root elasticsearch   30 Oct  7 14:08 ..data -> ..2024_10_07_14_08_45.58930580
lrwxrwxrwx 1 root elasticsearch   34 Oct  7 14:08 aws-web-identity-token-file -> ..data/aws-web-identity-token-file
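
Since the projected token is requested with `expirationSeconds: 86400`, one thing worth ruling out is that the file behind the symlink holds a stale token. The following is a minimal sketch, not part of the original report, that decodes the token's claims without verifying the signature and reports how long it remains valid. It assumes a Python interpreter is available somewhere the same volume is mounted; the function names are hypothetical.

```python
import base64
import json
import time
from pathlib import Path
from typing import Optional

# Path taken from AWS_WEB_IDENTITY_TOKEN_FILE in the manifest above.
TOKEN_PATH = Path(
    "/usr/share/elasticsearch/config/repository-s3/aws-web-identity-token-file"
)


def decode_jwt_claims(token: str) -> dict:
    """Return the claims of a JWT. The signature is NOT verified."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload = parts[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(claims: dict, now: Optional[float] = None) -> float:
    """Seconds left until the `exp` claim; negative if already expired."""
    current = time.time() if now is None else now
    return claims["exp"] - current


# Only touches the file when it actually exists (i.e. inside the pod).
if __name__ == "__main__" and TOKEN_PATH.exists():
    claims = decode_jwt_claims(TOKEN_PATH.read_text())
    print("aud:", claims.get("aud"))
    print("seconds until expiry:", seconds_until_expiry(claims))
```

If the remaining lifetime is positive at the moment restores fail, the file itself is probably being refreshed correctly and the problem is more likely in how the credentials are picked up.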

I can also verify that restore and backup operations work right after the cluster is created.

However, after some time (presumably a couple of days), restores stop working, although backups seem to keep working (there are files in S3 and no complaints in the logs). The logs contain this error:

org.elasticsearch.indices.recovery.RecoveryFailedException: [myindexname][2]: Recovery failed on {es-prd-es-data-1}{dho1YZSWRnqxL1YGJABMkA}{H-QVYCvUTxqJ14ARKQMz6g}{es-prd-es-data-1}{192.168.167.134}{192.168.167.134:9300}{d}{8.14.1}{7000099-8505000}{xpack.installed=true, transform.config_version=10.0.0, k8s_node_name=ip-10-200-100-6.eu-central-1.compute.internal, ml.config_version=12.0.0}
	at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$36(IndexShard.java:3319)
	at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
	at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179)
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$9(StoreRecovery.java:402)
	at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
	at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179)
	at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
	at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
	at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
	at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$19(StoreRecovery.java:600)
	at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.acceptException(ActionListenerImplementations.java:186)
	at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
	at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onFailure(ActionListenerImplementations.java:191)
	at org.elasticsearch.action.support.SubscribableListener$FailureResult.complete(SubscribableListener.java:394)
	at org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:306)
	at org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:331)
	at org.elasticsearch.action.support.SubscribableListener.onFailure(SubscribableListener.java:250)
	at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
	at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
	at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
	at org.elasticsearch.action.support.SubscribableListener$FailureResult.complete(SubscribableListener.java:394)
	at org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:306)
	at org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:331)
	at org.elasticsearch.action.support.SubscribableListener.onFailure(SubscribableListener.java:250)
	at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
	at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
	at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
	at org.elasticsearch.action.support.SubscribableListener$FailureResult.complete(SubscribableListener.java:394)
	at org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:306)
	at org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:331)
	at org.elasticsearch.action.support.SubscribableListener.onFailure(SubscribableListener.java:250)
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$55(BlobStoreRepository.java:3330)
	at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.acceptException(ActionListenerImplementations.java:186)
	at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
	at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onFailure(ActionListenerImplementations.java:191)
	at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
	at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
	at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
	at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onFailure(ActionListenerImplementations.java:317)
	at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:151)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:967)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1570)
Caused by: [myindexname/MOkZmjo8T6K46_k4wmAmqA][[myindexname][2]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
	... 42 more
Caused by: [myindexname/MOkZmjo8T6K46_k4wmAmqA][[myindexname][2]] org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [new-07-10-24/pB1C3ib7S9-L73yPMU5y9g]
	... 14 more
Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBZMTJXWXJQKYZ5; S3 Extended Request ID: n7pzR2kAQkQwUl/Ojli4NjoyYckzwY15CfWlRwJwWXLnblKwoymHXiaDQk3A96JqQfQvO/3nxVs=; Proxy: null)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1524)
	at org.elasticsearch.repositories.s3.S3RetryingInputStream.lambda$openStreamWithRetry$0(S3RetryingInputStream.java:100)
	at java.security.AccessController.doPrivileged(AccessController.java:319)
	at org.elasticsearch.repositories.s3.SocketAccess.doPrivileged(SocketAccess.java:31)
	at org.elasticsearch.repositories.s3.S3RetryingInputStream.openStreamWithRetry(S3RetryingInputStream.java:100)
	at org.elasticsearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:85)
	at org.elasticsearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:67)
	at org.elasticsearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:104)
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:123)
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.loadShardSnapshot(BlobStoreRepository.java:3629)
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$56(BlobStoreRepository.java:3347)
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:100)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	... 3 more
	Suppressed: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBWZA7Q51PEW6DV; S3 Extended Request ID: iGEfqH4690vYswR7GlaM9SfJcG0ZLYOobdnU+yDviPT9xKunhEVklcYoNvCsQvyFcopsQgJArIQ=; Proxy: null)
		... 30 more
	Suppressed: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBJ8E9ND66D4CB8; S3 Extended Request ID: J9YYcZGRbVdceB8Ny3Rt33+vsGT/3TwzStlGRPy8yKaROLvOT+yWv5K0hRtlBNJY9Nz4p1X5/uk=; Proxy: null)
		... 30 more
	Suppressed: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBTMKXD6H0BWYF7; S3 Extended Request ID: mHqdDKbOFJs7LPgVSfgWUOiI6K7VEMHBWPtvLxm/P5uJKTJPSXROTqKJjzupO1VC+4rnZRhW1yw=; Proxy: null)
		... 30 more

That boils down to just this:

User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action

This means that the role obtained through IRSA and AWS_WEB_IDENTITY_TOKEN_FILE was dropped when ES attempted the restore, and it used the node role (prd-node-role-20241001140543910500000003) instead of the pod role (prd-ESBackupRole). Installing the AWS CLI on the machine and running it shows the correct role (prd-ESBackupRole):

 PAGER='' HOME=/tmp/ ./aws/dist/aws sts get-caller-identity
{
    "UserId": "WHATEVER:es-cluster-elasticsearch",
    "Account": "123456789012",
    "Arn": "arn:aws:sts::123456789012:assumed-role/prd-ESBackupRole/es-cluster-elasticsearch"
}
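
To spot this kind of mismatch quickly in logs, the assumed-role ARN can be pulled out of the error text and compared against the expected role. This is only an illustrative sketch: the helper names are hypothetical, and the regular expression assumes the `arn:aws:sts::<account>:assumed-role/<role>/<session>` shape seen in the two outputs above.

```python
import re
from typing import Optional, Tuple

# Matches arn:aws:sts::<account>:assumed-role/<role>/<session>
ASSUMED_ROLE_RE = re.compile(
    r"arn:aws:sts::(\d+):assumed-role/([^/\s\"]+)/([^\s\"]+)"
)


def parse_assumed_role(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (account, role, session) from the first assumed-role ARN found."""
    match = ASSUMED_ROLE_RE.search(text)
    if match is None:
        return None
    account, role, session = match.groups()
    return account, role, session


def uses_expected_role(text: str, expected_role: str) -> bool:
    """True if the first assumed-role ARN in `text` names `expected_role`."""
    parsed = parse_assumed_role(text)
    return parsed is not None and parsed[1] == expected_role


# Strings copied from the error message and the CLI output above.
s3_error = (
    "User: arn:aws:sts::123456789023:assumed-role/"
    "prd-node-role-20241001140543910500000003/i-123456789abcde "
    "is not authorized to perform: s3:GetObject"
)
cli_arn = (
    "arn:aws:sts::123456789012:assumed-role/"
    "prd-ESBackupRole/es-cluster-elasticsearch"
)

print(parse_assumed_role(s3_error))
print(uses_expected_role(s3_error, "prd-ESBackupRole"))
print(uses_expected_role(cli_arn, "prd-ESBackupRole"))
```

In the error, the session name is an EC2 instance ID (`i-...`), which is consistent with credentials coming from the instance profile rather than from the web identity token.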

I'd really appreciate any pointers or help with troubleshooting this. Thanks in advance.

Originally posted by @ragne in #7208

pebrc, Oct 09 '24 08:10