cloud-on-k8s
EKS IRSA with Elasticsearch uses incorrect role
@pebrc I'm sorry, I'm not sure whether I should open a new ticket, but this one might have to be reopened.

We are running Elasticsearch on EKS with IRSA. The ES version is

    Version: 8.14.1, Build: docker/93a57a1a76f556d8aee6a90d1a95b06187501310/2024-06-10T23:35:17.114581191Z, JVM: 22.0.1

which should contain the fix. I believe this is the relevant part of the ES manifest (we deploy the ES CRD with Helm, but that shouldn't matter here):

    env:
      - name: AWS_ROLE_SESSION_NAME
        value: "{{ include "es.name" .}}-cluster-elasticsearch"
      - name: AWS_WEB_IDENTITY_TOKEN_FILE
        value: "/usr/share/elasticsearch/config/repository-s3/aws-web-identity-token-file"
      - name: AWS_ROLE_ARN
        value: {{ .Values.serviceAccount.roleArn }}
    volumeMounts:
      - name: aws-iam-token
        mountPath: /usr/share/elasticsearch/config/repository-s3
    volumes:
      - name: aws-iam-token
        projected:
          defaultMode: 420
          sources:
            - serviceAccountToken:
                audience: sts.amazonaws.com
                expirationSeconds: 86400
                path: aws-web-identity-token-file

I can verify that the symlink is getting created on the node:

    elasticsearch@es-prd-es-data-2:~$ ls -al /usr/share/elasticsearch/config/repository-s3
    total 4
    drwxrwsrwt 3 root elasticsearch  100 Oct 7 14:08 .
    drwxrwsrwx 9 root elasticsearch 4096 Oct 7 14:08 ..
    drwxr-sr-x 2 root elasticsearch   60 Oct 7 14:08 ..2024_10_07_14_08_45.58930580
    lrwxrwxrwx 1 root elasticsearch   30 Oct 7 14:08 ..data -> ..2024_10_07_14_08_45.58930580
    lrwxrwxrwx 1 root elasticsearch   34 Oct 7 14:08 aws-web-identity-token-file -> ..data/aws-web-identity-token-file

I can also verify that backup and restore operations work right after the cluster is created.
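Since the projected service account token is a JWT, one way to go a step beyond checking that the symlink exists is to decode its claims and look at the audience and expiry. The following is only a sketch, assuming a Python interpreter is available wherever the token file can be read; the helper names `decode_jwt_payload` and `seconds_until_expiry` are made up for illustration, and the signature is deliberately not verified.

```python
import base64
import json
import time
from typing import Optional

# Path taken from the manifest above (AWS_WEB_IDENTITY_TOKEN_FILE).
TOKEN_PATH = "/usr/share/elasticsearch/config/repository-s3/aws-web-identity-token-file"


def decode_jwt_payload(token: str) -> dict:
    """Return the claims of a JWT without verifying its signature."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload = parts[1]
    # JWTs use unpadded base64url, so restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(claims: dict, now: Optional[float] = None) -> float:
    """Seconds left before the token's `exp` claim; negative if already expired."""
    if now is None:
        now = time.time()
    return claims["exp"] - now


def describe_token_file(path: str = TOKEN_PATH) -> str:
    """Read the token file and summarise its audience and remaining lifetime."""
    with open(path, "r", encoding="utf-8") as handle:
        claims = decode_jwt_payload(handle.read())
    remaining = seconds_until_expiry(claims)
    return "aud={} expires_in={:.0f}s".format(claims.get("aud"), remaining)
```

Calling `describe_token_file()` a few days apart would show whether the file on disk is being rotated and still carries the `sts.amazonaws.com` audience. It does not show which token or credentials the running ES process is actually using.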
However, after some time (presumably a couple of days) restores stop working. Backups do not seem to be affected: there are files in S3 and no complaints in the logs. The logs show this error:
    org.elasticsearch.indices.recovery.RecoveryFailedException: [myindexname][2]: Recovery failed on {es-prd-es-data-1}{dho1YZSWRnqxL1YGJABMkA}{H-QVYCvUTxqJ14ARKQMz6g}{es-prd-es-data-1}{192.168.167.134}{192.168.167.134:9300}{d}{8.14.1}{7000099-8505000}{xpack.installed=true, transform.config_version=10.0.0, k8s_node_name=ip-10-200-100-6.eu-central-1.compute.internal, ml.config_version=12.0.0}
        at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$36(IndexShard.java:3319)
        at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
        at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179)
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$9(StoreRecovery.java:402)
        at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
        at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179)
        at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
        at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
        at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
        at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$19(StoreRecovery.java:600)
        at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.acceptException(ActionListenerImplementations.java:186)
        at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
        at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onFailure(ActionListenerImplementations.java:191)
        at org.elasticsearch.action.support.SubscribableListener$FailureResult.complete(SubscribableListener.java:394)
        at org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:306)
        at org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:331)
        at org.elasticsearch.action.support.SubscribableListener.onFailure(SubscribableListener.java:250)
        at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
        at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
        at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
        at org.elasticsearch.action.support.SubscribableListener$FailureResult.complete(SubscribableListener.java:394)
        at org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:306)
        at org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:331)
        at org.elasticsearch.action.support.SubscribableListener.onFailure(SubscribableListener.java:250)
        at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
        at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
        at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
        at org.elasticsearch.action.support.SubscribableListener$FailureResult.complete(SubscribableListener.java:394)
        at org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:306)
        at org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:331)
        at org.elasticsearch.action.support.SubscribableListener.onFailure(SubscribableListener.java:250)
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$55(BlobStoreRepository.java:3330)
        at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.acceptException(ActionListenerImplementations.java:186)
        at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
        at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onFailure(ActionListenerImplementations.java:191)
        at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
        at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
        at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
        at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onFailure(ActionListenerImplementations.java:317)
        at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:151)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:967)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.lang.Thread.run(Thread.java:1570)
    Caused by: [myindexname/MOkZmjo8T6K46_k4wmAmqA][[myindexname][2]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
        ... 42 more
    Caused by: [myindexname/MOkZmjo8T6K46_k4wmAmqA][[myindexname][2]] org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [new-07-10-24/pB1C3ib7S9-L73yPMU5y9g]
        ... 14 more
    Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBZMTJXWXJQKYZ5; S3 Extended Request ID: n7pzR2kAQkQwUl/Ojli4NjoyYckzwY15CfWlRwJwWXLnblKwoymHXiaDQk3A96JqQfQvO/3nxVs=; Proxy: null)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)
        at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1524)
        at org.elasticsearch.repositories.s3.S3RetryingInputStream.lambda$openStreamWithRetry$0(S3RetryingInputStream.java:100)
        at java.security.AccessController.doPrivileged(AccessController.java:319)
        at org.elasticsearch.repositories.s3.SocketAccess.doPrivileged(SocketAccess.java:31)
        at org.elasticsearch.repositories.s3.S3RetryingInputStream.openStreamWithRetry(S3RetryingInputStream.java:100)
        at org.elasticsearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:85)
        at org.elasticsearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:67)
        at org.elasticsearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:104)
        at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:123)
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.loadShardSnapshot(BlobStoreRepository.java:3629)
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$56(BlobStoreRepository.java:3347)
        at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:100)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
        ... 3 more
        Suppressed: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBWZA7Q51PEW6DV; S3 Extended Request ID: iGEfqH4690vYswR7GlaM9SfJcG0ZLYOobdnU+yDviPT9xKunhEVklcYoNvCsQvyFcopsQgJArIQ=; Proxy: null)
            ... 30 more
        Suppressed: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBJ8E9ND66D4CB8; S3 Extended Request ID: J9YYcZGRbVdceB8Ny3Rt33+vsGT/3TwzStlGRPy8yKaROLvOT+yWv5K0hRtlBNJY9Nz4p1X5/uk=; Proxy: null)
            ... 30 more
        Suppressed: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBTMKXD6H0BWYF7; S3 Extended Request ID: mHqdDKbOFJs7LPgVSfgWUOiI6K7VEMHBWPtvLxm/P5uJKTJPSXROTqKJjzupO1VC+4rnZRhW1yw=; Proxy: null)
            ... 30 more

That boils down to just this:

    User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action

This means that the IRSA role obtained through AWS_WEB_IDENTITY_TOKEN_FILE was dropped when ES made that request, and ES used the node role (prd-node-role-20241001140543910500000003) instead of the pod role (prd-ESBackupRole). Installing the AWS CLI on the machine shows the correct role (prd-ESBackupRole):

    PAGER='' HOME=/tmp/ ./aws/dist/aws sts get-caller-identity
    {
        "UserId": "WHATEVER:es-cluster-elasticsearch",
        "Account": "123456789012",
        "Arn": "arn:aws:sts::123456789012:assumed-role/prd-ESBackupRole/es-cluster-elasticsearch"
    }

I'd really appreciate any pointers or help with troubleshooting this. Thanks in advance.
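
To make the node-role versus pod-role comparison above easier to repeat across many log lines, the assumed-role ARN can be pulled out of both the S3 error text and the `get-caller-identity` output. This is a rough sketch: the helper names `assumed_role` and `uses_expected_role` are made up for illustration, and the regular expression only covers the `arn:aws:sts::<account>:assumed-role/<role>/<session>` shape seen in this report.

```python
import re

# Matches the assumed-role ARN shape that appears in the error and CLI output above.
ASSUMED_ROLE_ARN = re.compile(
    r"arn:aws:sts::(?P<account>\d+):assumed-role/(?P<role>[^/\s]+)/(?P<session>[^\s\"]+)"
)


def assumed_role(text: str) -> dict:
    """Extract account, role and session name from the first assumed-role ARN in text."""
    match = ASSUMED_ROLE_ARN.search(text)
    if match is None:
        raise ValueError("no assumed-role ARN found")
    return match.groupdict()


def uses_expected_role(text: str, expected_role: str) -> bool:
    """True if the caller identity in text is the expected (pod) role."""
    return assumed_role(text)["role"] == expected_role


if __name__ == "__main__":
    # Strings copied from the report above.
    s3_error = (
        "User: arn:aws:sts::123456789023:assumed-role/"
        "prd-node-role-20241001140543910500000003/i-123456789abcde "
        "is not authorized to perform: s3:GetObject"
    )
    cli_arn = "arn:aws:sts::123456789012:assumed-role/prd-ESBackupRole/es-cluster-elasticsearch"
    for label, text in (("ES request", s3_error), ("AWS CLI", cli_arn)):
        print(label, assumed_role(text), uses_expected_role(text, "prd-ESBackupRole"))
```

A session name that looks like an EC2 instance ID (`i-...`), as in the error above, suggests the credentials came from the instance profile and not from the web identity token.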
Originally posted by @ragne in #7208