cortex
feat(aws assumerolewithwebidentity): fixed s3 access for ruler to use…
Fixed S3 access for the ruler to use AssumeRoleWithWebIdentity in an IRSA setup on AWS.
This PR adds code that assumes a role with web identity and uses the standard environment variables to enable IRSA.
Which issue(s) this PR fixes: Fixes 3740
Checklist
- [x] Tests updated
- [x] Documentation added
- [x] `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`
The default credential resolver should support web identity already; I don't think we need to implement assume role with web identity manually here.
@alvinlin123, the AWS client SDK implements AssumeRoleWithWebIdentity correctly. However, as far as I could tell, for the ruler and the alertmanager the STS AssumeRoleWithWebIdentity calls are issued against the S3 endpoint instead of the STS endpoint. Might this be due to an incomplete configuration on client initialization?
Hmm this is weird, because we run alert manager and ruler using IRSA too, without any issue. We might need to dig deeper into what is happening for you. Do you have the latest error message?
Most likely you are right: the client initialization may be incomplete, or some other env var may be in play here. Would it be possible to do a build with debug logging turned on for the session and see what's going on?
I will do some code reading in the meantime.
I'll add a detailed bug report with debug logging tomorrow.
@blut also, if you can post your alertmanager/ruler config (including the S3 client settings), it may help me troubleshoot :)
Also, do you know if the environment you are running in allows the global STS endpoint (https://sts.amazonaws.com)? I have had some customers hit weird issues because their firewall/proxy doesn't allow the global STS endpoint. Would setting the env variable `AWS_STS_REGIONAL_ENDPOINTS=regional` be something you can test as well? I am scratching my head because I just double-checked that my alertmanager and ruler environment is using IRSA and is not having any issues.
And it's not that I don't want to merge this PR; I am more worried that the AWS SDK has a bug or something. That's why I appreciate your help on this :-)
Hi @alvinlin123 I've attached the debug.log, but I think the interesting message is the following error:
```
caused by: SerializationError: failed to unmarshal error message
status code: 405, request id:
caused by: UnmarshalError: failed to unmarshal error message
00000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 |<?xml version="1|
00000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 |.0" encoding="UT|
00000020 46 2d 38 22 3f 3e 0a 3c 45 72 72 6f 72 3e 3c 43 |F-8"?>.<Error><C|
00000030 6f 64 65 3e 4d 65 74 68 6f 64 4e 6f 74 41 6c 6c |ode>MethodNotAll|
00000040 6f 77 65 64 3c 2f 43 6f 64 65 3e 3c 4d 65 73 73 |owed</Code><Mess|
00000050 61 67 65 3e 54 68 65 20 73 70 65 63 69 66 69 65 |age>The specifie|
00000060 64 20 6d 65 74 68 6f 64 20 69 73 20 6e 6f 74 20 |d method is not |
00000070 61 6c 6c 6f 77 65 64 20 61 67 61 69 6e 73 74 20 |allowed against |
00000080 74 68 69 73 20 72 65 73 6f 75 72 63 65 2e 3c 2f |this resource.</|
00000090 4d 65 73 73 61 67 65 3e 3c 4d 65 74 68 6f 64 3e |Message><Method>|
000000a0 50 4f 53 54 3c 2f 4d 65 74 68 6f 64 3e 3c 52 65 |POST</Method><Re|
000000b0 73 6f 75 72 63 65 54 79 70 65 3e 53 45 52 56 49 |sourceType>SERVI|
000000c0 43 45 3c 2f 52 65 73 6f 75 72 63 65 54 79 70 65 |CE</ResourceType|
000000d0 3e 3c 52 65 71 75 65 73 74 49 64 3e 59 4a 33 42 |><RequestId>YJ3B|
000000e0 43 37 4a 4a 47 56 34 37 45 45 59 45 3c 2f 52 65 |C7JJGV47EEYE</Re|
000000f0 71 75 65 73 74 49 64 3e 3c 48 6f 73 74 49 64 3e |questId><HostId>|
00000100 55 55 77 71 55 70 51 54 74 6c 44 67 35 54 7a 2f |UUwqUpQTtlDg5Tz/|
00000110 7a 55 42 57 2b 79 73 4f 55 36 75 67 53 2f 4d 6d |zUBW+ysOU6ugS/Mm|
00000120 4e 2b 45 32 52 62 56 66 66 4b 47 72 56 65 31 5a |N+E2RbVffKGrVe1Z|
00000130 7a 76 49 51 77 35 34 34 32 4f 4d 4f 47 77 37 73 |zvIQw5442OMOGw7s|
00000140 6c 2f 44 45 70 39 61 38 55 53 30 3d 3c 2f 48 6f |l/DEp9a8US0=</Ho|
00000150 73 74 49 64 3e 3c 2f 45 72 72 6f 72 3e |stId></Error>|
caused by: unknown error response tag, {{ Error} []}
```
This error happened with the following configuration:
```
- args:
  - -log.level=debug
  - -api.response-compression-enabled=true
  - -blocks-storage.backend=s3
  - -blocks-storage.s3.bucket-name=cortex-storage-uash1kei
  - -blocks-storage.s3.endpoint=s3.eu-central-1.amazonaws.com
  - -consul.hostname=
  - -distributor.health-check-ingesters=true
  - -distributor.replication-factor=3
  - -distributor.shard-by-all-labels=true
  - -dynamodb.api-limit=10
  - -dynamodb.url=https://eu-central-1
  - -experimental.ruler.enable-api=true
  - -memberlist.abort-if-join-fails=false
  - -memberlist.bind-port=7946
  - -memberlist.join=gossip-ring.cortex.svc.cluster.local:7946
  - -querier.query-ingesters-within=13h
  - -querier.query-store-after=12h
  - -querier.store-gateway-addresses=store-gateway:9095
  - -ring.heartbeat-timeout=10m
  - -ring.prefix=
  - -ring.store=memberlist
  - -ruler-storage.backend=s3
  - -ruler-storage.s3.bucket-name=cortex-storage-uash1kei
  - -ruler-storage.s3.region=eu-central-1
  - -ruler.alertmanager-url=http://alertmanager.cortex.svc.cluster.local/alertmanager
  - -ruler.enable-sharding=true
  - -ruler.max-rule-groups-per-tenant=20
  - -ruler.max-rules-per-rule-group=15
  - -ruler.ring.consul.hostname=
  - -ruler.ring.store=memberlist
  - -ruler.storage.s3.buckets=cortex-storage-uash1kei
  - -ruler.storage.s3.endpoint=s3.eu-central-1.amazonaws.com
  - -consul.hostname=
  - -ruler.storage.s3.force-path-style=false
  - -ruler.storage.s3.region=eu-central-1
  - -ruler.storage.type=s3
  - -runtime-config.file=/etc/cortex/overrides.yaml
  - -s3.url=https://eu-central-1/cortex-storage-uash1kei
  - -schema-config-file=/etc/cortex/schema/config.yaml
  - -store.cardinality-limit=1000000
  - -store.engine=blocks
  - -store.max-query-length=768h
  - -target=ruler
  env:
  - name: AWS_STS_REGIONAL_ENDPOINT
    value: regional
```
As seen in the attached pod.yaml, the required `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` are also set through EKS.
I've added the AWS_STS_REGIONAL_ENDPOINT as suggested.
Edit: The firewall should not be an issue, since the ruler and all the other cortex components are deployed to the same cluster & nodes. The cortex components also share the same serviceaccount.
I will take a closer look. Thank you for getting back. I will take a look asap :)
Hi @alvinlin123, did you find a chance to check out my configuration?
@blut I'll take a look today, forgot to ask which commit/version of Cortex you are using?
@blut I think I know what's going on. Can you remove the `-ruler.storage.s3.endpoint=s3.eu-central-1.amazonaws.com` config?
The config results in the AWS SDK's `WithEndpoint` method being called. I vaguely remember that the endpoint set by `WithEndpoint` is used for all calls, including calls to STS. This explains why you are seeing an error message from S3 when calling STS with WebIdentity.
We're still on Cortex v1.9.0, deployed to Kubernetes using Tanka. Separately, for the ruler deployment, I tried upgrading the image to v1.11.1; with `ruler.storage.s3.endpoint` removed, I get a much funnier error:
level=error ts=2022-06-28T10:05:40.240948562Z caller=ruler.go:481 msg="unable to list rules" err="WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.dummy.amazonaws.com/\": dial tcp: lookup sts.dummy.amazonaws.com on 172.20.0.10:53: no such host"
It appears the region is defined somewhere separately.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
@blut do you still have the same issue?
Hi @alvinlin123, we've switched to Mimir, where this issue is resolved. Feel free to close.