data-on-eks
Granting S3 access to karpenter nodes
- [x] ✋ I have searched the open/closed issues and my issue is not listed.
Please describe your question here
Thanks for the great examples! I altered the jupyterhub on eks example (for a private cluster accessed via a Tailscale VPN) and I'm now adding a ray cluster and trying to grant S3 access to the jobs running on karpenter nodes. I was trying to use the same karpenter provisioners but how do I grant the jobs S3 access?
- The ray example uses the terraform-aws-modules/eks/aws//modules/karpenter module and attaches the relevant policies via the iam_role_additional_policies argument, which is pretty straightforward.
- The jupyterhub example (which I currently have running) uses aws-ia/eks-blueprints-addons/aws, which ultimately uses aws-ia/eks-blueprints-addon/aws. The two things I've tried that haven't worked are:
- attaching the relevant policies via the role_policies input
karpenter = {
  role_policies = {
    bucket1_get_policy = bucket1_get_policy_arn
    bucket2_get_policy = bucket2_get_policy_arn
  }
}
- using aws_iam_role_policy_attachment resources with role = module.eks_blueprints_addons.karpenter.iam_role_name.
resource "aws_iam_role_policy_attachment" "karpenter_s3_access" {
  for_each = toset([
    bucket1_get_policy_arn,
    bucket2_get_policy_arn,
  ])
  role       = module.eks_blueprints_addons.karpenter.node_instance_profile_name
  policy_arn = each.value
}
Also is there a preference for which module to use?
Provide a link to the example/module related to the question
Additional context
I may just follow the ray example and generate karpenter resources outside of aws-ia/eks-blueprints-addons/aws.
Also here's the policy I'm attaching:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Download",
      "Effect": "Allow",
      "Action": [
        "s3:List*",
        "s3:Get*"
      ],
      "Resource": [
        "${bucket_arn}",
        "${bucket_arn}/*"
      ]
    },
    {
      "Sid": "Decrypt",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": [
        "${kms_key_arn}"
      ]
    }
  ]
}
And the error I'm getting is:
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
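As a rough sanity check on the policy document itself, the sketch below evaluates its Allow statements against the call that failed. The S3 ListObjects API call is authorized by the s3:ListBucket action on the bucket ARN. The bucket and KMS key ARNs are hypothetical stand-ins for ${bucket_arn} and ${kms_key_arn}, and the helper names are made up for illustration. It ignores explicit Deny statements, conditions, and bucket or key policies, so it simplifies IAM evaluation rather than modelling it faithfully.

```python
import fnmatch

# Hypothetical values standing in for ${bucket_arn} and ${kms_key_arn}.
BUCKET_ARN = "arn:aws:s3:::example-bucket"
KMS_KEY_ARN = "arn:aws:kms:us-west-2:111122223333:key/example-key-id"

POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Download",
            "Effect": "Allow",
            "Action": ["s3:List*", "s3:Get*"],
            "Resource": [BUCKET_ARN, f"{BUCKET_ARN}/*"],
        },
        {
            "Sid": "Decrypt",
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": [KMS_KEY_ARN],
        },
    ],
}


def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def statement_allows(statement, action, resource):
    """Simplified check: Allow statements only, wildcard matching only."""
    if statement.get("Effect") != "Allow":
        return False
    # Action names are matched case-insensitively; resources are not.
    action_ok = any(
        fnmatch.fnmatchcase(action.lower(), pattern.lower())
        for pattern in _as_list(statement["Action"])
    )
    resource_ok = any(
        fnmatch.fnmatchcase(resource, pattern)
        for pattern in _as_list(statement["Resource"])
    )
    return action_ok and resource_ok


def policy_allows(policy, action, resource):
    """True if any Allow statement covers the action on the resource."""
    return any(statement_allows(s, action, resource) for s in policy["Statement"])
```

Under these assumptions, policy_allows(POLICY, "s3:ListBucket", BUCKET_ARN) is true. That would point away from the policy text and towards which identity the job is actually using, or whether the policy is attached to it.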
> Thanks for the great examples! I altered the jupyterhub on eks example (for a private cluster accessed via a Tailscale VPN) and I'm now adding a ray cluster and trying to grant S3 access to the jobs running on karpenter nodes.
Amazing! Tailscale is a great addition.
While we did show S3 access via the karpenter module's iam_role_additional_policies, IIRC that was because the RayCluster helm chart at the time did not support specifying a serviceAccountName as a value. I just looked and it seems support has since been added, i.e. you can now specify head.serviceAccountName and worker.serviceAccountName as helm values, which is great. This enables us to use IAM Roles for Service Accounts (IRSA), which is demonstrated in the JupyterHub example. This would be the preferred way over the karpenter node role, as it restricts access to the S3 buckets to the RayCluster pods only.
I will update the ray blueprint soon but you can go about it on your own as it is shown for the JupyterHub example here ... https://github.com/awslabs/data-on-eks/blob/44cb0769afc752e57bfb2d11192ebcec1ce97389/ai-ml/jupyterhub/jupyterhub.tf#L4-L48
Then use this serviceAccountName as the value in the RayCluster helm chart (if you are using helm) or directly in RayCluster.yaml.
Optionally you can use our aws-ia/eks-blueprints-addons/aws module, which we provide as a convenience if you want to avoid some of the boilerplate code and create the helm_release and IRSA in a single shot (this is what I will use).
HTH, and please let us know if you run into any issues.
This sounds fantastic! When I tried implementing it I get the following error:
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity
I created an additional policy and attached it to the ray_single_user_irsa but it results in the same error. Am I attaching it to the wrong role?
Here's my code:
resource "kubernetes_namespace" "ray" {
  metadata {
    name = "ray"
  }
}

module "ray_single_user_irsa" {
  source    = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  role_name = "${data.terraform_remote_state.eks.outputs.cluster_name}-ray-single-user-sa"
  role_policy_arns = {
    bucket1_get_policy = bucket1_get_policy_arn
    bucket2_get_policy = bucket2_get_policy_arn
    sts_policy         = module.ray_policy.arn
  }
  oidc_providers = {
    main = {
      provider_arn               = data.terraform_remote_state.eks.outputs.oidc_provider_arn
      namespace_service_accounts = ["${kubernetes_namespace.ray.metadata[0].name}:ray-single-user"]
    }
  }
}

resource "kubernetes_service_account_v1" "ray_single_user_sa" {
  metadata {
    name        = "${data.terraform_remote_state.eks.outputs.cluster_name}-ray-single-user"
    namespace   = kubernetes_namespace.ray.metadata[0].name
    annotations = { "eks.amazonaws.com/role-arn" : module.ray_single_user_irsa.iam_role_arn }
  }
  automount_service_account_token = true
}

resource "kubernetes_secret_v1" "ray_single_user" {
  metadata {
    name      = "${data.terraform_remote_state.eks.outputs.cluster_name}-ray-single-user-secret"
    namespace = kubernetes_namespace.ray.metadata[0].name
    annotations = {
      "kubernetes.io/service-account.name"      = kubernetes_service_account_v1.ray_single_user_sa.metadata[0].name
      "kubernetes.io/service-account.namespace" = kubernetes_namespace.ray.metadata[0].name
    }
  }
  type = "kubernetes.io/service-account-token"
}

module "ray_policy" {
  source      = "terraform-aws-modules/iam/aws//modules/iam-policy"
  version     = "~> 5.20"
  name        = "RayPolicy"
  description = "IAM Policy to allow ray to function"
  policy = jsonencode(
    {
      Version = "2012-10-17"
      Statement = [
        {
          Sid      = "AssumeRoleWithWebIdentity"
          Effect   = "Allow"
          Action   = ["sts:AssumeRoleWithWebIdentity"]
          Resource = ["*"]
        },
      ]
    }
  )
}
The full error:
ray::convert_dataset() (pid=458, ip=100.64.160.5)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/working_dir_files/_ray_pkg_20d44c85bf820c86/rb_analysis/rb/images/ngff.py", line 690, in convert_dataset
if output_path.exists():
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/cloudpath.py", line 389, in exists
return self.client._exists(self)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 179, in _exists
return self._s3_file_query(cloud_path) is not None
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 197, in _s3_file_query
return next(
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 198, in <genexpr>
(
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/boto3/resources/collection.py", line 81, in __iter__
for page in self.pages():
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/boto3/resources/collection.py", line 171, in pages
for page in pages:
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/paginate.py", line 269, in __iter__
response = self._make_request(current_kwargs)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/paginate.py", line 357, in _make_request
return self._method(**current_kwargs)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 534, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 959, in _make_api_call
http, parsed_response = self._make_request(
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 982, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
return self._send_request(request_dict, operation_model)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/endpoint.py", line 198, in _send_request
request = self.create_request(request_dict, operation_model)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/endpoint.py", line 134, in create_request
self._event_emitter.emit(
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
return self._emit(event_name, kwargs)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
response = handler(**kwargs)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/signers.py", line 105, in handler
return self.sign(operation_name, request)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/signers.py", line 180, in sign
auth = self.get_auth_instance(**kwargs)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/signers.py", line 284, in get_auth_instance
frozen_credentials = self._credentials.get_frozen_credentials()
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 610, in get_frozen_credentials
self._refresh()
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 498, in _refresh
self._protected_refresh(is_mandatory=is_mandatory_refresh)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 514, in _protected_refresh
metadata = self._refresh_using()
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 661, in fetch_credentials
return self._get_cached_credentials()
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 671, in _get_cached_credentials
response = self._get_credentials()
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 905, in _get_credentials
return client.assume_role_with_web_identity(**kwargs)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 534, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 976, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity
I've also passed kubernetes_service_account_v1.ray_single_user_sa.metadata[0].name to both head.serviceAccountName and worker.serviceAccountName.
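The two AccessDenied errors in this thread name different operations, which hints at different failing layers. ListObjects is the S3 request itself, whereas AssumeRoleWithWebIdentity is the credential exchange that happens before any S3 request is signed, as the traceback above shows. A minimal sketch for telling them apart from the message text follows. The function names and the two layer labels are made up for illustration, and the pattern is based only on the message format seen in this thread.

```python
import re

# Matches the message format seen in the errors above.
_CLIENT_ERROR = re.compile(
    r"An error occurred \((?P<code>[^)]+)\) when calling the "
    r"(?P<operation>\w+) operation: (?P<message>.*)"
)


def parse_client_error(text):
    """Return the code, operation, and message, or None if nothing matches."""
    match = _CLIENT_ERROR.search(text)
    return match.groupdict() if match else None


def failing_layer(text):
    """Rough, illustrative label for where the request was rejected."""
    parsed = parse_client_error(text)
    if parsed is None:
        return "unknown"
    if parsed["operation"].startswith("AssumeRole"):
        # The pod could not obtain credentials for the role at all.
        return "credentials"
    # Credentials were obtained, but the call itself was not permitted.
    return "permissions"
```

With this labelling, the original ListObjects error falls under "permissions" and the later one under "credentials", so attaching more S3 policies to the role would not be expected to change the second error.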
It looks like the role is mounted correctly:
❯ kubectl -n ray exec -it ray-cluster-cpu-kuberay-head-5ktw2 -- env | grep AWS
Defaulted container "ray-head" out of: ray-head, autoscaler
AWS_STS_REGIONAL_ENDPOINTS=regional
AWS_REGION=us-west-2
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_DEFAULT_REGION=us-west-2
AWS_ROLE_ARN=arn:aws:iam::XXXXXXXXXX:role/eks-stage-ray-single-user-sa
Here's Jupyterhub's for reference:
❯ kubectl -n jupyterhub exec -it jupyter-evan -- env | grep AWS
Defaulted container "notebook" out of: notebook, block-cloud-metadata (init)
AWS_DEFAULT_REGION=us-west-2
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_STS_REGIONAL_ENDPOINTS=regional
AWS_ROLE_ARN=arn:aws:iam::XXXXXXXXXX:role/eks-stage-jupyterhub-single-user-sa
AWS_REGION=us-west-2
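The environment variables only show that the webhook injected a role ARN and a token path. They do not show which identity the token asserts. One way to see that, sketched below with the standard library only, is to decode the claims of the projected token from inside the pod. The helper names are hypothetical, and the signature is deliberately not verified, so this is for inspection only.

```python
import base64
import json
import os


def decode_jwt_claims(token):
    """Return the claims of a JWT without verifying its signature."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload = parts[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def read_pod_identity_claims(environ=os.environ):
    """Decode the token file named by AWS_WEB_IDENTITY_TOKEN_FILE."""
    path = environ["AWS_WEB_IDENTITY_TOKEN_FILE"]
    with open(path, encoding="utf-8") as handle:
        return decode_jwt_claims(handle.read())
```

Calling read_pod_identity_claims() inside the Ray head or worker container and looking at the sub and aud claims shows the exact subject string (system:serviceaccount:NAMESPACE:NAME) that the role's trust policy has to accept.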
Also here's the trust relationship for the role via the aws console:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::XXXXXXXXXX:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX:sub": "system:serviceaccount:ray:ray-single-user",
          "oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
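One thing that may be worth checking here: whether a role can be assumed via sts:AssumeRoleWithWebIdentity is decided by this trust policy, not by a permissions policy attached to the role, and the sub condition must match the pod's service account exactly. The sketch below compares the subject from the trust policy above with the service account name from the Terraform code. The cluster name eks-stage is an inference from the AWS_ROLE_ARN shown earlier, the OIDC provider ID is a placeholder, and the helper names are hypothetical.

```python
import fnmatch

# Placeholder OIDC provider host/ID; the real one is redacted in the thread.
OIDC = "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE"

TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{OIDC}:sub": "system:serviceaccount:ray:ray-single-user",
                    f"{OIDC}:aud": "sts.amazonaws.com",
                }
            },
        }
    ],
}


def expected_subject(namespace, service_account_name):
    """Subject claim that EKS puts in a pod's projected token."""
    return f"system:serviceaccount:{namespace}:{service_account_name}"


def trusted_subject_patterns(trust_policy):
    """Collect (operator, value) pairs for every ':sub' condition key."""
    patterns = []
    for statement in trust_policy["Statement"]:
        for operator, conditions in statement.get("Condition", {}).items():
            if operator not in ("StringEquals", "StringLike"):
                continue
            for key, value in conditions.items():
                if key.endswith(":sub"):
                    values = [value] if isinstance(value, str) else value
                    patterns.extend((operator, v) for v in values)
    return patterns


def subject_is_trusted(trust_policy, subject):
    """Simplified match of a subject against the collected conditions."""
    for operator, pattern in trusted_subject_patterns(trust_policy):
        if operator == "StringEquals" and subject == pattern:
            return True
        if operator == "StringLike" and fnmatch.fnmatchcase(subject, pattern):
            return True
    return False


# Service account name as built in the Terraform above, assuming the
# cluster name is "eks-stage" (inferred, not confirmed in the thread).
ACTUAL_SUBJECT = expected_subject("ray", "eks-stage-ray-single-user")
```

If that inference holds, subject_is_trusted(TRUST_POLICY, ACTUAL_SUBJECT) is false, because the trust policy names ray-single-user while the service account is created with the cluster name as a prefix. Making the name in namespace_service_accounts and the service account's metadata name identical would be one way to rule this out.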
~~It's possible this issue is with how I'm using ray as currently my ray.remote function calls the ray_dask_get scheduler meaning the remote job tries to create more remote jobs on the ray cluster. Though this is a strange error if that is indeed the issue. I can adjust my script so that the parent job is performed locally instead of on the cluster and see if that works.~~ The issue seems to occur regardless (i.e. without the recurrent remote calls).
I just validated that I get the same error when trying to read from S3 on my jupyterhub deployment, despite following the guide and attaching arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess to jupyterhub_single_user_irsa. Is there a policy I first need to attach to my AWS account to allow sts:AssumeRoleWithWebIdentity to work? I'll look through the blueprints/documentation again to see if I missed something.
I realize I've potentially gotten off topic from my original question. @askulkarni2 answered the question in theory. You're welcome to close the issue or leave it open for task planning the ray blueprint update. I will create a new issue with the bug I'm seeing and try to create a minimal code reproduction.