
Granting S3 access to karpenter nodes

Open elyall opened this issue 2 years ago • 6 comments

  • [x] ✋ I have searched the open/closed issues and my issue is not listed.

Please describe your question here

Thanks for the great examples! I altered the jupyterhub on eks example (for a private cluster accessed via a Tailscale VPN) and I'm now adding a ray cluster and trying to grant S3 access to the jobs running on karpenter nodes. I was trying to use the same karpenter provisioners but how do I grant the jobs S3 access?

  • The ray example uses the terraform-aws-modules/eks/aws//modules/karpenter module and attaches the relevant policies via the iam_role_additional_policies argument which is pretty straightforward.
  • The jupyterhub example (which I currently have running) uses aws-ia/eks-blueprints-addons/aws which ultimately uses aws-ia/eks-blueprints-addon/aws. The two things I've tried that haven't worked are:
  1. attaching the relevant policies via the role_policies input
karpenter = {
  role_policies = {
    bucket1_get_policy = bucket1_get_policy_arn
    bucket2_get_policy = bucket2_get_policy_arn
  }
}
  2. using aws_iam_role_policy_attachment resources with role = module.eks_blueprints_addons.karpenter.iam_role_name.
resource "aws_iam_role_policy_attachment" "karpenter_s3_access" {
  for_each = toset([
    bucket1_get_policy_arn,
    bucket2_get_policy_arn,
  ])
  role       = module.eks_blueprints_addons.karpenter.node_instance_profile_name
  policy_arn = each.value
}

Also is there a preference for which module to use?

Provide a link to the example/module related to the question

jupyterhub ray

Additional context

I may just follow the ray example and generate karpenter resources outside of aws-ia/eks-blueprints-addons/aws.

Also here's the policy I'm attaching:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Download",
            "Effect": "Allow",
            "Action": [
                "s3:List*",
                "s3:Get*"
            ],
            "Resource": [
                "${bucket_arn}",
                "${bucket_arn}/*"
            ]
        },
        {
            "Sid": "Decrypt",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                "${kms_key_arn}"
            ]
        }
    ]
}

And the error I'm getting is:

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
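As a sanity check on the policy itself: ListObjects is authorized by the s3:ListBucket action on the bucket ARN, which the s3:List* pattern should match. The sketch below approximates IAM action wildcard matching with Python's standard library. The helper name action_is_allowed is hypothetical, and the check ignores Deny statements, resources, and conditions. If it holds, the AccessDenied is more likely to come from which role the job's credentials belong to than from the policy text.

```python
from fnmatch import fnmatchcase

# Action patterns copied from the "Download" statement of the policy.
ALLOWED_ACTIONS = ["s3:List*", "s3:Get*"]


def action_is_allowed(action: str, patterns: list[str]) -> bool:
    """Return True if any IAM-style action pattern matches the action.

    IAM action names are case-insensitive and support * and ? wildcards,
    so both sides are lower-cased before matching. fnmatch also gives
    [...] a special meaning that IAM does not; that is harmless here
    because these patterns contain no brackets.
    """
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


# ListObjects is authorized by s3:ListBucket on the bucket ARN itself;
# reading an object is authorized by s3:GetObject on "<bucket_arn>/*".
print(action_is_allowed("s3:ListBucket", ALLOWED_ACTIONS))
print(action_is_allowed("s3:GetObject", ALLOWED_ACTIONS))
print(action_is_allowed("s3:PutObject", ALLOWED_ACTIONS))
```

This only tests the action names; it says nothing about whether the policy is attached to the role that the pod's credentials actually resolve to.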

elyall avatar Oct 06 '23 00:10 elyall

> Thanks for the great examples! I altered the jupyterhub on eks example (for a private cluster accessed via a Tailscale VPN) and I'm now adding a ray cluster and trying to grant S3 access to the jobs running on karpenter nodes.

Amazing! Tailscale is a great addition.

While we did show S3 access via the karpenter module's iam_role_additional_policies, IIRC that was done because the RayCluster helm chart at the time did not support specifying a serviceAccountName as a value. I just looked and it seems support has now been added, i.e. you can specify head.serviceAccountName and worker.serviceAccountName as helm values, which is great. This lets us use IAM Roles for Service Accounts (IRSA), as demonstrated in the JupyterHub example. This would be the preferred way over the karpenter node role, as it restricts S3 bucket access to the RayCluster pods only.

I will update the ray blueprint soon but you can go about it on your own as it is shown for the JupyterHub example here ... https://github.com/awslabs/data-on-eks/blob/44cb0769afc752e57bfb2d11192ebcec1ce97389/ai-ml/jupyterhub/jupyterhub.tf#L4-L48

Then use this serviceAccountName as the value in the RayCluster helm chart (if you are using helm) or directly in RayCluster.yaml.
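To make the IRSA mechanism concrete, here is a minimal Python sketch of the trust policy that such a role carries: it lets exactly one Kubernetes service account, identified by namespace and name, assume the role through the cluster's OIDC provider. The function name irsa_trust_policy, the account ID, and the OIDC ID are placeholders for illustration; in practice a Terraform IRSA module generates this document for you.

```python
import json


def irsa_trust_policy(oidc_provider_arn: str, namespace: str, service_account: str) -> dict:
    """Build the trust policy an IRSA role needs for one service account."""
    # The condition keys are prefixed with the OIDC issuer host/path, which is
    # the part of the provider ARN that follows "oidc-provider/".
    issuer = oidc_provider_arn.split(":oidc-provider/", 1)[1]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": oidc_provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Must equal the pod token's subject exactly.
                        f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{issuer}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


# Placeholder account ID and OIDC ID; "ray" / "ray-single-user" are example names.
EXAMPLE_ARN = "arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE"
print(json.dumps(irsa_trust_policy(EXAMPLE_ARN, "ray", "ray-single-user"), indent=2))
```

Because the sub condition is an exact string match, the namespace and service account name given to the role must be identical to those of the service account the pods actually run as.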

Optionally you can use our aws-ia/eks-blueprints-addons/aws module, which we have provided as a convenience if you want to avoid some of the boilerplate code and create the helm_release and IRSA in a single shot (this is what I will use).

HTH, and please let us know if you run into any issues.

askulkarni2 avatar Oct 06 '23 06:10 askulkarni2

This sounds fantastic! When I tried implementing it, I got the following error:

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity

I created an additional policy and attached it to the ray_single_user_irsa but it results in the same error. Am I attaching it to the wrong role?

Here's my code:
resource "kubernetes_namespace" "ray" {
  metadata {
    name = "ray"
  }
}

module "ray_single_user_irsa" {
  source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"

  role_name = "${data.terraform_remote_state.eks.outputs.cluster_name}-ray-single-user-sa"

  role_policy_arns = {
    bucket1_get_policy = bucket1_get_policy_arn
    bucket2_get_policy = bucket2_get_policy_arn
    sts_policy         = module.ray_policy.arn
  }

  oidc_providers = {
    main = {
      provider_arn               = data.terraform_remote_state.eks.outputs.oidc_provider_arn
      namespace_service_accounts = ["${kubernetes_namespace.ray.metadata[0].name}:ray-single-user"]
    }
  }
}

resource "kubernetes_service_account_v1" "ray_single_user_sa" {
  metadata {
    name        = "${data.terraform_remote_state.eks.outputs.cluster_name}-ray-single-user"
    namespace   = kubernetes_namespace.ray.metadata[0].name
    annotations = { "eks.amazonaws.com/role-arn" : module.ray_single_user_irsa.iam_role_arn }
  }

  automount_service_account_token = true
}

resource "kubernetes_secret_v1" "ray_single_user" {
  metadata {
    name      = "${data.terraform_remote_state.eks.outputs.cluster_name}-ray-single-user-secret"
    namespace = kubernetes_namespace.ray.metadata[0].name
    annotations = {
      "kubernetes.io/service-account.name"      = kubernetes_service_account_v1.ray_single_user_sa.metadata[0].name
      "kubernetes.io/service-account.namespace" = kubernetes_namespace.ray.metadata[0].name
    }
  }

  type = "kubernetes.io/service-account-token"
}

module "ray_policy" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-policy"
  version = "~> 5.20"

  name        = "RayPolicy"
  description = "IAM Policy to allow ray to function"

  policy = jsonencode(
    {
      Version = "2012-10-17"
      Statement = [
        {
          Sid      = "AssumeRoleWithWebIdentity"
          Effect   = "Allow"
          Action   = ["sts:AssumeRoleWithWebIdentity"]
          Resource = ["*"]
        },
      ]
    }
  )
}
The full error:
ray::convert_dataset() (pid=458, ip=100.64.160.5)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/working_dir_files/_ray_pkg_20d44c85bf820c86/rb_analysis/rb/images/ngff.py", line 690, in convert_dataset
    if output_path.exists():
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/cloudpath.py", line 389, in exists
    return self.client._exists(self)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 179, in _exists
    return self._s3_file_query(cloud_path) is not None
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 197, in _s3_file_query
    return next(
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 198, in <genexpr>
    (
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/boto3/resources/collection.py", line 81, in __iter__
    for page in self.pages():
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/boto3/resources/collection.py", line 171, in pages
    for page in pages:
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/paginate.py", line 269, in __iter__
    response = self._make_request(current_kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/paginate.py", line 357, in _make_request
    return self._method(**current_kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 534, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 959, in _make_api_call
    http, parsed_response = self._make_request(
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 982, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/signers.py", line 180, in sign
    auth = self.get_auth_instance(**kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/signers.py", line 284, in get_auth_instance
    frozen_credentials = self._credentials.get_frozen_credentials()
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 610, in get_frozen_credentials
    self._refresh()
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 498, in _refresh
    self._protected_refresh(is_mandatory=is_mandatory_refresh)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 514, in _protected_refresh
    metadata = self._refresh_using()
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 661, in fetch_credentials
    return self._get_cached_credentials()
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 671, in _get_cached_credentials
    response = self._get_credentials()
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 905, in _get_credentials
    return client.assume_role_with_web_identity(**kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 534, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 976, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity

I've also passed kubernetes_service_account_v1.ray_single_user_sa.metadata[0].name to both head.serviceAccountName and worker.serviceAccountName.

elyall avatar Oct 06 '23 23:10 elyall

It looks like the role is mounted correctly:

❯ kubectl -n ray exec -it ray-cluster-cpu-kuberay-head-5ktw2 -- env | grep AWS
Defaulted container "ray-head" out of: ray-head, autoscaler
AWS_STS_REGIONAL_ENDPOINTS=regional
AWS_REGION=us-west-2
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_DEFAULT_REGION=us-west-2
AWS_ROLE_ARN=arn:aws:iam::XXXXXXXXXX:role/eks-stage-ray-single-user-sa
Here's Jupyterhub's for reference:
❯ kubectl -n jupyterhub exec -it jupyter-evan -- env | grep AWS
Defaulted container "notebook" out of: notebook, block-cloud-metadata (init)
AWS_DEFAULT_REGION=us-west-2
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_STS_REGIONAL_ENDPOINTS=regional
AWS_ROLE_ARN=arn:aws:iam::XXXXXXXXXX:role/eks-stage-jupyterhub-single-user-sa
AWS_REGION=us-west-2
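
These variables show that the role ARN and token file are injected, but not which identity the token carries. The projected token is a JWT whose sub claim has the form system:serviceaccount:<namespace>:<service account name>. A small stdlib-only Python sketch, run inside the pod, can print it. The helper name decode_jwt_claims is hypothetical, and the signature is deliberately not verified because this is for inspection only.

```python
import base64
import json
import os


def decode_jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    # base64url payloads omit padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


if __name__ == "__main__":
    token_file = os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if token_file and os.path.exists(token_file):
        with open(token_file) as fh:
            claims = decode_jwt_claims(fh.read().strip())
        # "sub" should read system:serviceaccount:<namespace>:<service account>
        print("sub:", claims.get("sub"))
        print("aud:", claims.get("aud"))
    else:
        print("AWS_WEB_IDENTITY_TOKEN_FILE is not set; run this inside the pod.")
```

The printed sub value is what the role's trust policy condition has to match.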

Also here's the trust relationship for the role via the aws console:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::XXXXXXXXXX:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX:sub": "system:serviceaccount:ray:ray-single-user",
                    "oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX:aud": "sts.amazonaws.com"
                }
            }
        }
    ]
}
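
One thing that may be worth checking, offered as a possible cause and not a confirmed diagnosis: the trust policy's sub condition names the service account ray-single-user, while the kubernetes_service_account_v1 resource above names the account with a cluster-name prefix. The Python sketch below compares the two. The cluster name "eks-stage" is an assumption inferred from the role name in AWS_ROLE_ARN, and the helper names are hypothetical. Separately, sts:AssumeRoleWithWebIdentity is authorized by the role's trust policy, so attaching an extra sts policy to the role itself would not be expected to change the outcome.

```python
def trusted_subjects(trust_policy: dict) -> set[str]:
    """Collect every value of a '...:sub' StringEquals condition."""
    subjects = set()
    for statement in trust_policy.get("Statement", []):
        conditions = statement.get("Condition", {}).get("StringEquals", {})
        for key, value in conditions.items():
            if key.endswith(":sub"):
                # IAM allows either a single string or a list of strings.
                subjects.update([value] if isinstance(value, str) else value)
    return subjects


def pod_subject(namespace: str, service_account: str) -> str:
    """The 'sub' claim carried by a pod's projected service account token."""
    return f"system:serviceaccount:{namespace}:{service_account}"


# Trust policy reduced to the sub condition shown above (OIDC ID shortened).
TRUST_POLICY = {
    "Statement": [
        {
            "Condition": {
                "StringEquals": {
                    "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:ray:ray-single-user"
                }
            }
        }
    ]
}

# ASSUMPTION: cluster name inferred from the role name eks-stage-ray-single-user-sa.
CLUSTER_NAME = "eks-stage"
# Name pattern taken from the kubernetes_service_account_v1 resource.
SERVICE_ACCOUNT = f"{CLUSTER_NAME}-ray-single-user"

print(pod_subject("ray", SERVICE_ACCOUNT) in trusted_subjects(TRUST_POLICY))
print(pod_subject("ray", "ray-single-user") in trusted_subjects(TRUST_POLICY))
```

If the first check is false in a real setup, aligning the service account name with the entry in namespace_service_accounts (or the reverse) would be the thing to try. The sketch only inspects StringEquals conditions, so a trust policy that uses StringLike would need a different comparison.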

~~It's possible this issue is with how I'm using ray as currently my ray.remote function calls the ray_dask_get scheduler meaning the remote job tries to create more remote jobs on the ray cluster. Though this is a strange error if that is indeed the issue. I can adjust my script so that the parent job is performed locally instead of on the cluster and see if that works.~~ The issue seems to occur regardless (i.e. without the recurrent remote calls).

elyall avatar Oct 09 '23 18:10 elyall

I just validated that I get the same error when trying to read from S3 on my jupyterhub deployment, despite following the guide and attaching arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess to jupyterhub_single_user_irsa. Is there a policy I first need to attach to my AWS account to allow sts:AssumeRoleWithWebIdentity to work? I'll look through the blueprints/documentation again to see if I missed something.

elyall avatar Oct 11 '23 18:10 elyall

I realize I've potentially gotten off topic from my original question. @askulkarni2 answered the question in theory. You're welcome to close the issue or leave it open for task planning the ray blueprint update. I will create a new issue with the bug I'm seeing and try to create a minimal code reproduction.

elyall avatar Oct 11 '23 23:10 elyall