Sagemaker ACK Fails to update endpoint

Open mwm5945 opened this issue 1 year ago • 9 comments

Describe the bug

Related to this issue in the CDK: https://github.com/aws/aws-cdk/issues/11594. It appears that updating an existing endpoint to use a new EndpointConfig may require contradictory IAM permissions. Updating the endpointConfigName field of an existing endpoint yields this error for me:

  - message: "AccessDeniedException: User: arn:aws:sts::<acct omitted>:assumed-role/sagemaker-provisioner/kiam-kiam
      is not authorized to perform: sagemaker:UpdateEndpoint on resource: arn:aws:sagemaker:us-east-1:<account omitted>:endpoint-config/endpoint-config-name because no identity-based policy allows the sagemaker:UpdateEndpoint action\n\tstatus
      code: 400, request id: <omitted> "
    status: "True"
    type: ACK.Recoverable

According to this doc, UpdateEndpoint only needs to be granted on the endpoint resource (identified by the endpoint name), and our internal corporate policies require that scoping. Because of those same policies, we are not able to add any EndpointConfig resources to the statement that grants UpdateEndpoint.

Steps to reproduce

IAM policy scoped as much as possible:

        {
            "Sid": "endpoint",
            "Effect": "Allow",
            "Action": [
                "sagemaker:AddTags",
                "sagemaker:DeleteTags",
                "sagemaker:CreateEndpoint",
                "sagemaker:DeleteEndpoint",
                "sagemaker:DescribeEndpoint",
                "sagemaker:UpdateEndpoint",
                "sagemaker:UpdateEndpointWeightsAndCapacities"
            ],
            "Resource": [
                "arn:aws:sagemaker:us-east-1:ACCOUNT_NUM:endpoint/test-model"
            ]
        },
        {
            "Sid": "endpointCfg",
            "Effect": "Allow",
            "Action": [
                "sagemaker:AddTags",
                "sagemaker:DeleteTags",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:CreateEndpoint",
                "sagemaker:DescribeEndpointConfig",
                "sagemaker:DeleteEndpointConfig"
            ],
            "Resource": [
                "arn:aws:sagemaker:us-east-1:ACCOUNT_NUM:endpoint-config/cfg1",
                "arn:aws:sagemaker:us-east-1:ACCOUNT_NUM:endpoint-config/cfg2"   
            ]
        },

Create the above resources, with the endpoint using cfg1, then try switching to cfg2 by changing endpointConfigName in the existing endpoint YAML.

Expected outcome

A concise description of what you expected to happen.

Environment

  • Kubernetes version: 1.22.10
  • Using EKS (yes/no), if so version? no
  • AWS service targeted (S3, RDS, etc.) sagemaker

mwm5945 avatar Sep 05 '23 19:09 mwm5945

/cc @aws-controllers-k8s/sagemaker-maintainer

a-hilaly avatar Sep 06 '23 03:09 a-hilaly

Hi mwm5945, will attempt to replicate but have a couple questions:

  1. Which controller version are you using?
  2. Is arn:aws:sts::<acct omitted>:assumed-role/sagemaker-provisioner/kiam-kiam the ACK controller role or the execution role?
  3. Do you create/remove tags in the update?
  4. Does the error go away if you have sagemaker:UpdateEndpoint in the endpointCfg statement?

ananth102 avatar Sep 06 '23 22:09 ananth102

  1. 1.2.2
  2. It's the KIAM role that the ACK role has a trust relationship with (we're not on AKS, nor do we have the newer auth method set up yet)
  3. Nope!
  4. We're not able to do so--our internal corporate policies restrict adding this statement to endpoint-configs, as it's not listed as an option here. I know doing this would work, as it worked previously; however, there was a bug in the platform that handles policy validations, which is ultimately what caused this to be discovered.

Thanks!

mwm5945 avatar Sep 06 '23 22:09 mwm5945

Hi Micheal, we are checking with the service team on this issue.

surajkota avatar Sep 07 '23 18:09 surajkota

Hi Micheal, I can confirm this is a documentation issue: the sagemaker:UpdateEndpoint permission needs to be on the endpoint config resource as well. We will work with the documentation team to update the docs.
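
For readers following along, here is a rough sketch of what that could look like when applied to the endpointCfg statement from the original report. It assumes the only change needed is adding sagemaker:UpdateEndpoint to that statement. The Sid, the ACCOUNT_NUM placeholder, and the cfg1/cfg2 ARNs are copied from the issue, and the result has not been verified against current IAM documentation:

```json
{
    "Sid": "endpointCfg",
    "Effect": "Allow",
    "Action": [
        "sagemaker:AddTags",
        "sagemaker:DeleteTags",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateEndpoint",
        "sagemaker:DescribeEndpointConfig",
        "sagemaker:DeleteEndpointConfig",
        "sagemaker:UpdateEndpoint"
    ],
    "Resource": [
        "arn:aws:sagemaker:us-east-1:ACCOUNT_NUM:endpoint-config/cfg1",
        "arn:aws:sagemaker:us-east-1:ACCOUNT_NUM:endpoint-config/cfg2"
    ]
}
```

The endpoint statement would stay as posted, so sagemaker:UpdateEndpoint would then be allowed on both the endpoint ARN and the endpoint-config ARNs. This is the same change the reporter said had worked previously but cannot apply under their internal policy rules.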

surajkota avatar Sep 11 '23 20:09 surajkota

Issues go stale after 180d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 60d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/aws-controllers-k8s/community. /lifecycle stale

ack-bot avatar Mar 10 '24 01:03 ack-bot

/remove-lifecycle stale

gecube avatar Mar 13 '24 06:03 gecube