
Keep original aws-auth configuration when the new value cannot be parsed

Open suzaku opened this issue 2 years ago • 26 comments

In the current implementation, a value like roleMappings is always reinitialized to an empty slice for any Added or Modified event:

roleMappings = make([]config.RoleMapping, 0)

Then, even when parse errors are detected, it is passed to saveMap along with other possibly valid data, and inside saveMap the field ms.roles is, again, always reinitialized to an empty slice.

It took only one typo in aws-auth to break a whole EKS cluster

We recently introduced an error in the mapRoles configuration, which produced a warning like the following:

time="2023-06-13T09:35:43Z" level=warning **msg="Errors parsing configmap: [yaml: line 13: did not find expected key]"**

This change alone broke the whole EKS cluster because, as explained above, ms.roles is always reinitialized to an empty slice, so even the nodes in the cluster were denied access with an "ARN is not mapped" error.

To make EKS clusters less fragile, we should keep the original configuration if the new one is malformed.
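
For concreteness, here is a minimal sketch of the proposed handling of Added/Modified events. The names (MapStore, RoleMapping, parseRoles, onConfigMapUpdate) mirror the discussion above, but this is illustrative code, not the authenticator's actual implementation:

package main

import (
	"fmt"
	"log"
	"sync"
)

// RoleMapping and MapStore are simplified stand-ins for the authenticator's types.
type RoleMapping struct {
	RoleARN  string
	Username string
	Groups   []string
}

type MapStore struct {
	mu    sync.RWMutex
	roles []RoleMapping
}

// parseRoles stands in for the authenticator's mapRoles parsing.
func parseRoles(mapRolesYAML string) ([]RoleMapping, error) {
	if mapRolesYAML == "" { // pretend an empty document is the malformed case
		return nil, fmt.Errorf("yaml: did not find expected key")
	}
	return []RoleMapping{{RoleARN: "arn:aws:iam::111122223333:role/node-role"}}, nil
}

// onConfigMapUpdate is called for every Added or Modified event on aws-auth.
func (ms *MapStore) onConfigMapUpdate(mapRolesYAML string) {
	roles, err := parseRoles(mapRolesYAML)
	if err != nil {
		// Proposed change: log and return, leaving ms.roles untouched, instead of
		// reinitializing it to an empty slice as the current implementation does.
		log.Printf("errors parsing configmap, keeping previous configuration: %v", err)
		return
	}
	ms.mu.Lock()
	ms.roles = roles
	ms.mu.Unlock()
}

func main() {
	ms := &MapStore{}
	ms.onConfigMapUpdate("- rolearn: arn:aws:iam::111122223333:role/node-role")
	ms.onConfigMapUpdate("")   // malformed update: the previous mapping is kept
	fmt.Println(len(ms.roles)) // 1
}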

suzaku avatar Jun 20 '23 03:06 suzaku

Welcome @suzaku!

It looks like this is your first PR to kubernetes-sigs/aws-iam-authenticator 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/aws-iam-authenticator has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot avatar Jun 20 '23 03:06 k8s-ci-robot

Hi @suzaku. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 20 '23 03:06 k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: suzaku. Once this PR has been reviewed and has the lgtm label, please assign wongma7 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Jun 20 '23 03:06 k8s-ci-robot

@nckturner PTAL

suzaku avatar Jun 20 '23 05:06 suzaku

/assign @nckturner

suzaku avatar Jun 26 '23 02:06 suzaku

/test

nnmin-aws avatar Jun 29 '23 01:06 nnmin-aws

@nnmin-aws: The /test command needs one or more targets. The following commands are available to trigger required jobs:

  • /test pull-aws-iam-authenticator-e2e
  • /test pull-aws-iam-authenticator-e2e-kind
  • /test pull-aws-iam-authenticator-integration
  • /test pull-aws-iam-authenticator-unit

Use /test all to run all jobs.

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 29 '23 01:06 k8s-ci-robot

/test all

nnmin-aws avatar Jun 29 '23 01:06 nnmin-aws

@suzaku: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • Test name: pull-aws-iam-authenticator-unit
  • Commit: 2c32a6ca88ddc18789b0bd43a75ca647a5f772e8
  • Required: true
  • Rerun command: /test pull-aws-iam-authenticator-unit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Jun 29 '23 02:06 k8s-ci-robot

/retest

suzaku avatar Jun 29 '23 05:06 suzaku

@suzaku: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 29 '23 05:06 k8s-ci-robot

From a functionality perspective, how useful is it to allow the authenticator to keep running on a stale value? When the authenticator process/pod restarts, the new authenticator cannot really parse the values and will eventually fail. If there is a problem, the authenticator should rather fail immediately. Can you describe a bit how the change handles authenticator restarts?

Also, can you specify which EKS version is being used?

jyotimahapatra avatar Jun 29 '23 06:06 jyotimahapatra

From a functionality perspective, how useful is it to allow the authenticator to keep running on a stale value? When the authenticator process/pod restarts, the new authenticator cannot really parse the values and will eventually fail. If there is a problem, the authenticator should rather fail immediately. Can you describe a bit how the change handles authenticator restarts?

If an essential component like the authenticator fails as I described, you can't even use kubectl because all access will be denied. For example, someone might accidentally introduce extra whitespace when adding a new role; I think it's much safer to keep the original value instead of suddenly disconnecting all kubelets from the API server by clearing the existing values.

There are cases where we should fail immediately: when writing a command-line program, for example, you can validate the user input and fail right away with a meaningful message. But an authenticator configuration update is not such a scenario, IMHO.

I agree that on restarts, the version with my change would still read the incorrect configuration and the roles would be empty because there's no original value to keep, but it's the same case with the current implementation, right?
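
To illustrate that point, here is a hypothetical helper (reusing the RoleMapping type from the sketch in the description, not the PR's real diff): the fallback only applies when there is a previous value, so the first load after a restart surfaces the parse error exactly like the current implementation.

// loadRoles is illustrative only: fall back to the last known-good mappings on
// a bad update, but on the first load (previous == nil) surface the error.
func loadRoles(parsed []RoleMapping, parseErr error, previous []RoleMapping) ([]RoleMapping, error) {
	if parseErr != nil {
		if previous == nil {
			return nil, parseErr // restart / first load: nothing to fall back to
		}
		return previous, nil // subsequent update: keep the last known-good mappings
	}
	return parsed, nil
}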

Also, can you specify which EKS version is being used?

It's eks.7 with Kubernetes 1.24.

suzaku avatar Jun 29 '23 09:06 suzaku

suddenly disconnecting all kubelets from the API server by clearing the existing values

I see the reasoning, but this could go both ways. We can choose to let the older values still work, but then there is no indication to the user that a mistake has been made. Finding the error abruptly later only delays it.

There may be other use cases related to security that have to be evaluated before stale values can be allowed. EKS is trying out a webhook to disallow such erroneous updates to the configmap. Could I ask you to provide the cluster ARN, here or in a support ticket, for the cluster where you are still able to apply an erroneous change to the configmap? The EKS team will look into why the webhook is not working as expected.

jyotimahapatra avatar Jun 29 '23 16:06 jyotimahapatra

suddenly disconnecting all kubelets from the API server by clearing the existing values

I see the reasoning, but this could go both ways. We can choose to let the older values still work, but then there is no indication to the user that a mistake has been made. Finding the error abruptly later only delays it.

There may be other use cases related to security that have to be evaluated before stale values can be allowed. EKS is trying out a webhook to disallow such erroneous updates to the configmap. Could I ask you to provide the cluster ARN, here or in a support ticket, for the cluster where you are still able to apply an erroneous change to the configmap? The EKS team will look into why the webhook is not working as expected.

Since the ConfigMap itself doesn't prevent such updates, it would be great if EKS could provide such validation and prevent dirty data in the first place. But it doesn't seem to be working correctly: after understanding the root cause of the incident, one of our SRE engineers created a new EKS cluster, updated aws-auth to include an invalid mapRoles, and managed to reproduce the issue. Can you try creating a new EKS cluster and see if you can successfully update the mapRoles section of aws-auth to include invalid YAML data?

suzaku avatar Jun 29 '23 22:06 suzaku

Can you try creating a new EKS cluster and see if you can successfully update the mapRoles section of aws-auth to include invalid YAML data?

Can I have a sample YAML that fails? The webhook just uses this function from the authenticator to validate. It's not very advanced because of backward-compat considerations. We're handling the most common errors we see, which will not break live clusters.

jyotimahapatra avatar Jul 01 '23 02:07 jyotimahapatra

Can you try creating a new EKS cluster and see if you can successfully update the mapRoles section of aws-auth to include invalid YAML data?

Can I have a sample YAML that fails? The webhook just uses this function from the authenticator to validate. It's not very advanced because of backward-compat considerations. We're handling the most common errors we see, which will not break live clusters.

Cool, I think calling ParseMap is good enough for detecting malformed YAML content. I guess my colleague used eks.7 for the testing cluster; do you know in which version of EKS the webhook was introduced?
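
As a rough, hypothetical illustration of that idea (not the real webhook code): a validator that applies authenticator-style parsing could reject any aws-auth update whose mapRoles section is not valid YAML, for example:

package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

// roleMapping mirrors one entry of the mapRoles section.
type roleMapping struct {
	RoleARN  string   `yaml:"rolearn"`
	Username string   `yaml:"username"`
	Groups   []string `yaml:"groups"`
}

// validateAWSAuth returns an error when mapRoles cannot be parsed, so an admission
// webhook could deny the update before it ever reaches the authenticator.
func validateAWSAuth(data map[string]string) error {
	var roles []roleMapping
	if err := yaml.Unmarshal([]byte(data["mapRoles"]), &roles); err != nil {
		return fmt.Errorf("invalid mapRoles, rejecting aws-auth update: %v", err)
	}
	return nil
}

func main() {
	bad := map[string]string{"mapRoles": "- groups: [system:nodes\n"} // unclosed flow sequence
	fmt.Println(validateAWSAuth(bad))                                 // non-nil error: the update would be denied
}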

suzaku avatar Jul 02 '23 01:07 suzaku

do you know in which version of EKS the webhook was introduced?

The webhook is added for new clusters. We'll make the changes next week so that it's added for existing clusters too. I did check, by creating a new cluster, that the webhook helps prevent the error condition you're facing.

jyotimahapatra avatar Jul 03 '23 17:07 jyotimahapatra

do you know in which version of EKS the webhook was introduced?

The webhook is added for new clusters. We'll make the changes next week so that it's added for existing clusters too. I did check, by creating a new cluster, that the webhook helps prevent the error condition you're facing.

Do you mean clusters running the latest version, or any newly created EKS cluster?

suzaku avatar Jul 04 '23 00:07 suzaku

Do you mean clusters running the latest version, or any newly created EKS cluster?

It should be available for any newly created EKS 1.22+ cluster (I tested on a 1.26 cluster). Let me know how it goes.

jyotimahapatra avatar Jul 04 '23 00:07 jyotimahapatra

@jyotimahapatra I have a question: even though our clusters are old, we keep updating them to the latest AWS EKS versions; at what point does this webhook get bootstrapped onto clusters?

As per your comments, new clusters will have the webhook in place; we replicated the same issue on version 1.24 in a new ephemeral cluster.

We are happy to discuss more on this issue.

mohanpedala avatar Jul 06 '23 15:07 mohanpedala

@jyotimahapatra I have a question: even though our clusters are old, we keep updating them to the latest AWS EKS versions; at what point does this webhook get bootstrapped onto clusters?

As per your comments, new clusters will have the webhook in place; we replicated the same issue on version 1.24 in a new ephemeral cluster.

We are happy to discuss more on this issue.

Can you please share the steps to reproduce this? What exact changes are you making to the configmap?

ivelichkovich avatar Jul 10 '23 18:07 ivelichkovich

@jyotimahapatra I have a question: even though our clusters are old, we keep updating them to the latest AWS EKS versions; at what point does this webhook get bootstrapped onto clusters? As per your comments, new clusters will have the webhook in place; we replicated the same issue on version 1.24 in a new ephemeral cluster. We are happy to discuss more on this issue.

Can you please share the steps to reproduce this? What exact changes are you making to the configmap?

  • We had a use case where we had to add a role to the configmap. In that process we accidentally deleted a letter in the key and saved the config

To replicate the issue, here are the steps.

  • If you are spinning up a new cluster, you have to remove the validatingwebhookconfiguration eks-aws-auth-configmap-validation-webhook

  • If you are running on a cluster created more than 3 to 4 weeks before the date of this comment, you most probably won't see the eks-aws-auth-configmap-validation-webhook

  • Edit the aws-auth configmap - see the example ConfigMap below; the error is on line 13, where the key "rolearn" has been misspelled as "olearn"

apiVersion: v1
data:
  mapAccounts: |
    []
  mapRoles: |
    - "groups":
      - "system:bootstrappers"
      - "system:nodes"
      "olearn": "arn:aws:iam::xxxx:role/node-role"
      "username": "system:node:{{EC2PrivateDNSName}}"
    - "groups":
      - "system:masters"
      "rolearn": "arn:aws:iam::xxxxx:role/AWSReservedSSO_AdministratorAccess_1234567890"
      "username": "aws_sso_admin"
  mapUsers: |
    []
immutable: false
kind: ConfigMap
metadata:
  creationTimestamp: "2023-07-07T15:50:37Z"
  labels:
    app.kubernetes.io/managed-by: Terraform
    terraform.io/module: terraform-aws-modules.eks.aws
  name: aws-auth
  namespace: kube-system

mohanpedala avatar Jul 12 '23 06:07 mohanpedala

@jyotimahapatra I have a question: even though our clusters are old, we keep updating them to the latest AWS EKS versions; at what point does this webhook get bootstrapped onto clusters?

As per your comments, new clusters will have the webhook in place; we replicated the same issue on version 1.24 in a new ephemeral cluster.

We are happy to discuss more on this issue.

@jyotimahapatra The cluster I created was made a week before your change was introduced; I can now see the webhook in all newly created clusters.

mohanpedala avatar Jul 12 '23 06:07 mohanpedala

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 05 '23 18:08 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 20 '24 21:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 19 '24 21:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Mar 20 '24 22:03 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 20 '24 22:03 k8s-ci-robot