aws-iam-authenticator
Keep original aws-auth configuration when the new value cannot be parsed
In the current implementation, a value like roleMappings is always reinitialized to an empty slice for any Added or Modified event:
roleMappings = make([]config.RoleMapping, 0)
Then, even when parse errors are detected, the result is passed to saveMap along with the other (possibly valid) data, and inside saveMap the field ms.roles is, again, unconditionally reinitialized to an empty slice.
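For illustration, here is a minimal sketch of the behavior this PR proposes, using hypothetical names (the real MapStore, ParseMap, and saveMap in the authenticator have different shapes): on a parse error, log and return without touching the previously stored mappings.

```go
package mapstore

import "log"

// Illustrative stand-ins for the authenticator's config types; the real
// structs live in the project's config package.
type RoleMapping struct{ RoleARN, Username string }
type UserMapping struct{ UserARN, Username string }

type MapStore struct {
	roles       []RoleMapping
	users       []UserMapping
	awsAccounts []string
}

// onUpdate sketches the behavior proposed in this PR: if the updated
// aws-auth data cannot be parsed, log the errors and keep the previous
// mappings instead of replacing them with freshly allocated empty slices.
func (ms *MapStore) onUpdate(parse func() ([]RoleMapping, []UserMapping, []string, error)) {
	roles, users, accounts, err := parse()
	if err != nil {
		log.Printf("errors parsing configmap, keeping previous configuration: %v", err)
		return // ms.roles, ms.users, and ms.awsAccounts stay untouched
	}
	ms.roles, ms.users, ms.awsAccounts = roles, users, accounts
}
```

The point of the sketch is only the early return: the stored slices are never replaced unless the new data parsed cleanly.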
It took only one typo in aws-auth to break a whole EKS cluster. We recently introduced an error in the mapRoles configuration, which resulted in the following warning:
time="2023-06-13T09:35:43Z" level=warning **msg="Errors parsing configmap: [yaml: line 13: did not find expected key]"**
And this change alone broke the whole EKS cluster because, as explained above, ms.roles is always reinitialized to an empty slice, so even the nodes in the cluster were denied access with the "ARN is not mapped" error.
To make EKS clusters less fragile, we should keep the original configuration if the new one is malformed.
Welcome @suzaku!
It looks like this is your first PR to kubernetes-sigs/aws-iam-authenticator 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.
You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.
You can also check if kubernetes-sigs/aws-iam-authenticator has its own contribution guidelines.
You may want to refer to our testing guide if you run into trouble with your tests not passing.
If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!
Thank you, and welcome to Kubernetes. :smiley:
Hi @suzaku. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: suzaku
Once this PR has been reviewed and has the lgtm label, please assign wongma7 for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@nckturner PTAL
/assign @nckturner
/test
@nnmin-aws: The /test command needs one or more targets.
The following commands are available to trigger required jobs:
/test pull-aws-iam-authenticator-e2e
/test pull-aws-iam-authenticator-e2e-kind
/test pull-aws-iam-authenticator-integration
/test pull-aws-iam-authenticator-unit
Use /test all to run all jobs.
In response to this:
/test
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test all
@suzaku: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-aws-iam-authenticator-unit | 2c32a6ca88ddc18789b0bd43a75ca647a5f772e8 | link | true | /test pull-aws-iam-authenticator-unit |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest
@suzaku: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.
In response to this:
/retest
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
From a functionality perspective, how useful is it to allow the authenticator to keep running on stale values? When the authenticator process/pod restarts, the new authenticator cannot parse the values and will eventually fail. If there is a problem, the authenticator should rather fail immediately. Can you describe a bit how the change handles authenticator restarts?
Also, can you specify which EKS version is being used?
From a functionality perspective, how useful is it to allow the authenticator to keep running on stale values? When the authenticator process/pod restarts, the new authenticator cannot parse the values and will eventually fail. If there is a problem, the authenticator should rather fail immediately. Can you describe a bit how the change handles authenticator restarts?
If an essential component like the authenticator fails as I described, you can't even use kubectl because all access is denied.
For example, someone might accidentally introduce extra whitespace when adding a new role. I think it's much safer to keep the original value instead of suddenly disconnecting all kubelets from the API server by clearing the existing values.
There are cases where we should fail immediately: for example, when writing a command line program, you can validate the user input and fail immediately with a meaningful message. But an authenticator configuration update is not such a scenario, IMHO.
I agree that on restarts, the version with my change would still read the incorrect configuration and the roles would be empty because there's no original value to keep, but that's the same as the current implementation, right?
Also, can you specify which EKS version is being used?
It's eks.7 with Kubernetes 1.24.
suddenly disconnecting all kubelets from the API server by clearing the existing values
I see the reasoning here, but this could go both ways. We can choose to let the older values still work, but then there is no indication to the user that a mistake has been made. Finding the error abruptly later only delays it.
There may be other use cases related to security that would have to be evaluated before letting stale values be allowed. EKS is trying out a webhook to disallow such erroneous updates to the configmap. Could I request you to provide, here or in a support ticket, the ARN of a cluster where you are still able to apply an erroneous change to the configmap? The EKS team will look into the reason why the webhook is not working as expected.
Since the ConfigMap itself doesn't prevent such updates, it would be great if EKS could provide this kind of validation and prevent dirty data in the first place.
But it seems to not be working correctly: after understanding the root cause of the incident, one of our SRE engineers created a new EKS cluster, updated aws-auth to include an invalid map-roles section, and managed to reproduce the issue.
Can you try creating a new EKS cluster and see if you can successfully update the map-roles section of aws-auth to include invalid YAML data?
Can I have a sample YAML that fails? The webhook just uses this function from the authenticator to validate. It's not very advanced because of backward-compatibility considerations. We're handling the most common errors we see, which will not break live clusters.
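To make the kind of check being described concrete, here is a minimal sketch, assuming a hypothetical validateMapRoles helper and using gopkg.in/yaml.v2 for illustration; it is not the webhook's or the authenticator's actual ParseMap code, it only shows rejecting a malformed mapRoles document before it is applied.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

// validateMapRoles is a hypothetical pre-check: it only verifies that the
// raw mapRoles document is well-formed YAML before the update is accepted.
func validateMapRoles(doc string) error {
	var entries []map[string]interface{}
	if err := yaml.Unmarshal([]byte(doc), &entries); err != nil {
		return fmt.Errorf("mapRoles is not valid YAML, rejecting update: %v", err)
	}
	return nil
}

func main() {
	// Extra leading whitespace before "username" (the kind of slip mentioned
	// earlier in this thread) makes the document fail to parse.
	broken := `
- "rolearn": "arn:aws:iam::xxxx:role/node-role"
   "username": "system:node:{{EC2PrivateDNSName}}"
  "groups":
  - "system:nodes"
`
	if err := validateMapRoles(broken); err != nil {
		fmt.Println(err)
	}
}
```

Note that a syntax check like this only catches the "did not find expected key" class of mistakes; the key-name typo discussed below still parses cleanly.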
Cool, I think calling ParseMap is good enough for detecting malformed YAML contents.
I guess my colleague used eks.7 for the testing cluster. Do you know in which version of EKS the webhook was introduced?
The webhook is added for new clusters. We'll make the changes next week so that it's added for existing clusters too. I did check, by creating a new cluster, that the webhook helps prevent the error condition you're facing.
Do you mean clusters on the latest version, or any newly created EKS cluster?
It should be available for any newly created EKS 1.22+ cluster (I tested on a 1.26 cluster). Let me know how it goes.
@jyotimahapatra I have a question: even though our clusters are old, we keep updating them to the latest EKS versions; at what point does this webhook get bootstrapped onto such clusters?
As per your comments, new clusters will have the webhook in place; we replicated the same issue on a new ephemeral cluster running version 1.24.
We are happy to discuss this issue further.
Can you please share the steps to reproduce this? What exact changes are you making to the configmap?
- We had a use case where we had to add a role to the configmap. In that process we accidentally deleted a letter in one of the keys and saved the config.
To replicate the issue, here are the steps:
- If you are spinning up a new cluster, you have to remove the validatingwebhookconfiguration eks-aws-auth-configmap-validation-webhook.
- If you are running on a cluster older than 3 to 4 weeks from the date of this comment, you most probably won't see the eks-aws-auth-configmap-validation-webhook.
- Edit the aws-auth ConfigMap. In the example ConfigMap below, the error is in the first mapRoles entry, where the key "rolearn" was accidentally typed as "olearn".
apiVersion: v1
data:
  mapAccounts: |
    []
  mapRoles: |
    - "groups":
      - "system:bootstrappers"
      - "system:nodes"
      "olearn": "arn:aws:iam::xxxx:role/node-role"
      "username": "system:node:{{EC2PrivateDNSName}}"
    - "groups":
      - "system:masters"
      "rolearn": "arn:aws:iam::xxxxx:role/AWSReservedSSO_AdministratorAccess_1234567890"
      "username": "aws_sso_admin"
  mapUsers: |
    []
immutable: false
kind: ConfigMap
metadata:
  creationTimestamp: "2023-07-07T15:50:37Z"
  labels:
    app.kubernetes.io/managed-by: Terraform
    terraform.io/module: terraform-aws-modules.eks.aws
  name: aws-auth
  namespace: kube-system
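As an illustration of why this particular typo is nasty: "olearn" is still valid YAML, so the document parses cleanly but the required rolearn field comes back empty. A minimal sketch of that effect, using gopkg.in/yaml.v2 and an illustrative struct rather than the authenticator's own config type:

```go
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v2"
)

// roleMapping is an illustrative struct, not the authenticator's config type.
type roleMapping struct {
	RoleARN  string   `yaml:"rolearn"`
	Username string   `yaml:"username"`
	Groups   []string `yaml:"groups"`
}

func main() {
	// The same typo as in the ConfigMap above: "olearn" instead of "rolearn".
	mapRoles := `
- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "olearn": "arn:aws:iam::xxxx:role/node-role"
  "username": "system:node:{{EC2PrivateDNSName}}"
`
	var mappings []roleMapping
	if err := yaml.Unmarshal([]byte(mapRoles), &mappings); err != nil {
		log.Fatalf("unexpected parse error: %v", err)
	}
	for i, m := range mappings {
		if m.RoleARN == "" {
			// The document parsed, but the node role's ARN is gone: this
			// entry will never match the node's identity, which is how the
			// "ARN is not mapped" denials start.
			fmt.Printf("mapRoles entry %d has an empty rolearn\n", i)
		}
	}
}
```

A check along these lines (requiring rolearn to be non-empty) is presumably what has to happen on top of plain YAML parsing for the webhook to catch this class of mistake.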
@jyotimahapatra The cluster I created was a week before your change was introduced; I can now see the webhook in all newly created clusters.
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.