MGMT-18684: Ensure that bootstrap CSR is authenticated against cluster hosts.
Once the minimal control plane has been formed and consists of two nodes, we start the assisted-installer-controller pod and then reboot the bootstrap node once the controller is ready. The bootstrap node attempts to join the control plane when it boots up, and as part of this process CSRs are submitted. Presently, we simply assume that these are meant to be signed and blindly approve them.
This is problematic from a security perspective and could lead to an attack where, with careful timing, someone might be able to add a rogue node to the cluster. To prevent this, we should check that the hostname being presented in the CSR actually matches one of our known hosts.
Doing this requires two steps:
1: Before the assisted-installer-controller pod starts, save the list of known hosts in a config map called known-hosts. This will contain a single entry, content, whose value is a newline-separated list of hostnames.
2: During CSR validation, read this list and use it to verify that the CSR originates from a known host.
@paul-maidment: This pull request references MGMT-18684 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.18.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: paul-maidment
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [paul-maidment]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Codecov Report
Attention: Patch coverage is 50.00000% with 9 lines in your changes missing coverage. Please review.
Project coverage is 55.45%. Comparing base (4635ddf) to head (a85a996). Report is 43 commits behind head on master.
:exclamation: Current head a85a996 differs from pull request most recent head 843614d
Please upload reports for the commit 843614d to get more accurate results.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/installer/installer.go | 50.00% | 6 Missing and 3 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #895 +/- ##
==========================================
- Coverage 55.70% 55.45% -0.25%
==========================================
Files 15 15
Lines 3208 3226 +18
==========================================
+ Hits 1787 1789 +2
- Misses 1249 1262 +13
- Partials 172 175 +3
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/installer/installer.go | 67.35% <50.00%> (-1.79% :arrow_down:) | |
@paul-maidment: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/edge-unit-test | a85a9968329743f34d258967ccd8f03cb02b372c | link | true | /test edge-unit-test |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/hold Since it's trivial to create CSRs with any hostname, this doesn't improve security
Thank you for your response, Omer. The PR is still in draft as I am contemplating how to truly solve this issue.
I accept the point that this does little to nothing to advance security, as I can see that you can't be sure that the subject of the CSR is actually the host it identifies.
However, thinking carefully about the original request in MGMT-18684, I am not actually sure we even have a security issue in the first place; at the very least, the scope of the problem is more limited and less critical than we might think.
"This means that there is a window during installation when a rogue node can join the cluster with no verification. "
is partially true, but we need to consider how a CSR can be registered with the cluster:
- By an authenticated user of the control plane who has been granted permission to create CSRs ( https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#authorization )
This means that they
- Need a Kubeconfig
- Need to have authority to create CSRs
Doesn't this mean that the joining node needs to satisfy these requirements when registering the CSR?
As identified in https://issues.redhat.com/browse/MGMT-18684, it's possible that the user may register a CSR for a subject that is not a node and automatically receive approval for it. Ideally we should prevent this.
If so, then we only really need to check in a similar fashion to how the cluster-machine-approver does already: https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L364
Granted, we would probably do best to forgo the "pre-reboot" information-gathering phase that I implement in this PR and instead rely on DNS records in the cluster post-reboot.
This means that they
- Need a Kubeconfig
- Need to have authority to create CSRs
Doesn't this mean that the joining node needs to satisfy these requirements when registering the CSR?
1. "Need a kubeconfig" - all you need is network connectivity and the IP address of the API server. `kubeconfig` is "needed" only to make `oc` work, but users can just as well `curl` the API server or use whatever automation they want; it's just HTTP requests. Also, MCS will give you this kubeconfig for free, unauthenticated, inside the ignition when you query it for ignition.
2. Pretty sure in OCP anyone is authorized to create CSRs; this is how completely novel nodes without any credentials or information about the cluster are able to create CSRs that allow them to join. CSR _approvals_ are the only line of defense we have against rogue nodes fully joining the cluster, as far as my understanding goes.
This means that they
- Need a Kubeconfig
- Need to have authority to create CSRs
Doesn't this mean that the joining node needs to satisfy these requirements when registering the CSR?
1. "Need a kubeconfig" - all you need is network connectivity and the IP address of the API server. `kubeconfig` is "needed" only to make `oc` work, but users can just as well `curl` the API server or use whatever automation they want; it's just HTTP requests. Also, MCS will give you this kubeconfig for free, unauthenticated, inside the ignition when you query it for ignition.
2. Pretty sure in OCP anyone is authorized to create CSRs; this is how completely novel nodes without any credentials or information about the cluster are able to create CSRs that allow them to join. CSR _approvals_ are the only line of defense we have against rogue nodes fully joining the cluster, as far as my understanding goes.
What are your thoughts on this implementation here?
https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L364
The checks we would perform ensure that a node with a specific name:
- Does not already exist.
- Is resolvable to an IP within the cluster DNS on the cluster side.
- Resolves to a known machine.
Are these sufficient to prevent a rogue node from joining, or do they still fall short? Is cluster-machine-approver secure enough to be trusted, or is it prone to the same issues you highlight?
Is it still viable that an attacker with the right kind of static IP setup and timing could add a rogue node? I think maybe, but it would certainly be a lot harder to pull off...
If an attack on cluster-machine-approver is possible, does this mean we have a broader issue?
Also MCS will give you this kubeconfig for free, unauthenticated when you query it for ignition inside the ignition
This sounds like a sizeable security hole, much worse than the timing issue as if I have the KubeConfig I have full control, I don't need to play games to add nodes, I can just do it...
"Pretty sure in OCP anyone is authorized to create CSRs, this is how completely novel nodes without any credentials or information about the cluster are able to create CSRs that allow them to join. CSR approvals are the only line of defense we have against rogue nodes fully joining the cluster as far as my understanding goes"
I have tried various ways of using curl requests to create and list CSRs as an anonymous user, and I am pretty sure I am using the correct syntax (which oc will quite happily dump for you if you add a parameter like -v 1000000):
curl -v -k -XGET -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: oc/4.16.0 (linux/amd64) kubernetes/e8fb3c0" 'https://api.testcluster.multinode.local:6443/apis/certificates.k8s.io/v1/certificatesigningrequests?limit=500'
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"certificatesigningrequests.certificates.k8s.io is forbidden: User \"system:anonymous\" cannot list resource \"certificatesigningrequests\" in API group \"certificates.k8s.io\" at the cluster scope","reason":"Forbidden","details":{"group":"certificates.k8s.io","kind":"certificatesigningrequests"},"code":403}
I receive similar messages for creation attempts as well. It seems that there is no "anonymous" way to create a CSR over HTTPS; I could be mistaken, but I think we should check this in detail. Do you know who might be able to verify this? Someone on the Node team, perhaps?
These requests are being made against the "minimal" control plane that is created just before the bootstrap joins.
Are these sufficient to prevent a rogue node from joining or do they still fall short?
If an attack on cluster-machine-approver is possible, does this mean we have a broader issue?
What are your thoughts on this implementation here?
Not sure; it seems rather insufficient. But it depends on the requirements for cluster-machine-approver and its threat model; I can't comment on the security of that component without having this context.
I can however comment on assisted-installer, which is supposed to be a user-friendly installer, not meant for power users, so the security requirements should be tight and fool-proof or come with a ton of UI warnings to help users protect themselves from our non-ideal behavior.
Of course your PR is not making anything worse, but if we're pretending to fix our security posture with regards to how we choose to approve CSRs, we should actually do that right and not use some weak measures like comparing easily fakeable fields on a CSR. I don't get the point of that.
Is resolvable to an IP within the cluster DNS on the cluster side
Is it still viable that an attacker with the right kind of static IP setup and timing could add a rogue node? I think maybe, but certainly a lot harder to pull off...
What does "resolvable to an IP" mean here? If you're referring to the IP from which the request to create the CSR came, then that definitely makes life harder for an attacker. But I'm not sure that information is even available outside of audit logs. If you're referring to the IP addresses specified in the CSR itself, anyone can set those to anything arbitrarily as far as I'm aware, just like they can set the hostname.
This sounds like a sizeable security hole, much worse than the timing issue as if I have the KubeConfig I have full control, I don't need to play games to add nodes, I can just do it...
It gives you a kubeconfig, not an admin kubeconfig. It's basically a YAML file with a URL in it.
I receive similar messages for creation attempts also. It seems that there is not an "anonymous" way to create a CSR over HTTPS, I could be mistaken but I think we should check this in detail.
There must be some way to do it, as this is what the kubelet on fresh nodes with 0 credentials does (all they have is RHCOS + the trivially publicly available ignition they got from MCS), so you're probably missing something.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
PR needs rebase.
@openshift-bot: Closed this PR.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.
/close