MGMT-18684: Ensure that bootstrap CSR is authenticated against cluster hosts.
Once the minimal control plane has been formed and consists of two nodes, we start the assisted-installer-controller pod and then reboot the bootstrap node once the controller is ready. The bootstrap node attempts to join the control plane when it boots up, and as part of this process CSRs are submitted. Presently, we simply assume that these are meant to be signed and blindly approve them.
This is problematic from a security perspective and could lead to an attack where, with careful timing, someone might be able to add a rogue node to the cluster. To prevent this, we should check that the hostname being presented in the CSR actually matches one of our known hosts.
Doing this requires two steps:
1: Before the assisted-installer-controller pod starts, save the list of known hosts in a config map called known-hosts. This will contain a single entry, content, whose value is a newline-separated list of hostnames.
2: During CSR validation, read this list and use it to verify that the CSR originates from a known host.
@paul-maidment: This pull request references MGMT-18684 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.18.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: paul-maidment
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [paul-maidment]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Codecov Report
Attention: Patch coverage is 50.00000% with 9 lines in your changes missing coverage. Please review.
Project coverage is 55.45%. Comparing base (4635ddf) to head (a85a996). Report is 43 commits behind head on master.
:exclamation: Current head a85a996 differs from pull request most recent head 843614d
Please upload reports for the commit 843614d to get more accurate results.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/installer/installer.go | 50.00% | 6 Missing and 3 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #895 +/- ##
==========================================
- Coverage 55.70% 55.45% -0.25%
==========================================
Files 15 15
Lines 3208 3226 +18
==========================================
+ Hits 1787 1789 +2
- Misses 1249 1262 +13
- Partials 172 175 +3
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/installer/installer.go | 67.35% <50.00%> (-1.79% :arrow_down:) | |
@paul-maidment: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/edge-unit-test | a85a9968329743f34d258967ccd8f03cb02b372c | link | true | /test edge-unit-test |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/hold Since it's trivial to create CSRs with any hostname, this doesn't improve security
Thank you for your response, Omer. The PR is still in draft as I am contemplating how to truly solve this issue.
I accept the point that this does little to nothing to advance security, as I can see that you can't be sure that the subject of the CSR is actually the host it identifies.
However, thinking carefully about the original request in MGMT-18684, I am not actually sure we even have a security issue in the first place; at the very least, the scope of the problem is more limited and less critical than we might think.
"This means that there is a window during installation when a rogue node can join the cluster with no verification. "
is partially true, but we need to consider how a CSR can be registered with the cluster:
- By an authenticated user of the control plane who has been granted permission to create CSRs ( https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#authorization )
This means that they
- Need a Kubeconfig
- Need to have authority to create CSRs
Doesn't this mean that the joining node needs to satisfy these requirements when registering the CSR?
As identified in https://issues.redhat.com/browse/MGMT-18684, it's possible that the user may register a CSR for a subject that is not a node and automatically receive approval for it. Ideally we should prevent this.
If so, then we only really need to check in a similar fashion to how the cluster-machine-approver does already: https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L364
Granted, we would probably do best to forgo the "pre-reboot" information-gathering phase that I implement in this PR and instead rely on DNS records in the cluster post-reboot.
This means that they
- Need a Kubeconfig
- Need to have authority to create CSRs
Doesn't this mean that the joining node needs to satisfy these requirements when registering the CSR?
1. "Need a kubeconfig" - all you need is network connectivity and the IP address of the API server. `kubeconfig` is "needed" only to make `oc` work, but users can just as well `curl` the API server or use whatever automation they want; it's just HTTP requests. Also, MCS will give you this kubeconfig for free, unauthenticated, inside the ignition when you query it for ignition.
2. Pretty sure in OCP anyone is authorized to create CSRs; this is how completely novel nodes without any credentials or information about the cluster are able to create CSRs that allow them to join. CSR _approvals_ are the only line of defense we have against rogue nodes fully joining the cluster, as far as my understanding goes.
This means that they
- Need a Kubeconfig
- Need to have authority to create CSRs
Doesn't this mean that the joining node needs to satisfy these requirements when registering the CSR?
1. "Need a kubeconfig" - all you need is network connectivity and the IP address of the API server. `kubeconfig` is "needed" only to make `oc` work, but users can just as well `curl` the API server or use whatever automation they want; it's just HTTP requests. Also, MCS will give you this kubeconfig for free, unauthenticated, inside the ignition when you query it for ignition.
2. Pretty sure in OCP anyone is authorized to create CSRs; this is how completely novel nodes without any credentials or information about the cluster are able to create CSRs that allow them to join. CSR _approvals_ are the only line of defense we have against rogue nodes fully joining the cluster, as far as my understanding goes.
What are your thoughts on this implementation here?
https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L364
The checks we would perform ensure that a node with a specific name:
- Does not already exist.
- Is resolvable to an IP within the cluster DNS on the cluster side.
- Resolves to a known machine.
Are these sufficient to prevent a rogue node from joining, or do they still fall short? Is cluster-machine-approver secure enough to be trusted, or is it prone to the same issues you highlight?
Is it still viable that an attacker with the right kind of static IP setup and timing could add a rogue node? I think maybe, but it would certainly be a lot harder to pull off...
If an attack on cluster-machine-approver is possible, does this mean we have a broader issue?
Also MCS will give you this kubeconfig for free, unauthenticated when you query it for ignition inside the ignition
This sounds like a sizeable security hole, much worse than the timing issue as if I have the KubeConfig I have full control, I don't need to play games to add nodes, I can just do it...
"Pretty sure in OCP anyone is authorized to create CSRs, this is how completely novel nodes without any credentials or information about the cluster are able to create CSRs that allow them to join. CSR approvals are the only line of defense we have against rogue nodes fully joining the cluster as far as my understanding goes"
I have tried various ways of using curl requests to create and list CSRs as an anonymous user, and I am pretty sure I am using the correct syntax (which oc will quite happily dump for you if you add a parameter like -v 1000000):
curl -v -k -XGET -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: oc/4.16.0 (linux/amd64) kubernetes/e8fb3c0" 'https://api.testcluster.multinode.local:6443/apis/certificates.k8s.io/v1/certificatesigningrequests?limit=500'
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"certificatesigningrequests.certificates.k8s.io is forbidden: User \"system:anonymous\" cannot list resource \"certificatesigningrequests\" in API group \"certificates.k8s.io\" at the cluster scope","reason":"Forbidden","details":{"group":"certificates.k8s.io","kind":"certificatesigningrequests"},"code":403}
I receive similar messages for creation attempts as well. It seems that there is no "anonymous" way to create a CSR over HTTPS; I could be mistaken, but I think we should check this in detail. Do you know who might be able to verify this? Someone on the Node team, perhaps?
These requests are being made against the "minimal" control plane that is created just before the bootstrap joins.
Are these sufficient to prevent a rogue node from joining or do they still fall short?
If an attack on cluster-machine-approver is possible, does this mean we have a broader issue?
What are your thoughts on this implementation here?
Not sure; it seems rather insufficient. But it depends on the requirements for cluster-machine-approver and its threat model; I can't comment on the security of that component without having this context.
I can however comment on assisted-installer, which is supposed to be a user-friendly installer, not meant for power users, so the security requirements should be tight and fool-proof or come with a ton of UI warnings to help users protect themselves from our non-ideal behavior.
Of course your PR is not making anything worse, but if we're pretending to fix our security posture with regards to how we choose to approve CSRs, we should actually do that right and not use some weak measures like comparing easily fakeable fields on a CSR. I don't get the point of that.
Is resolvable to an IP within the cluster DNS on the cluster side
Is it still viable that an attacker with the right kind of static IP setup and timing could add a rogue node? I think maybe, but certainly a lot harder to pull off...
What does "resolvable to an IP" mean here? If you're referring to the IP from which the request to create the CSR came, then that definitely makes life harder for an attacker. But I'm not sure that information is even available outside of audit logs. If you're referring to the IP addresses specified in the CSR itself, anyone can set those to anything arbitrarily as far as I'm aware, just like they can set the hostname.
This sounds like a sizeable security hole, much worse than the timing issue as if I have the KubeConfig I have full control, I don't need to play games to add nodes, I can just do it...
It gives you a kubeconfig, not an admin kubeconfig. It's basically a YAML file with a URL in it.
I receive similar messages for creation attempts also. It seems that there is not an "anonymous" way to create a CSR over HTTPS, I could be mistaken but I think we should check this in detail.
There must be some way to do it, as this is what the kubelet on fresh nodes with 0 credentials does (all they have is RHCOS + the trivially publicly available ignition they got from MCS), so you're probably missing something.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
PR needs rebase.
@openshift-bot: Closed this PR.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.
/close