
TLS 1.3 / Modern profile tests

Open: jacobsee opened this pull request 8 months ago • 102 comments

This PR:

  • Removes a skip on an existing TLS test - the skip appears to have been due to a setup issue.
  • Adds a test ensuring that components accept connections only at the expected TLS version or higher (a rough sketch of the kind of check follows below).
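
Roughly, the new check is of the following shape (a minimal sketch only; the endpoint address, the use of InsecureSkipVerify, and the hard-coded TLS 1.2 ceiling are illustrative placeholders, not the actual test code):

```go
// Illustrative only: verify that an endpoint refuses handshakes below the
// expected minimum TLS version.
package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	addr := "api.example.openshift.test:6443" // placeholder endpoint

	// Offer at most TLS 1.2; an endpoint configured for the Modern profile
	// (minimum TLS 1.3) should refuse the handshake.
	conn, err := tls.Dial("tcp", addr, &tls.Config{
		MaxVersion:         tls.VersionTLS12,
		InsecureSkipVerify: true, // test-only: we only care about version negotiation
	})
	if err == nil {
		conn.Close()
		fmt.Println("unexpected: TLS 1.2 handshake succeeded")
		return
	}
	fmt.Println("TLS 1.2 rejected as expected:", err)
}
```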

cc: @dusk125

jacobsee avatar Mar 19 '25 23:03 jacobsee

Supports OCPBUGS-37706, OCPSTRAT-1364

jacobsee avatar Mar 20 '25 06:03 jacobsee

Job Failure Risk Analysis for sha: 6c00d7595e29ced0457431e47381bbbd6bb7732d

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-gcp-ovn-etcd-scaling High
[sig-architecture] platform pods in ns/openshift-etcd should not exit an excessive amount of times
This test has passed 100.00% of 1 runs on jobs [periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-etcd-scaling periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling] in the last 14 days.

openshift-trt[bot] avatar Mar 20 '25 11:03 openshift-trt[bot]

/lgtm

dusk125 avatar Mar 20 '25 18:03 dusk125

/retest-required

jacobsee avatar Mar 20 '25 20:03 jacobsee

Job Failure Risk Analysis for sha: ff621200d7324a17f2efb668a046241be92fd5dc

Job Name Failure Risk
pull-ci-openshift-origin-main-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback MissingData
pull-ci-openshift-origin-main-e2e-aws-disruptive Medium
[sig-node] static pods should start after being created
Potential external regression detected for High Risk Test analysis
pull-ci-openshift-origin-main-e2e-azure-ovn-etcd-scaling Low
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded
This test has passed 0.00% of 1 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week.
---
[bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
This test has passed 0.00% of 1 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week.
---
[bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
This test has passed 0.00% of 1 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week.
pull-ci-openshift-origin-main-e2e-gcp-disruptive Medium
[sig-node] static pods should start after being created
Potential external regression detected for High Risk Test analysis
---
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
Potential external regression detected for High Risk Test analysis
pull-ci-openshift-origin-main-e2e-gcp-ovn-etcd-scaling Medium
[bz-etcd][invariant] alert/etcdMembersDown should not be at or above info
Potential external regression detected for High Risk Test analysis
---
[sig-architecture] platform pods in ns/openshift-etcd should not exit an excessive amount of times
Potential external regression detected for High Risk Test analysis

openshift-trt[bot] avatar Mar 21 '25 01:03 openshift-trt[bot]

Which jobs run the new test? I have checked metal-ipi-serial, aws-ovn-microshift-serial and aws-ovn-serial but didn't find the new test on the list.

p0lyn0mial avatar Mar 21 '25 15:03 p0lyn0mial

Job Failure Risk Analysis for sha: b6d72fde3e33918c98d308205a1485ad12e5a221

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-gcp-disruptive Medium
[sig-node] static pods should start after being created
Potential external regression detected for High Risk Test analysis

openshift-trt[bot] avatar Mar 22 '25 01:03 openshift-trt[bot]

Which jobs run the new test? I have checked metal-ipi-serial, aws-ovn-microshift-serial and aws-ovn-serial but didn't find the new test on the list.

@jacobsee I think you need to run make update to actually "add" the test to the list of tests to be executed. For example: https://github.com/openshift/origin/pull/29191/commits

p0lyn0mial avatar Mar 24 '25 12:03 p0lyn0mial

@p0lyn0mial Thanks! I hadn't seen that. It's in there now, and I've realized that it should probably be marked [slow] as well. To your earlier comment, do we still need to pick jobs to run this? This is my first brush with the origin tests, so I'm not sure what is needed to get a new one plugged in correctly.

jacobsee avatar Mar 25 '25 06:03 jacobsee

Job Failure Risk Analysis for sha: 39e5f2297637306461c6bccc35c9c781c6829e13

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-etcd-scaling Low
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded
This test has passed 50.00% of 2 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.
pull-ci-openshift-origin-main-e2e-gcp-disruptive High
[bz-Monitoring] clusteroperator/monitoring should not change condition/Degraded
This test has passed 99.30% of 5292 runs on release 4.19 [Overall] in the last week.
---
[bz-Monitoring] clusteroperator/monitoring should not change condition/Available
This test has passed 99.45% of 5292 runs on release 4.19 [Overall] in the last week.

openshift-trt[bot] avatar Mar 25 '25 13:03 openshift-trt[bot]

To your earlier comment, do we still need to pick jobs to run this?

It looks like you could use /payload-job-with-prs to test your new test with the o/k PR, for example:

/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn openshift/kubernetes#2135

xref: https://docs.ci.openshift.org/docs/release-oversight/pull-request-testing/#payload-job

We just need to find a periodic job that will run your test. Maybe some serial test from ?

So, overall:

  1. we add a new test to o/o (this PR)
  2. we use the /payload-job-with-prs command to run the test with the o/k PR
  3. we make sure the test is green
  4. we merge the o/k PR
  5. we merge this PR

p0lyn0mial avatar Mar 25 '25 14:03 p0lyn0mial

@p0lyn0mial: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/745c1ee0-0985-11f0-8a93-335e1efa7bdb-0

openshift-ci[bot] avatar Mar 25 '25 14:03 openshift-ci[bot]

@p0lyn0mial Addressed several things - the wait for etcd to stabilize was a deliberate decision after a chat with @dusk125, however, so maybe we should bring that context into this thread. The thinking was that it is also desirable to have a test for etcd's TLS 1.3 stability, and that test would share about 90% of its code with this one (just with a different check in the middle), so we might as well ensure that both stabilize here to avoid duplicating an already-slow test.

jacobsee avatar Mar 31 '25 05:03 jacobsee

Job Failure Risk Analysis for sha: 76dc7ef5f57fe9ffedd977c9a8a1c35062660a81

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-gcp-ovn-etcd-scaling High
[sig-architecture] platform pods in ns/openshift-cluster-csi-drivers should not exit an excessive amount of times
This test has passed 99.98% of 6449 runs on release 4.19 [Overall] in the last week.

openshift-trt[bot] avatar Mar 31 '25 11:03 openshift-trt[bot]

/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn openshift/kubernetes#2135

jacobsee avatar Mar 31 '25 14:03 jacobsee

@jacobsee: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5ac351c0-0e39-11f0-9a0c-207cdc87222c-0

openshift-ci[bot] avatar Mar 31 '25 14:03 openshift-ci[bot]

@p0lyn0mial I might not be looking in the right place/way but I'm not seeing jobs that run a [serial][slow] test - could it be that it wouldn't get picked up by any? https://sippy.dptools.openshift.org/sippy-ng/tests/4.19?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522%255Bserial%255D%2522%257D%252C%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522%255Bslow%255D%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&pageSize=100&sort=asc&sortField=current_working_percentage

There are a few tests in this repository in a similar boat (like TestImportRepositoryFromBlockedRegistry) that I don't see runs for.

This new test works when I run it manually but do you have any tips for getting it integrated into existing jobs?

jacobsee avatar Apr 03 '25 05:04 jacobsee

@p0lyn0mial I might not be looking in the right place/way but I'm not seeing jobs that run a [serial][slow] test - could it be that it wouldn't get picked up by any? https://sippy.dptools.openshift.org/sippy-ng/tests/4.19?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522%255Bserial%255D%2522%257D%252C%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522%255Bslow%255D%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&pageSize=100&sort=asc&sortField=current_working_percentage

There are a few tests in this repository in a similar boat (like TestImportRepositoryFromBlockedRegistry) that I don't see runs for.

This new test works when I run it manually but do you have any tips for getting it integrated into existing jobs?

@wangke19 @bertinatto could you help with ^ ?

p0lyn0mial avatar Apr 03 '25 07:04 p0lyn0mial

@p0lyn0mial Addressed several things - the wait for etcd to stabilize was a deliberate decision after a chat with @dusk125, however, so maybe we should bring that context into this thread. The thinking was that it is also desirable to have a test for etcd's TLS 1.3 stability

WDYM? I think that jobs in CI run monitors that ensure the overall stability of the cluster. Otherwise, each individual test would have to implement this functionality on its own.

p0lyn0mial avatar Apr 03 '25 07:04 p0lyn0mial

@jacobsee on second thought, maybe we should actually test more control plane components.

Here is a list of components that will be affected after changing the profile.

For instance, the aggregated APIs (OpenShift API server and OAuth API server) are behind the kube-apiserver (KAS). So, it's not enough to just check the connection to KAS. KAS has a built-in proxy that forwards requests to the aggregated APIs. To properly test the aggregated APIs and the proxy client in KAS, we would need to send an HTTPS request through KAS.
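
A rough sketch of what such a request could look like, assuming a placeholder host and bearer token, and using discovery of apps.openshift.io as one example of a group served by the OpenShift API server (not actual test code):

```go
// Illustrative only: exercise an aggregated API group through the
// kube-apiserver so that both the client->KAS hop and the KAS->aggregated-API
// proxy hop are covered.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	host := "https://api.example.openshift.test:6443" // placeholder KAS endpoint
	token := "REDACTED"                               // placeholder bearer token

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				MinVersion:         tls.VersionTLS13, // only offer TLS 1.3 on the client->KAS hop
				InsecureSkipVerify: true,             // test-only
			},
		},
	}

	// Discovery against an aggregated group forces KAS to proxy the request
	// to the aggregated server behind it.
	req, _ := http.NewRequest("GET", host+"/apis/apps.openshift.io/v1", nil)
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode, "client->KAS TLS:", tls.VersionName(resp.TLS.Version))
}
```

A 200 response here would mean that both the client-to-KAS handshake (restricted to TLS 1.3 above) and the KAS proxy's onward connection to the aggregated server worked under the new profile.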

I believe it's worth the effort — do you agree? Should we also test the other components?

p0lyn0mial avatar Apr 03 '25 17:04 p0lyn0mial

@p0lyn0mial I do agree that testing other components makes sense if we anticipate an impact to them; I'm unsure whether it makes more sense to try to wrap it all into one PR upfront or just get started with something like this... Regarding that impact to other components, I tried to solicit some additional opinions, as I haven't spent much time working with them, and found out that what we're doing here may actually fall under the coverage of an existing test. I will track that down in the morning.

jacobsee avatar Apr 04 '25 06:04 jacobsee

@jacobsee after the profile change, only the minimum acceptable TLS version will be allowed. I think we should test the affected components and verify that the platform remains stable.

To test the components, it will probably be enough to send a TLS/HTTP2 request. To make sure we haven’t accidentally broken anything else, we’ll likely need to check if all operators—or at least the affected ones—aren’t reporting any errors. I’m not sure if we can rely on the existing monitors for this, or if we should write some custom code to check the status of the operators.

It seems to me that before we merge the code into o/k, we need to make sure the platform is functioning properly. Do you agree?

p0lyn0mial avatar Apr 06 '25 14:04 p0lyn0mial

To make it more concrete, I think we could:

  1. Identify the affected components after the profile change.
  2. Check whether version 1.3 is disallowed for each component.
  3. Update the profile.
  4. Wait for the KAS rollout — we could take a shortcut here and only wait for the KAS, as it takes the longest to roll out.
  5. Verify that version 1.3 is now allowed for each component. For most components, we can use the TLS client. The aggregated APIs might be a bit special — for the aggregated APIs, we need to construct the proper URL path so that the KAS proxy forwards traffic to the appropriate server responsible for that group.
  6. Read the operator resources of the affected components and ensure the operators are not degraded and are not progressing.

Initially, I thought we could rely on the monitors for step 6, but I'm not sure what exactly they check or whether all jobs actually run them. @dgoodwin maybe you can chime in on whether relying on the monitors would be enough?

p0lyn0mial avatar Apr 07 '25 10:04 p0lyn0mial

@p0lyn0mial A couple of thoughts:

The Kube API server (and the OAuth server, which I have now also added a check for) is easy to check because there is a route to those services from outside the cluster, where the origin tests will be running. For other components, do you have any tests or other examples in our infrastructure you can point me towards where we need to access the cluster network to make requests directly against services that are not exposed outside the cluster, for testing the TLS dialing? Is the accepted answer there to set up a port-forward from the test for each service?

I've also added a request that requires the aggregated OCP API to be working, but it is a regular request through a client, as I don't believe that tls.Dial with a path component is valid... I think we would need to hit the OCP API server from the test directly, not through the Kube API's proxy.

I've also added progressing/degraded checks for cluster operators that appeared to be affected.
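
Roughly along these lines, assuming the standard openshift/client-go config clientset; the kubeconfig path and the operator names below are placeholders rather than the exact set checked in this PR:

```go
// Hedged sketch of a ClusterOperator health check: read each affected
// operator and report its Degraded/Progressing conditions.
package main

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

// conditionStatus returns the status of the named condition, or Unknown if absent.
func conditionStatus(co *configv1.ClusterOperator, t configv1.ClusterStatusConditionType) configv1.ConditionStatus {
	for _, c := range co.Status.Conditions {
		if c.Type == t {
			return c.Status
		}
	}
	return configv1.ConditionUnknown
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)

	// Example operator names; the affected set would be determined from the profile change.
	for _, name := range []string{"kube-apiserver", "openshift-apiserver", "authentication"} {
		co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: Degraded=%s Progressing=%s\n", name,
			conditionStatus(co, configv1.OperatorDegraded),
			conditionStatus(co, configv1.OperatorProgressing))
	}
}
```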

jacobsee avatar Apr 08 '25 00:04 jacobsee

@jacobsee oh, you're right, the tests are run from outside the cluster.

Mhm, I don't have any examples. I think we could either use port-forward or schedule a pod with a custom binary that performs the check. Not sure, maybe using exec.CommandContext(ctx, "oc", "port-forward", ...) would actually be easier.
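
Something like this sketch, where the namespace, service, and ports are placeholders and the sleep stands in for a proper readiness check; the dial itself is the same kind of version probe as for the externally reachable endpoints:

```go
// Illustrative only: reach a service that is not exposed outside the cluster
// by port-forwarding through oc, then probe the forwarded port with a TLS client.
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Cancelling the context kills the child oc process started below.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cmd := exec.CommandContext(ctx, "oc", "port-forward",
		"-n", "openshift-example", "svc/example-component", "10443:443") // placeholder target
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	time.Sleep(5 * time.Second) // crude stand-in for waiting until the forward is ready

	conn, err := tls.Dial("tcp", "127.0.0.1:10443", &tls.Config{
		MaxVersion:         tls.VersionTLS12, // should be refused once the Modern profile is active
		InsecureSkipVerify: true,             // test-only
	})
	if err != nil {
		fmt.Println("TLS 1.2 rejected as expected:", err)
		return
	}
	conn.Close()
	fmt.Println("unexpected: TLS 1.2 handshake succeeded")
}
```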

Also, I believe other components expose an HTTPS endpoint for metrics and health/liveness probes. For probes, we could potentially rely on the kubelet and the StaticPodsDegraded condition (at least for the static pods). For metrics, I doubt the monitoring operator would go degraded just because it can’t scrape metrics. Maybe @dgrisonnet could chime in here? For monitoring, maybe we could simply try scraping the metrics ourselves.

But the more I think about how to test this change, the more I lean toward creating a periodic job that updates the profile, ensures the change has been observed by the affected components, and then simply runs some (conformance) tests.

I’ve also started an internal thread about it. We might also need the test you created; at the very least, we should verify that the affected components actually picked up the new TLS profile.

p0lyn0mial avatar Apr 08 '25 12:04 p0lyn0mial

I don't think cluster-monitoring-operator would report degraded if a target is unavailable, but you could rely on the TargetDown alert.

I don't remember if we have a catch all for pods liveness, but you could always rely on kubelet's prober_probe_total metric.

dgrisonnet avatar Apr 10 '25 09:04 dgrisonnet

/lgtm

dusk125 avatar Apr 14 '25 19:04 dusk125

/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn https://github.com/openshift/kubernetes/pull/2135

jacobsee avatar Apr 15 '25 02:04 jacobsee

@jacobsee: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

openshift-ci[bot] avatar Apr 15 '25 02:04 openshift-ci[bot]

/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn openshift/kubernetes#2135

jacobsee avatar Apr 15 '25 02:04 jacobsee