TLS 1.3 / Modern profile tests
This PR:
- Removes a skip on an existing TLS test; the original failure appears to have been a setup issue
- Adds a test ensuring that components accept connections only at the expected TLS version or higher (a rough sketch of that kind of check follows below)
cc: @dusk125
Supports OCPBUGS-37706, OCPSTRAT-1364
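For context, the shape of such a check is roughly the following. This is a standalone sketch, not the PR's test code; the address is a placeholder.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// probeMaxVersion attempts a handshake capped at maxVersion. Against a listener
// restricted to TLS 1.3+ (the Modern profile), an offer capped at 1.2 should fail.
func probeMaxVersion(addr string, maxVersion uint16) error {
	conn, err := tls.Dial("tcp", addr, &tls.Config{
		MaxVersion:         maxVersion,
		InsecureSkipVerify: true, // we only care about the negotiated version here
	})
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	// Hypothetical endpoint; the real test resolves the kube-apiserver host.
	addr := "api.example.cluster:6443"

	if err := probeMaxVersion(addr, tls.VersionTLS12); err != nil {
		fmt.Println("TLS 1.2 handshake rejected as expected:", err)
	} else {
		fmt.Println("TLS 1.2 handshake accepted - listener is not restricted to 1.3+")
	}
}
```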
Job Failure Risk Analysis for sha: 6c00d7595e29ced0457431e47381bbbd6bb7732d
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-main-e2e-gcp-ovn-etcd-scaling | High [sig-architecture] platform pods in ns/openshift-etcd should not exit an excessive amount of times This test has passed 100.00% of 1 runs on jobs [periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-etcd-scaling periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling] in the last 14 days. |
/lgtm
/retest-required
Job Failure Risk Analysis for sha: ff621200d7324a17f2efb668a046241be92fd5dc
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-main-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback | MissingData |
| pull-ci-openshift-origin-main-e2e-aws-disruptive | Medium [sig-node] static pods should start after being created Potential external regression detected for High Risk Test analysis |
| pull-ci-openshift-origin-main-e2e-azure-ovn-etcd-scaling | Low [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded This test has passed 0.00% of 1 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week. --- [bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available This test has passed 0.00% of 1 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week. --- [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available This test has passed 0.00% of 1 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week. |
| pull-ci-openshift-origin-main-e2e-gcp-disruptive | Medium [sig-node] static pods should start after being created Potential external regression detected for High Risk Test analysis --- [sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator Potential external regression detected for High Risk Test analysis |
| pull-ci-openshift-origin-main-e2e-gcp-ovn-etcd-scaling | Medium [bz-etcd][invariant] alert/etcdMembersDown should not be at or above info Potential external regression detected for High Risk Test analysis --- [sig-architecture] platform pods in ns/openshift-etcd should not exit an excessive amount of times Potential external regression detected for High Risk Test analysis |
Which jobs run the new test? I have checked metal-ipi-serial, aws-ovn-microshift-serial, and aws-ovn-serial but didn't find the new test on the list.
Job Failure Risk Analysis for sha: b6d72fde3e33918c98d308205a1485ad12e5a221
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-main-e2e-gcp-disruptive | Medium [sig-node] static pods should start after being created Potential external regression detected for High Risk Test analysis |
> Which jobs run the new test? I have checked metal-ipi-serial, aws-ovn-microshift-serial, and aws-ovn-serial but didn't find the new test on the list.
@jacobsee I think you need to run `make update` to actually "add" the test to the list of tests to be executed. For example: https://github.com/openshift/origin/pull/29191/commits
@p0lyn0mial Thanks! I hadn't seen that. It's in there now, and I've realized that it should probably be marked [slow] as well. To your earlier comment, do we now still need to pick jobs to run this? This is my first brush with the origin tests, so I'm not sure what's needed to get a new one plugged in correctly.
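For reference, in origin the `[Serial]`/`[Slow]` labels are carried in the Ginkgo spec text itself; a minimal sketch, with hypothetical sig/feature names rather than the PR's real test:

```go
package tlsprofile_test

import (
	"testing"

	g "github.com/onsi/ginkgo/v2"
	o "github.com/onsi/gomega"
)

func TestTLSProfile(t *testing.T) {
	o.RegisterFailHandler(g.Fail)
	g.RunSpecs(t, "TLS profile suite")
}

// The [Serial][Slow] labels in the spec text are what the test scheduler keys
// off; the sig and feature names here are placeholders.
var _ = g.Describe("[sig-api-machinery][Feature:TLSProfile][Serial][Slow] modern TLS profile", func() {
	g.It("should only accept connections at or above the configured minimum TLS version", func() {
		// test body elided
	})
})
```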
Job Failure Risk Analysis for sha: 39e5f2297637306461c6bccc35c9c781c6829e13
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-main-e2e-aws-ovn-etcd-scaling | Low [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded This test has passed 50.00% of 2 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week. |
| pull-ci-openshift-origin-main-e2e-gcp-disruptive | High [bz-Monitoring] clusteroperator/monitoring should not change condition/Degraded This test has passed 99.30% of 5292 runs on release 4.19 [Overall] in the last week. --- [bz-Monitoring] clusteroperator/monitoring should not change condition/Available This test has passed 99.45% of 5292 runs on release 4.19 [Overall] in the last week. |
> To your earlier comment, do we now still need to pick jobs to run this?
it looks like you could use /payload-job-with-prs to test your new test with the o/k PR, for example:
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn openshift/kubernetes#2135
xref: https://docs.ci.openshift.org/docs/release-oversight/pull-request-testing/#payload-job
we just need to find a periodic job that will run your test. maybe some serial test from ?
so, overall:
- we add a new test to o/o. (this PR)
- we use the `/payload-job-with-prs` command to run the test with the o/k PR
- we make sure the test is green
- we merge the o/k PR
- we merge this PR
@p0lyn0mial: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/745c1ee0-0985-11f0-8a93-335e1efa7bdb-0
@p0lyn0mial Addressed several things - the wait for etcd to stabilize was a deliberate decision after a chat with @dusk125, however, so maybe we should bring that context to this thread here. The thinking was that it is also desirable to have a test for etcd's TLS 1.3 stability, and that test would share about 90% of its code with this one (just with a different check in the middle), so we might as well ensure that both stabilize here to avoid duplicating an already-slow test.
Job Failure Risk Analysis for sha: 76dc7ef5f57fe9ffedd977c9a8a1c35062660a81
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-main-e2e-gcp-ovn-etcd-scaling | High [sig-architecture] platform pods in ns/openshift-cluster-csi-drivers should not exit an excessive amount of times This test has passed 99.98% of 6449 runs on release 4.19 [Overall] in the last week. |
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn openshift/kubernetes#2135
@jacobsee: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5ac351c0-0e39-11f0-9a0c-207cdc87222c-0
@p0lyn0mial I might not be looking in the right place/way but I'm not seeing jobs that run a [serial][slow] test - could it be that it wouldn't get picked up by any? https://sippy.dptools.openshift.org/sippy-ng/tests/4.19?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522%255Bserial%255D%2522%257D%252C%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522%255Bslow%255D%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&pageSize=100&sort=asc&sortField=current_working_percentage
There are a few tests in this repository in a similar boat (like TestImportRepositoryFromBlockedRegistry) that I don't see runs for.
This new test works when I run it manually but do you have any tips for getting it integrated into existing jobs?
> @p0lyn0mial I might not be looking in the right place/way but I'm not seeing jobs that run a [serial][slow] test - could it be that it wouldn't get picked up by any? https://sippy.dptools.openshift.org/sippy-ng/tests/4.19?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522%255Bserial%255D%2522%257D%252C%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522%255Bslow%255D%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&pageSize=100&sort=asc&sortField=current_working_percentage
> There are a few tests in this repository in a similar boat (like TestImportRepositoryFromBlockedRegistry) that I don't see runs for.
> This new test works when I run it manually but do you have any tips for getting it integrated into existing jobs?
@wangke19 @bertinatto could you help with ^ ?
> @p0lyn0mial Addressed several things - the wait for etcd to stabilize was a deliberate decision after a chat with @dusk125, however, so maybe we should bring that context to this thread here. The thinking was that it is also desirable to have a test for etcd's TLS 1.3 stability
WDYM? I think that jobs in CI run monitors that ensure the overall stability of the cluster. Otherwise, each individual test would have to implement this functionality on its own.
@jacobsee on second thought, maybe we should actually test more control plane components.
Here is a list of components that will be affected after changing the profile.
For instance, the aggregated APIs (OpenShift API server and OAuth API server) are behind the kube-apiserver (KAS). So, it's not enough to just check the connection to KAS. KAS has a built-in proxy that forwards requests to the aggregated APIs. To properly test the aggregated APIs and the proxy client in KAS, we would need to send an HTTPS request through KAS.
I believe it's worth the effort — do you agree? Should we also test the other components?
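A rough sketch of what that could look like (not code from this PR; the host, token, and group are placeholders): the request path targets a group served by an aggregated API server, so a successful response also exercises the KAS aggregation proxy under the new profile.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	kasURL := "https://api.example.cluster:6443" // placeholder; a real test resolves the actual host
	token := "sha256~REDACTED"                   // placeholder bearer token

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				MinVersion:         tls.VersionTLS13, // require the Modern profile end to end
				InsecureSkipVerify: true,             // sketch only; a real test should verify the CA
			},
		},
	}

	// oauth.openshift.io is served by an aggregated API server, so this request
	// is forwarded by the kube-apiserver's aggregation proxy.
	req, err := http.NewRequest("GET", kasURL+"/apis/oauth.openshift.io/v1", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```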
@p0lyn0mial I do agree that testing other components makes sense if we anticipate an impact to them; I'm unsure whether it makes more sense to try to wrap it all into one PR upfront or to just start with something like this... Regarding that impact to other components, I tried to solicit some additional opinions, as I haven't spent much time working with them, and found out that what we're doing here may actually fall under the coverage of an existing test. I will track that down in the morning.
@jacobsee after the profile change, only the minimum acceptable TLS version will be allowed. I think we should test the affected components and verify that the platform remains stable.
To test the components, it will probably be enough to send a TLS/HTTP2 request. To make sure we haven’t accidentally broken anything else, we’ll likely need to check if all operators—or at least the affected ones—aren’t reporting any errors. I’m not sure if we can rely on the existing monitors for this, or if we should write some custom code to check the status of the operators.
It seems to me that before we merge the code into o/k, we need to make sure the platform is functioning properly. Do you agree?
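A minimal sketch of such a TLS/HTTP2 probe, assuming a placeholder endpoint:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// Sketch: check that a component still negotiates HTTP/2 over TLS 1.3 after the
// profile change. The endpoint is a placeholder, not a real test target.
func main() {
	tr := &http.Transport{
		ForceAttemptHTTP2: true,
		TLSClientConfig: &tls.Config{
			MinVersion:         tls.VersionTLS13,
			InsecureSkipVerify: true, // sketch only
		},
	}
	resp, err := (&http.Client{Transport: tr}).Get("https://api.example.cluster:6443/healthz")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Printf("proto=%s tls=0x%x\n", resp.Proto, resp.TLS.Version)
}
```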
To make it more concrete, I think we could:
1. Identify the affected components after the profile change.
2. Check whether version 1.3 is disallowed for each component.
3. Update the profile.
4. Wait for the KAS rollout — we could take a shortcut here and only wait for the KAS, as it takes the longest to roll out.
5. Verify that version 1.3 is now allowed for each component. For most components, we can use the TLS client. The aggregated APIs might be a bit special — for the aggregated APIs, we need to construct the proper URL path so that the KAS proxy forwards traffic to the appropriate server responsible for that group.
6. Read the operator resources of the affected components and ensure the operators are not degraded and are not progressing (a sketch of this check follows below).
Initially, I thought we could rely on the monitors for step 6, but I'm not sure what exactly they check or whether all jobs actually run them. @dgoodwin maybe you can chime in on whether relying on the monitors would be enough?
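For step 6, a sketch of the operator-condition check using the openshift config client; the operator names here are assumptions, and a real test should derive the list from the affected components:

```go
package main

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

// checkOperatorStable returns an error if the named ClusterOperator reports
// Degraded=True or Progressing=True.
func checkOperatorStable(ctx context.Context, c configclient.Interface, name string) error {
	co, err := c.ConfigV1().ClusterOperators().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, cond := range co.Status.Conditions {
		bad := cond.Type == configv1.OperatorDegraded || cond.Type == configv1.OperatorProgressing
		if bad && cond.Status == configv1.ConditionTrue {
			return fmt.Errorf("clusteroperator %s has %s=True: %s", name, cond.Type, cond.Message)
		}
	}
	return nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)

	// Placeholder list of affected operators.
	for _, name := range []string{"kube-apiserver", "openshift-apiserver", "authentication", "etcd"} {
		if err := checkOperatorStable(context.TODO(), client, name); err != nil {
			fmt.Println("not stable:", err)
		}
	}
}
```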
@p0lyn0mial A couple of thoughts:
The Kube API server (and the OAuth server, which I have now also added a check for) are easy to check because there is a route to those services from outside the cluster, where the origin tests will be running. For other components, do you have any tests or other examples in our infrastructure you can point me towards where we need to access the cluster network to make requests directly against services that are not exposed outside the cluster, for testing the TLS dialing? Is the accepted answer there to set up a port-forward in the test for each service?
I've also added a request that requires the aggregated OCP API to be working, but it is a regular request through a client, as I don't believe that tls.Dial with a path component is valid... I think we would need to hit the OCP API server from the test directly, not through the Kube API's proxy.
I've also added progressing/degraded checks for cluster operators that appeared to be affected.
@jacobsee oh, you're right, the tests are run from outside the cluster.
Mhm, I don't have any examples. I think we could either use port-forward or schedule a pod with a custom binary that performs the check. Not sure, maybe using `exec.CommandContext(ctx, "oc", "port-forward", ...)` would actually be easier.
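A rough sketch of that port-forward idea; the service, namespace, and ports are placeholders:

```go
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Forward a local port to a service that isn't exposed outside the cluster.
	cmd := exec.CommandContext(ctx, "oc", "-n", "openshift-oauth-apiserver",
		"port-forward", "svc/api", "8443:443")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	defer cmd.Process.Kill()

	// Crude wait for the forward to come up; a real test would poll the port.
	time.Sleep(3 * time.Second)

	conn, err := tls.Dial("tcp", "127.0.0.1:8443", &tls.Config{
		MinVersion:         tls.VersionTLS13,
		InsecureSkipVerify: true, // sketch only
	})
	if err != nil {
		fmt.Println("handshake failed:", err)
		return
	}
	defer conn.Close()
	fmt.Printf("negotiated TLS version: 0x%x\n", conn.ConnectionState().Version)
}
```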
Also, I believe other components expose an HTTPS endpoint for metrics and health/liveness probes. For probes, we could potentially rely on the kubelet and the StaticPodsDegraded condition (at least for the static pods). For metrics, I doubt the monitoring operator would go degraded just because it can’t scrape metrics. Maybe @dgrisonnet could chime in here? For monitoring, maybe we could simply try scraping the metrics ourselves.
But the more I think about how to test this change, the more I lean toward simply creating a periodic job that updates the profile, ensures the change has been observed by the affected components, and then runs some (conformance) tests.
I’ve also started an internal thread about it. We might also need the test you created; at the very least, we should verify that the affected components actually picked up the new TLS profile.
I don't think cluster-monitoring-operator would report degraded if a target is unavailable, but you could rely on the TargetDown alert.
I don't remember if we have a catch all for pods liveness, but you could always rely on kubelet's prober_probe_total metric.
/lgtm
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn https://github.com/openshift/kubernetes/pull/2135
@jacobsee: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn openshift/kubernetes#2135