serving icon indicating copy to clipboard operation
serving copied to clipboard

Test for internal TLS certificate rotation

Open mgencur opened this issue 1 year ago • 10 comments

Add test for TLS certificate rotation

Proposed Changes

  • Add test for TLS certificate rotation (including rotation of certificates for Activator, queue-proxy, Ingress)
  • Re-implement the local function scanPodLogs as a more versatile WaitForLog that can wait for a given condition.
  • Fix a bug related to ServingFlags.SkipCleanupOnFail. The cleanup should really be skipped only if the test fails. This was missed during reviews on the original PR
  • Function getCertManagerCA now also waits for the Secret to exist rather than failing immediately

Release Note


mgencur avatar May 17 '24 08:05 mgencur

/hold

mgencur avatar May 17 '24 09:05 mgencur

/unhold

mgencur avatar May 17 '24 09:05 mgencur

The test fails with:

system_internal_tls_test.go:234: Deleting trust-bundle ConfigMap
    stream.go:305: D 11:08:49.430 autoscaler-7d8fbf5b9-m86j7 [metrics/stats_scraper.go:252] [serving-tests/tls-certificate-rotation-ebvlcsmm-00001] |OldPods| = 1, |YoungPods| = 0
    spoof.go:110: Spoofing tls-certificate-rotation-ebvlcsmm.serving-tests.example.com -> 34.138.175.134
    system_internal_tls_test.go:254: The endpoint http://tls-certificate-rotation-ebvlcsmm.serving-tests.example.com/ didn't serve the expected text "Hello World! How about some tasty noodles?": response: status: 503, body: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: TLS_error:|268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED:TLS_error_end:TLS_error_end, headers: map[Content-Length:[230] Content-Type:[text/plain] Date:[Fri, 17 May 2024 11:08:49 GMT] Server:[istio-envoy] Zipkin_trace_id:[d99519b5453a913fa98600ab36eea963]] did not pass checks: status = 503 503 Service Unavailable, want one of: [200]

@ReToCode I suppose the certificates might not be properly reloaded yet after deleting the routing-serving-certs secret and before we remove the "trust-bundle" configmap?

mgencur avatar May 17 '24 11:05 mgencur

The test passed when I added waiting for ca.crt in routing-serving-certs. Will try again during reviews.

mgencur avatar May 17 '24 13:05 mgencur

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 84.74%. Comparing base (b5455c9) to head (4e5bd95). Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15217      +/-   ##
==========================================
- Coverage   84.78%   84.74%   -0.04%     
==========================================
  Files         218      218              
  Lines       13479    13479              
==========================================
- Hits        11428    11423       -5     
- Misses       1685     1687       +2     
- Partials      366      369       +3     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar May 22 '24 10:05 codecov[bot]

@ReToCode this is ready for another review.

mgencur avatar May 23 '24 06:05 mgencur

/override "style / suggester / shell"

ReToCode avatar May 23 '24 06:05 ReToCode

@ReToCode: ReToCode unauthorized: /override is restricted to Repo administrators.

In response to this:

/override "style / suggester / shell"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow[bot] avatar May 23 '24 06:05 knative-prow[bot]

@dprotaso mind overriding the shell-check? IMHO, this should be in a separate PR, as only the time changed in this one.

ReToCode avatar May 23 '24 06:05 ReToCode

@ReToCode @dprotaso OK. I went ahead and fixed the shell check.

mgencur avatar May 28 '24 07:05 mgencur

/lgtm /approve

ReToCode avatar May 28 '24 08:05 ReToCode

@ReToCode Hmm. The latest error looks genuine:

system_internal_tls_test.go:225: Deleting secret in system namespace
    spoof.go:110: Spoofing tls-certificate-rotation-jryhlssh.serving-tests.example.com -> 35.231.189.127
    stream.go:305: D 08:40:20.489 autoscaler-56bfd47c76-4m7mn [statforwarder/processor.go:119] [serving-tests/tls-certificate-rotation-jryhlssh-00001] Forward stat of bucket autoscaler-bucket-07-of-10 to the holder autoscaler-56bfd47c76-dx6fr_10.120.4.6
    stream.go:305: D 08:40:20.489 autoscaler-56bfd47c76-dx6fr [statforwarder/processor.go:64] [serving-tests/tls-certificate-rotation-jryhlssh-00001] Accept stat as owner of bucket autoscaler-bucket-07-of-10
system_internal_tls_test.go:231: Deleting secret in ingress namespace
    spoof.go:110: Spoofing tls-certificate-rotation-jryhlssh.serving-tests.example.com -> 35.231.189.127
    stream.go:305: D 08:40:20.605 autoscaler-56bfd47c76-dx6fr [statforwarder/processor.go:64] [serving-tests/tls-certificate-rotation-jryhlssh-00001] Accept stat as owner of bucket autoscaler-bucket-07-of-10
system_internal_tls_test.go:237: Deleting old certificate from trust-bundle ConfigMap
    stream.go:305: D 08:40:20.707 autoscaler-56bfd47c76-dx6fr [metrics/stats_scraper.go:252] [serving-tests/tls-certificate-rotation-jryhlssh-00001] |OldPods| = 1, |YoungPods| = 0
    spoof.go:110: Spoofing tls-certificate-rotation-jryhlssh.serving-tests.example.com -> 35.231.189.127
    system_internal_tls_test.go:266: The endpoint http://tls-certificate-rotation-jryhlssh.serving-tests.example.com didn't serve the expected text "Hello World! How about some tasty noodles?": response: status: 503, body: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: TLS_error:|268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED:TLS_error_end:TLS_error_end, headers: map[Content-Length:[230] Content-Type:[text/plain] Date:[Tue, 28 May 2024 08:40:20 GMT] Server:[istio-envoy] Zipkin_trace_id:[95ccff430d23ad30786fccb1f81036fa]] did not pass checks: status = 503 503 Service Unavailable, want one of: [200]

So, after deleting the old cert from the trust-bundle there's an error with TLS. Any idea why this could happen?

mgencur avatar May 28 '24 09:05 mgencur

@ReToCode As you mentioned here, we should not be deleting the old cert from the trust-bundle ConfigMap because the test is quick and the certificates might not be rotated yet. So, let's keep the ConfigMap?

mgencur avatar May 28 '24 10:05 mgencur

@ReToCode As you mentioned https://github.com/knative/serving/pull/15217#discussion_r1607753785, we should not be deleting the old cert from the trust-bundle ConfigMap because the test is quick and the certificates might not be rotated yet. So, let's keep the ConfigMap?

+1, yes that could take a while to fully complete. Makes not much sense to wait for that in an e2e test.

ReToCode avatar May 28 '24 10:05 ReToCode

/lgtm /approve

ReToCode avatar May 28 '24 11:05 ReToCode

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mgencur, ReToCode

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

knative-prow[bot] avatar May 28 '24 11:05 knative-prow[bot]

IMHO, we can skip certmanager-integration-tests_serving_main, as per https://github.com/knative/serving/pull/15261

ReToCode avatar May 28 '24 11:05 ReToCode

/override "certmanager-integration-tests_serving_main"

dprotaso avatar May 28 '24 13:05 dprotaso

@dprotaso: Overrode contexts on behalf of dprotaso: certmanager-integration-tests_serving_main

In response to this:

/override "certmanager-integration-tests_serving_main"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow[bot] avatar May 28 '24 13:05 knative-prow[bot]

@mgencur: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
certmanager-integration-tests_serving_main 4e5bd956f960b51d366571ddb4022e814e63c297 link unknown /test certmanager-integration-tests

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

knative-prow[bot] avatar May 28 '24 13:05 knative-prow[bot]