serving
[wip] [test-only] Test for controller HA
Proposed Changes
- See #15238
Release Note
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: skonto
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [skonto]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 84.80%. Comparing base (62ce45c) to head (fbf67dd). Report is 150 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #15321 +/- ##
==========================================
+ Coverage 84.76% 84.80% +0.03%
==========================================
Files 218 218
Lines 13504 13504
==========================================
+ Hits 11447 11452 +5
+ Misses 1690 1686 -4
+ Partials 367 366 -1
/test istio-latest-no-mesh
The lease does not seem to be updated after the pod restart, although it has been released. Logs here and here: downloaded-logs-20240611-000442.json. I will try to print the leases after the restart to verify this.
"insertId": "am8bz9y3f27j2j3s",
"jsonPayload": {
"caller": "leaderelection/context.go:167",
"timestamp": "2024-06-10T19:27:08.396Z",
"logger": "controller",
"commit": "bbc55b6",
"message": "\"controller-6cdfc667d6-4b8qq_2b1432a3-4afd-4c2a-b3ef-05bb4523f4ab\" has stopped leading \"controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.08-of-10\"",
"knative.dev/pod": "controller-6cdfc667d6-4b8qq"
/test istio-latest-no-mesh
/test istio-latest-no-mesh
/test istio-latest-no-mesh
/test istio-latest-no-mesh
/test istio-latest-no-mesh
/test istio-latest-no-mesh
/test istio-latest-no-mesh
Reproduced it again, but missed the lease printing due to a wrong namespace; re-running. :crossed_fingers:
/test istio-latest-no-mesh
/test istio-latest-no-mesh
/test istio-latest-no-mesh
/test istio-latest-no-mesh
/test istio-latest-no-mesh
Clearly, not all leases are updated after the pod restarts.
The second time the tests are executed, all certificate leases but one are updated.
Looking at downloaded-logs-20240615-204712.json, it seems that the lock is never released:
[
{
"insertId": "m77ofqucl0dxhuuf",
"jsonPayload": {
"caller": "leaderelection/context.go:167",
"message": "\"controller-859dc45cfb-8l5m7_23d0e3f1-2b3d-4a5c-a870-948746801c09\" has stopped leading \"controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-10\"",
"knative.dev/pod": "controller-859dc45cfb-8l5m7",
"logger": "controller",
"commit": "d071b17",
"timestamp": "2024-06-15T10:21:27.869Z"
},
"resource": {
"type": "k8s_container",
"labels": {
"cluster_name": "kt2-4bc2f2da-af36-40c2-bad4-6cc8742ae-1",
"location": "us-east1",
"container_name": "controller",
"namespace_name": "1357db35-8ecc-4cb7-b5e5-a7c9b7edbf4b",
"pod_name": "controller-859dc45cfb-8l5m7",
"project_id": "knative-boskos-76"
}
},
"timestamp": "2024-06-15T10:21:27.869226666Z",
"severity": "INFO",
"labels": {
"k8s-pod/kapp_k14s_io/app": "1718445170413216642",
"k8s-pod/app": "controller",
"k8s-pod/app_kubernetes_io/version": "devel",
"compute.googleapis.com/resource_name": "gke-kt2-4bc2f2da-af36-40-default-pool-5aec12e9-v7f4",
"k8s-pod/kapp_k14s_io/association": "v1.ae5f7406090d99cc0cb95abd8ded8439",
"k8s-pod/app_kubernetes_io/component": "controller",
"k8s-pod/pod-template-hash": "859dc45cfb",
"k8s-pod/app_kubernetes_io/name": "knative-serving"
},
"logName": "projects/knative-boskos-76/logs/stdout",
"receiveTimestamp": "2024-06-15T10:21:31.109539752Z"
},
{
"insertId": "0yg652b5n1vnnjye",
"jsonPayload": {
"message": "Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io \"controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-10\": the object has been modified; please apply your changes to the latest version and try again",
"pid": "1"
},
"resource": {
"type": "k8s_container",
"labels": {
"container_name": "controller",
"project_id": "knative-boskos-76",
"pod_name": "controller-859dc45cfb-8l5m7",
"location": "us-east1",
"cluster_name": "kt2-4bc2f2da-af36-40c2-bad4-6cc8742ae-1",
"namespace_name": "1357db35-8ecc-4cb7-b5e5-a7c9b7edbf4b"
}
},
"timestamp": "2024-06-15T10:21:27.869230187Z",
"severity": "ERROR",
"labels": {
"k8s-pod/kapp_k14s_io/association": "v1.ae5f7406090d99cc0cb95abd8ded8439",
"compute.googleapis.com/resource_name": "gke-kt2-4bc2f2da-af36-40-default-pool-5aec12e9-v7f4",
"k8s-pod/app_kubernetes_io/component": "controller",
"k8s-pod/app": "controller",
"k8s-pod/app_kubernetes_io/name": "knative-serving",
"k8s-pod/app_kubernetes_io/version": "devel",
"k8s-pod/kapp_k14s_io/app": "1718445170413216642",
"k8s-pod/pod-template-hash": "859dc45cfb"
},
"logName": "projects/knative-boskos-76/logs/stderr",
"sourceLocation": {
"file": "leaderelection.go",
"line": "308"
},
"receiveTimestamp": "2024-06-15T10:21:31.157478025Z"
}
]
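The "Operation cannot be fulfilled ... the object has been modified" error is the apiserver's optimistic-concurrency conflict: the release path writes the Lease back with the resourceVersion it last saw, so any concurrent write to the Lease makes the update fail and the holder stays recorded. For illustration only (this is not the client-go release code), clearing the holder with the usual conflict-retry pattern would look roughly like:

package leasedebug

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// releaseLease clears the holder of a Lease, re-reading it on every attempt so
// the update carries the latest resourceVersion. Updating with a stale version
// is what produces the "the object has been modified" error seen above.
func releaseLease(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		lease, err := c.CoordinationV1().Leases(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		lease.Spec.HolderIdentity = nil
		_, err = c.CoordinationV1().Leases(ns).Update(ctx, lease, metav1.UpdateOptions{})
		return err
	})
}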
cc @dprotaso
/test istio-latest-no-mesh
/test istio-latest-no-mesh
/test istio-latest-no-mesh
@skonto: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| istio-latest-no-mesh_serving_main | fbf67ddebad1b8d5fd1aac7a81307bedeba3e0c9 | link | true | /test istio-latest-no-mesh |
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Last run here shows that:
After restarting we have:
...
chaosduck-f45fb7b47-l85xb 1/1 Running 0 28m
controller-5bfdc8f748-7nx6j 1/1 Running 0 3s
controller-5bfdc8f748-dfgds 1/1 Running 0 3s
controller-5bfdc8f748-mn922 1/1 Running 0 3s
Controller pods have 9 reconcilers, as expected.
Chaosduck then kills one of the controllers.
However, before that, and before we update the configmap, chaosduck is also deleting pods, leaving leases that are not cleaned up:
controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-10
controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.01-of-10
controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.02-of-10
controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.03-of-10
controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.04-of-10
controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.05-of-10
controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.06-of-10
controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.07-of-10
controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.08-of-10 controller-5bfdc8f748-rwwfx_71275a6d-e52c-4375-a4ed-a011c97c9dfe
controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.09-of-10
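To confirm which of these leases are stale, one option (just a sketch; the "app=controller" selector and the "<pod>_<uuid>" holder format are taken from the pod labels and log messages above) is to flag leases whose holder no longer matches a live controller pod:

package leasedebug

import (
	"context"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// staleLeases returns the names of leases whose holder points at a controller
// pod that no longer exists.
func staleLeases(ctx context.Context, c kubernetes.Interface, ns string) ([]string, error) {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: "app=controller"})
	if err != nil {
		return nil, err
	}
	live := map[string]bool{}
	for _, p := range pods.Items {
		live[p.Name] = true
	}

	leases, err := c.CoordinationV1().Leases(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var stale []string
	for _, l := range leases.Items {
		if l.Spec.HolderIdentity == nil || *l.Spec.HolderIdentity == "" {
			continue
		}
		// Holder identities look like "<pod-name>_<uuid>" in the logs above.
		holderPod := strings.SplitN(*l.Spec.HolderIdentity, "_", 2)[0]
		if !live[holderPod] {
			stale = append(stale, l.Name)
		}
	}
	return stale, nil
}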
The controller pod created before the ones above runs leader election for the certificate reconciler, as it probably starts before the configmap update.
However, soon enough it stops leading, as chaosduck kicks in again.
Not all leases are released, though, since the last update fails with the "Failed to release lock" conflict shown above.
It seems the Go client does not clean up everything.
The timestamps show that controller-5bfdc8f748-rwwfx starts before controller-5bfdc8f748-7nx6j due to chaos and uses the old configmap settings.
A lot of controller pods are being created due to chaosduck.
Full list: downloaded-logs-20240618-194319.csv.
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This Pull Request is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen with /reopen. Mark as fresh by adding the comment /remove-lifecycle stale.