kpt-config-sync
kpt-config-sync copied to clipboard
[WIP] Improve multi-repo e2e test reset
- Add a Reconciler Finalizer to handle deletion propagation
- Use deletion propagation to handle cleanup of managed test resources.
- Rewrite multi-repo test cleanup to handle RSync finalization before reconciler dependencies are deleted.
- Set the Repository.Format on ReInit, in case it has changed.
/hold
Turns out this doesn't actually fix the GKE tests, even tho it works in Kind. The reset code isn't well tested in Kind because it just throws away the cluster after each test.
Looks like I've still got a bug in here. I need to make it skip modifying managed RSyncs and store them in a queue to reset. Otherwise the webhook rejects the change. And even if I disable the webhook it's still faster to not have to re-create the RSync to reset it and delete it again.
Added new reconciler finalizer to see if it passes all the e2e tests before I break this down into smaller PRs and add more unit tests.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from karlkfi by writing /assign @karlkfi in a comment. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Extracted dependency updates: https://github.com/GoogleContainerTools/kpt-config-sync/pull/232
Extracted reconciler shutdown imporvements: https://github.com/GoogleContainerTools/kpt-config-sync/pull/234
Extracted UpstreamRepoURL removal: https://github.com/GoogleContainerTools/kpt-config-sync/pull/240
Extracted finalizer to https://github.com/GoogleContainerTools/kpt-config-sync/pull/242
$ make test-e2e-kind-multi-repo-test-group3 E2E_ARGS="--share-test-env -v --test.v -p 1 --test.parallel=6 --timeout 2h"
...
PASS
ok kpt.dev/configsync/e2e/testcases 1507.431s
Tests took 1566 seconds
make[1]: Leaving directory '/home/karlisenberg/workspace/config-sync-prototype'
$ make test-e2e-kind-multi-repo-test-group2 E2E_ARGS="--share-test-env -v --test.v -p 1 --test.parallel=6 --timeout 2h"
...
PASS
ok kpt.dev/configsync/e2e/testcases 984.499s
Tests took 1027 seconds
make[1]: Leaving directory '/home/karlisenberg/workspace/config-sync-prototype'
Checkpointing protected namespaces fixes the last issue I've seen locally.
$ make test-e2e-kind-multi-repo-test-group1 E2E_ARGS="--share-test-env -v --test.v -p 1 --test.parallel=6 --timeout 2h"
...
PASS
ok kpt.dev/configsync/e2e/testcases 953.833s
Tests took 988 seconds
make[1]: Leaving directory '/home/karlisenberg/workspace/config-sync-prototype'
/retest
I think this is a pre-existing flakey error:
Error: cannot build pipelines: failt to create "opencensus" receiver, in pipeline "metrics": failed to bind to address "0.0.0.0:55678": listen tcp 0.0.0.0:55678: bind: address already in use
All 3 e2e tests groups passed locally on kind with shared-env using the latest changes & rebase.
multi_sync_test.go:107: ERROR: admission webhook "rbac.authorization.k8s.io.v1.admission-webhook.configsync.gke.io" denied the request: kubernetes-admin is not authorized to delete managed resource "rbac.authorization.k8s.io_rolebinding_test-ns_nr2"
Looks like the extra cleanup in TestMultiSyncs_Unstructured_MixedControl calling ResetRepoSyncs before ResetRootSyncs was violating the assumption that all RSyncs would be deleted before trying to delete all the remaining RoleBindings. So I added a check to skip deleting managed RoleBindings.
If this causes a side effect, it will probably manifest as a namespace garbage collector deadlocking with a similar error.
If that happens, we may have to disable the admission webhook in reset, which I don't want to do, because it will slow things down re-enabling it every time.
Another option would be to revert the TestMultiSyncs_Unstructured_MixedControl cleanup change and let it keep doing it the hard way. But I don't want to do that, because ResetRepoSyncs is more robust and accounts for more edge cases.
/lgtm /approve
/retest
I'm not sure why this is is flakey. It seems like this test creates test-ns and fails if it exists, indicating the previous cleanup was incomplete, but I'm not sure why...
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: nan-yu, sdowell
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [nan-yu,sdowell]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment