kpt-config-sync icon indicating copy to clipboard operation
kpt-config-sync copied to clipboard

[WIP] Improve multi-repo e2e test reset

Open karlkfi opened this issue 3 years ago • 8 comments

  • Add a Reconciler Finalizer to handle deletion propagation
  • Use deletion propagation to handle cleanup of managed test resources.
  • Rewrite multi-repo test cleanup to handle RSync finalization before reconciler dependencies are deleted.
  • Set the Repository.Format on ReInit, in case it has changed.

karlkfi avatar Oct 07 '22 05:10 karlkfi

/hold

Turns out this doesn't actually fix the GKE tests, even tho it works in Kind. The reset code isn't well tested in Kind because it just throws away the cluster after each test.

karlkfi avatar Oct 07 '22 21:10 karlkfi

Looks like I've still got a bug in here. I need to make it skip modifying managed RSyncs and store them in a queue to reset. Otherwise the webhook rejects the change. And even if I disable the webhook it's still faster to not have to re-create the RSync to reset it and delete it again.

karlkfi avatar Oct 08 '22 01:10 karlkfi

Added new reconciler finalizer to see if it passes all the e2e tests before I break this down into smaller PRs and add more unit tests.

karlkfi avatar Oct 15 '22 00:10 karlkfi

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from karlkfi by writing /assign @karlkfi in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] avatar Oct 19 '22 23:10 google-oss-prow[bot]

Extracted dependency updates: https://github.com/GoogleContainerTools/kpt-config-sync/pull/232

karlkfi avatar Oct 21 '22 19:10 karlkfi

Extracted reconciler shutdown imporvements: https://github.com/GoogleContainerTools/kpt-config-sync/pull/234

karlkfi avatar Oct 21 '22 22:10 karlkfi

Extracted UpstreamRepoURL removal: https://github.com/GoogleContainerTools/kpt-config-sync/pull/240

karlkfi avatar Oct 25 '22 00:10 karlkfi

Extracted finalizer to https://github.com/GoogleContainerTools/kpt-config-sync/pull/242

karlkfi avatar Oct 25 '22 21:10 karlkfi

$ make test-e2e-kind-multi-repo-test-group3 E2E_ARGS="--share-test-env -v --test.v -p 1 --test.parallel=6 --timeout 2h"
...
PASS
ok  	kpt.dev/configsync/e2e/testcases	1507.431s
Tests took 1566 seconds
make[1]: Leaving directory '/home/karlisenberg/workspace/config-sync-prototype'

karlkfi avatar Jan 12 '23 21:01 karlkfi

$ make test-e2e-kind-multi-repo-test-group2 E2E_ARGS="--share-test-env -v --test.v -p 1 --test.parallel=6 --timeout 2h"
...
PASS
ok  	kpt.dev/configsync/e2e/testcases	984.499s
Tests took 1027 seconds
make[1]: Leaving directory '/home/karlisenberg/workspace/config-sync-prototype'

karlkfi avatar Jan 12 '23 22:01 karlkfi

Checkpointing protected namespaces fixes the last issue I've seen locally.

$ make test-e2e-kind-multi-repo-test-group1 E2E_ARGS="--share-test-env -v --test.v -p 1 --test.parallel=6 --timeout 2h"
...
PASS
ok  	kpt.dev/configsync/e2e/testcases	953.833s
Tests took 988 seconds
make[1]: Leaving directory '/home/karlisenberg/workspace/config-sync-prototype'

karlkfi avatar Jan 13 '23 02:01 karlkfi

/retest

I think this is a pre-existing flakey error: Error: cannot build pipelines: failt to create "opencensus" receiver, in pipeline "metrics": failed to bind to address "0.0.0.0:55678": listen tcp 0.0.0.0:55678: bind: address already in use

karlkfi avatar Jan 17 '23 22:01 karlkfi

All 3 e2e tests groups passed locally on kind with shared-env using the latest changes & rebase.

karlkfi avatar Jan 18 '23 00:01 karlkfi

multi_sync_test.go:107: ERROR: admission webhook "rbac.authorization.k8s.io.v1.admission-webhook.configsync.gke.io" denied the request: kubernetes-admin is not authorized to delete managed resource "rbac.authorization.k8s.io_rolebinding_test-ns_nr2"

Looks like the extra cleanup in TestMultiSyncs_Unstructured_MixedControl calling ResetRepoSyncs before ResetRootSyncs was violating the assumption that all RSyncs would be deleted before trying to delete all the remaining RoleBindings. So I added a check to skip deleting managed RoleBindings.

If this causes a side effect, it will probably manifest as a namespace garbage collector deadlocking with a similar error.

If that happens, we may have to disable the admission webhook in reset, which I don't want to do, because it will slow things down re-enabling it every time.

Another option would be to revert the TestMultiSyncs_Unstructured_MixedControl cleanup change and let it keep doing it the hard way. But I don't want to do that, because ResetRepoSyncs is more robust and accounts for more edge cases.

karlkfi avatar Jan 21 '23 00:01 karlkfi

/lgtm /approve

sdowell avatar Jan 23 '23 21:01 sdowell

/retest

I'm not sure why this is is flakey. It seems like this test creates test-ns and fails if it exists, indicating the previous cleanup was incomplete, but I'm not sure why...

karlkfi avatar Jan 24 '23 04:01 karlkfi

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nan-yu, sdowell

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] avatar Jan 24 '23 05:01 google-oss-prow[bot]