Add `test/integration`
## Summary

Adding a framework for integration tests.

Startup takes a long time - I am contemplating adding support for a shared embedded etcd (or an external etcd like in kube) if that significantly improves startup.

## What type of PR is this?

/kind feature

## Related Issue(s)

## Release Notes

NONE
Hi @ntnn. Thanks for your PR.
I'm waiting for a kcp-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
Tests fail when run with race detection enabled. I think kube also doesn't run their integration tests under the race detector.
/ok-to-test
/test pull-kcp-test-unit
/test pull-kcp-test-e2e-shared
> Startup takes a long time - I am contemplating adding support for a shared embedded etcd (or an external etcd like in kube) if that significantly improves startup.
This is kind of a known problem with starting kcp the first time, so we should probably tackle this in kcp itself, not the integration test framework.
Yeah, I just measured the startup times - etcd takes less than a second, kcp phase1 takes roughly 22s. And using a shared kcp for the integration tests would make less sense imho - if the kcp instance could be shared without tests affecting each other, they could likely also be e2e tests.
Running with race detection works now. The root cause was that this method could be called from multiple goroutines if no DialContext is set:
https://github.com/kcp-dev/kcp/blob/a2014d6bc9c7585fa55dc1d1b4f8476a805e8b86/pkg/network/dialer.go#L39-L48
This originates from the server config here: https://github.com/kcp-dev/kcp/blob/a2014d6bc9c7585fa55dc1d1b4f8476a805e8b86/pkg/server/config.go#L208-L210
So either the wrapper function is guarded with a mutex to prevent concurrent writes, or the wrapper function has to act atomically. I don't think the GenericAPIServer can be persuaded to call this once up front.
I don't particularly like either option. ~~Another option could be to just remove that config - the TCP timeout should be 10m anyhow iirc.~~ Scratch that, I read `25 * time.Minute` - but it's `25 * time.Second`:
https://github.com/kcp-dev/kcp/blob/a2014d6bc9c7585fa55dc1d1b4f8476a805e8b86/pkg/network/dialer_linux.go#L35-L41
Technically it might be possible to set `.Transport` on the GenericAPIConfig with a transport that already has the TCP timeout set.
Maybe there was a good reason not to use .Transport and instead to define a wrapper:
https://github.com/kcp-dev/kcp/commit/9574eefc68c56e0f297643b000e77135eb095c6a#diff-d84eac55294d8f540a660bfcb4043f7c4e1f76c8bccbb91c129e8e4992dbafe7R216
However I can't discern a reason from reading the documentation - it even reads more like setting the transport would be the right approach to setting the TCP timeout:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/rest/config.go#L104-L107
```go
// Transport may be used for custom HTTP behavior. This attribute may not
// be specified with the TLS client certificate options. Use WrapTransport
// to provide additional per-server middleware behavior.
Transport http.RoundTripper
```
From attempting to set this attribute instead of wrapping, I'd guess wrapping was used because setting a transport requires building it exactly the way the library expects, rather than having the library provide a working transport and just adjusting the timeouts.
/test pull-kcp-test-e2e-multiple-runs
Found out why `.Transport` isn't used. When it is set, this check takes hold:
https://github.com/kubernetes/kubernetes/blob/30469e180361d7da07b0fee6d47c776fa2cf3e86/staging/src/k8s.io/client-go/transport/transport.go#L34-L40
```go
// New returns an http.RoundTripper that will provide the authentication
// or transport level security defined by the provided Config.
func New(config *Config) (http.RoundTripper, error) {
	// Set transport level security
	if config.Transport != nil && (config.HasCA() || config.HasCertAuth() || config.HasCertCallback() || config.TLS.Insecure) {
		return nil, fmt.Errorf("using a custom transport with TLS certificate options or the insecure flag is not allowed")
	}
```
So the TLS configuration would have to be removed, which then causes connections to fail due to the missing TLS configuration:
```
E0415 14:28:52.537341 14738 storage_rbac.go:191] "Unhandled Error" err="unable to initialize clusterroles: Get \"https://127.0.0.1:57059/clusters/system:admin/apis/rbac.authorization.k8s.io/v1/clusterroles\": tls: failed to verify certificate: x509: certificate signed by unknown authority" logger="UnhandledError"
F0415 14:28:52.538522 14738 hooks.go:210] PostStartHook "kcp-bootstrap-policy" failed: unable to initialize roles: timed out waiting for the condition
FAIL    github.com/kcp-dev/kcp/test/integration/framework       39.978s
```
https://github.com/kcp-dev/kcp/commit/a07f6105c1016aac468dbb6147ce47021eba8a62#diff-d84eac55294d8f540a660bfcb4043f7c4e1f76c8bccbb91c129e8e4992dbafe7R210
Technically a bug, but it was introduced more than seven years ago - so for a fix on the kcp side, for now we'll have to go with one of the workarounds.
Very nice!
About startup: I think we need somebody to debug our startup process. There must be some polling that is slow. At some point we split the kcp core informers into one part that does not need identities (system CRDs) and the resources based on APIExports (which do need identities). And there is logic to bootstrap in a multi-shard env. Something in this area must be far too slow - maybe some polling with too long an interval.
Marking as WIP as I want to change how the server is started.
The goleak ignore list is pretty long, but I'm fairly certain that quite a few entries are just downstream of other issues.
LGTM label has been added.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mjudeikis