coreos-assembler
kola: add KOLA_LEAK_ON_FAIL
Sometimes, we hit test flakes that are hard to reproduce manually or
that are cumbersome to set up the same way a test does. Add a new
KOLA_LEAK_ON_FAIL env var which will cause kola to:
- enable console auto-login
- print the SSH key on the console
- not deprovision the machine if the test fails
This allows an engineer to dig deeper and poke at the VM when the failure happens, either through SSH or through the serial console. We could then expose this in the pipeline as a parameter so we can do custom runs with the variable enabled (and manually clean up any leaked VMs after investigating).
This is more powerful than --ssh-on-test-failure because it applies to
any failure in general, including provisioning failures, it allows for
debugging via serial console if SSH itself is broken, and it's feasible
to use in a pipeline.
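To make the intended teardown behaviour concrete, here is a minimal sketch, not the code in this patch. The `machine` type and the `finishTest` and `leakOnFail` helpers are hypothetical names made up for illustration; only the `KOLA_LEAK_ON_FAIL` variable comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errExample stands in for any test or provisioning failure.
var errExample = errors.New("example failure")

// machine is a hypothetical stand-in for a provisioned test VM.
type machine struct {
	id        string
	destroyed bool
}

// destroy marks the machine as deprovisioned.
func (m *machine) destroy() { m.destroyed = true }

// leakOnFail reports whether KOLA_LEAK_ON_FAIL is set to a non-empty value.
func leakOnFail() bool {
	return os.Getenv("KOLA_LEAK_ON_FAIL") != ""
}

// finishTest deprovisions the machine unless the test failed and leaking
// was requested. It returns true when the machine was left running.
func finishTest(m *machine, testErr error, leak bool) bool {
	if testErr != nil && leak {
		fmt.Printf("test failed (%v); leaving machine %s running for debugging\n", testErr, m.id)
		return true
	}
	m.destroy()
	return false
}

func main() {
	m := &machine{id: "example-vm"}
	leaked := finishTest(m, errExample, leakOnFail())
	fmt.Println("leaked:", leaked)
}
```

The point of the sketch is that the decision sits at teardown time, so it covers provisioning failures as well as test failures, and anything left running has to be cleaned up by hand.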
Converting to draft.
@dustymabe said he's willing to try this out locally with this patch (thanks!). So let's try that first. If for whatever reason the only way to reliably reproduce this is in CI, we can always upload a custom cosa image to Quay.
I think maybe this patch needs to be tweaked slightly:
--- PASS: ext.config.gshadow (135.60s)
--- FAIL: ostree.unlock (738.00s)
harness.go:1187: Cluster failed starting machines: machine "420c0aa5-87f1-47b1-83d6-024c4953d312" failed to start: ssh journalctl failed: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
--- PASS: rpmostree.install-uninstall (539.54s)
--- PASS: rpmostree.install-uninstall/install (208.05s)
--- PASS: rpmostree.install-uninstall/uninstall (134.33s)
--- PASS: podman.network-single (389.09s)
I can see from the console of machines that we're definitely adding autologin (so I compiled it right and set the KOLA_OPENSTACK_TEMP_HACK env var right), but that particular error check isn't where the failure is occurring.
Ahhh, can you try again with -v? (Edit: or change the Infof to fmt.Printf I guess)
This seems sufficiently useful to generalize slightly, i.e. rename the environment variable to KOLA_LEAK_INSTANCES or KOLA_SKIP_INSTANCE_TEARDOWN or something and keep it in the code.
(Well, I guess there is the openstack-specific bit where it prints the instance information, but we could do that a bit more generically?)
The autologin injection is tricky since we have some special cases that check e.g. "no Ignition provided". But maybe those could be explicitly opted out?
OK, I kinda spent a bit more time than I should've on this. But I think this can be a really useful tool in the kit for helping to debug test failures (esp. flakes).
Yeah, the SSH bit is tricky, even though these are just throwaway test VMs. The other approach I did was to just persist it to disk. But I'm trying to make it a convenient simple tool you can easily reach for. Persisting to disk is going to be really annoying if it's in the pipeline because it'd require oc login and oc exec just to get at the data. I have a similar concern for GPG in a secret.
Hmm, how about: we hardcode a public SSH key in kola, put the private key in BitWarden, and have kola add the key to all test VMs it spawns when KOLA_LEAK_ON_FAIL is on? That seems like the most user-friendly because it doesn't require any decryption either.
OK did that now!
How about we have KOLA_LEAK_ON_FAIL actually be the pubkey to use?
SGTM to start!
OK, updated this!
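For readers following along, a rough sketch of the "env var value is the pubkey" idea, not the actual change in this PR. `withLeakKey` and `looksLikePubkey` are hypothetical helpers, and the shape check is deliberately loose (a key type followed by a blob); real code would more likely parse the key properly.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// looksLikePubkey is a loose shape check only: a key type field followed
// by at least one more field. It does not validate the key material.
func looksLikePubkey(s string) bool {
	fields := strings.Fields(s)
	if len(fields) < 2 {
		return false
	}
	t := fields[0]
	return strings.HasPrefix(t, "ssh-") || strings.HasPrefix(t, "ecdsa-") || strings.HasPrefix(t, "sk-")
}

// withLeakKey returns keys plus the debugging key taken from envVal.
// An empty envVal means the feature is off and keys is returned as-is.
func withLeakKey(keys []string, envVal string) ([]string, error) {
	envVal = strings.TrimSpace(envVal)
	if envVal == "" {
		return keys, nil
	}
	if !looksLikePubkey(envVal) {
		return nil, fmt.Errorf("KOLA_LEAK_ON_FAIL does not look like an SSH public key")
	}
	// Copy so the caller's slice is never modified.
	out := append([]string{}, keys...)
	return append(out, envVal), nil
}

func main() {
	base := []string{"ssh-ed25519 AAAAexample kola-generated"}
	keys, err := withLeakKey(base, os.Getenv("KOLA_LEAK_ON_FAIL"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, k := range keys {
		fmt.Println(k)
	}
}
```

With this shape, the one variable both turns the feature on and says who gets access, and no private key has to be stored or decrypted anywhere.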
Follow-ups to this in https://github.com/coreos/fedora-coreos-pipeline/pull/419 and https://github.com/coreos/coreos-ci-lib/pull/93.
There is an alternative model here, which is to make it more something like KOLA_HANG_ON_FAIL/--hang-on-fail where kola just stops execution when a test fails so humans can debug into the machines, but once humans are done debugging, they can just resume kola. Kola retains ownership of the cloud resources so there's no need for humans to manually clean them up after.
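A minimal sketch of what that hang-on-fail model could look like, assuming a resume signal delivered over a channel. Nothing here exists in kola; `machine` and `teardownAfterTest` are hypothetical names, and reading a line from stdin is just one possible way to resume.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"os"
)

// errExample stands in for any test failure.
var errExample = errors.New("example failure")

// machine is a hypothetical stand-in for a provisioned test VM.
type machine struct {
	id        string
	destroyed bool
}

// destroy marks the machine as deprovisioned.
func (m *machine) destroy() { m.destroyed = true }

// teardownAfterTest pauses on failure until resume is signalled, then
// always destroys the machine, so nothing is left for manual cleanup.
func teardownAfterTest(m *machine, testErr error, hang bool, resume <-chan struct{}) {
	if testErr != nil && hang {
		fmt.Printf("test failed (%v); machine %s kept alive, press Enter to resume\n", testErr, m.id)
		<-resume
	}
	m.destroy()
}

func main() {
	hang := os.Getenv("KOLA_HANG_ON_FAIL") != ""

	// Signal resume once a line has been read from stdin.
	resume := make(chan struct{})
	go func() {
		bufio.NewReader(os.Stdin).ReadString('\n')
		close(resume)
	}()

	m := &machine{id: "example-vm"}
	teardownAfterTest(m, errExample, hang, resume)
	fmt.Println("destroyed:", m.destroyed)
}
```

The difference from leaking is that teardown still runs unconditionally; it is only delayed, which is why ownership of the cloud resources stays with kola.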
@jlebon: PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@jlebon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/images | 19fb5e4928d8006562cfa850621ec2d92ec0e57b | link | true | /test images |
| ci/prow/rhcos | 19fb5e4928d8006562cfa850621ec2d92ec0e57b | link | true | /test rhcos |
I think this could still be useful but will close it for now until we feel the need for similar functionality again.