coreos-assembler

kola: add KOLA_LEAK_ON_FAIL

Open jlebon opened this issue 4 years ago • 15 comments

Sometimes, we hit test flakes that are hard to reproduce manually or that are cumbersome to set up the same way a test does. Add a new KOLA_LEAK_ON_FAIL env var which will cause kola to:

  • enable console auto-login
  • print the SSH key on the console
  • not deprovision the machine if the test fails

This allows an engineer to dig deeper and poke at the VM when the failure happens, either over SSH or through the serial console. We could then expose this in the pipeline as a parameter so that we can do custom runs with the variable enabled (and manually clean up any leaked VMs after investigating).

This is more powerful than --ssh-on-test-failure because it applies to any failure, including provisioning failures; it allows debugging via the serial console if SSH itself is broken; and it's feasible to use in a pipeline.
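
To make the teardown behavior described above concrete, here is a minimal sketch in Go. It is not the patch itself: the `machine` type and the `finish` and `leakOnFail` helpers are hypothetical stand-ins, and the only name taken from this thread is the `KOLA_LEAK_ON_FAIL` variable. It shows only the "do not deprovision on failure" decision, not the auto-login or key-printing parts.

```go
package main

import (
	"fmt"
	"os"
)

// machine is a hypothetical stand-in for a test VM owned by the harness.
type machine struct {
	id        string
	destroyed bool
}

// destroy simulates deprovisioning the VM.
func (m *machine) destroy() { m.destroyed = true }

// leakOnFail reports whether KOLA_LEAK_ON_FAIL is set to a non-empty value.
func leakOnFail() bool {
	return os.Getenv("KOLA_LEAK_ON_FAIL") != ""
}

// finish deprovisions the machine unless the test failed and leaking was
// requested. It returns true when the machine was left running.
func finish(m *machine, testFailed, leak bool) bool {
	if testFailed && leak {
		fmt.Printf("leaking machine %s for debugging; clean it up manually\n", m.id)
		return true
	}
	m.destroy()
	return false
}

func main() {
	m := &machine{id: "example-vm"}
	// Pretend the test failed; whether the VM survives depends on the env var.
	leaked := finish(m, true, leakOnFail())
	fmt.Println("leaked:", leaked, "destroyed:", m.destroyed)
}
```

Passing tests are always torn down in this sketch, so only failing runs leave resources behind for manual cleanup.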

jlebon avatar Oct 21 '21 16:10 jlebon

Converting to draft.

@dustymabe said he's willing to try this out locally with this patch (thanks!). So let's try that first. If for whatever reason the only way to reliably reproduce this is in CI, we can always upload a custom cosa image to Quay.

jlebon avatar Oct 21 '21 16:10 jlebon

I think maybe this patch needs to be tweaked slightly:

--- PASS: ext.config.gshadow (135.60s)
--- FAIL: ostree.unlock (738.00s)
        harness.go:1187: Cluster failed starting machines: machine "420c0aa5-87f1-47b1-83d6-024c4953d312" failed to start: ssh journalctl failed: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
--- PASS: rpmostree.install-uninstall (539.54s)
    --- PASS: rpmostree.install-uninstall/install (208.05s)
    --- PASS: rpmostree.install-uninstall/uninstall (134.33s)
--- PASS: podman.network-single (389.09s)

I can see from the console of machines that we're definitely adding autologin (so I compiled it right and set the KOLA_OPENSTACK_TEMP_HACK env var right), but that particular error check isn't where the failure is occurring.

dustymabe avatar Oct 21 '21 17:10 dustymabe

Ahhh, can you try again with -v? (Edit: or change the Infof to fmt.Printf I guess)

jlebon avatar Oct 21 '21 19:10 jlebon

This seems sufficiently useful to generalize slightly, i.e. rename the environment variable to KOLA_LEAK_INSTANCES or KOLA_SKIP_INSTANCE_TEARDOWN or something, and keep it in the code.

cgwalters avatar Oct 21 '21 20:10 cgwalters

(Well, I guess there is the openstack-specific bit where it prints the instance information, but we could do that a bit more generically?)

The autologin injection is tricky since we have some special cases that check e.g. "no Ignition provided". But maybe those could be explicitly opted out?
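
One way the explicit opt-out could look, as a rough Go sketch. The `testCase` type, its `SkipAutologin` field, and the test names are hypothetical and are not taken from kola; the point is only that a test which must boot with its configuration untouched can decline the debug injection.

```go
package main

import "fmt"

// testCase is a hypothetical, simplified test description.
type testCase struct {
	Name          string
	SkipAutologin bool // hypothetical opt-out for tests that need an untouched config
}

// shouldInjectAutologin reports whether the debug auto-login snippet should
// be added for this test. Injection happens only in debug mode and only for
// tests that have not opted out.
func shouldInjectAutologin(t testCase, debug bool) bool {
	return debug && !t.SkipAutologin
}

func main() {
	tests := []testCase{
		{Name: "example.basic"},
		{Name: "example.no-ignition", SkipAutologin: true},
	}
	for _, t := range tests {
		fmt.Printf("%s: inject=%v\n", t.Name, shouldInjectAutologin(t, true))
	}
}
```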

cgwalters avatar Oct 21 '21 20:10 cgwalters

OK, I kinda spent a bit more time than I should've on this. But I think this can be a really useful tool in the kit for helping to debug test failures (esp. flakes).

jlebon avatar Oct 28 '21 21:10 jlebon

Yeah, the SSH bit is tricky, even though these are just throwaway test VMs. The other approach I did was to just persist it to disk. But I'm trying to make it a convenient simple tool you can easily reach for. Persisting to disk is going to be really annoying if it's in the pipeline because it'd require oc login and oc exec just to get at the data. I have a similar concern for GPG in a secret.

Hmm, how about: we hardcode a public SSH key in kola, put the private key in BitWarden, and have kola add the key to all test VMs it spawns when KOLA_LEAK_ON_FAIL is on? That seems like the most user-friendly because it doesn't require any decryption either.

jlebon avatar Oct 29 '21 16:10 jlebon

Hmm, how about: we hardcode a public SSH key in kola, put the private key in BitWarden, and have kola add the key to all test VMs it spawns when KOLA_LEAK_ON_FAIL is on?

OK did that now!

jlebon avatar Oct 29 '21 16:10 jlebon

How about we have KOLA_LEAK_ON_FAIL actually be the pubkey to use?
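
As an illustration of that idea, the sketch below treats the value of `KOLA_LEAK_ON_FAIL` as an SSH public key and renders an Ignition-style fragment that authorizes it. This is an assumption-laden example, not how kola does it: the `debugKeyConfig` helper is hypothetical, the `core` user and spec version `3.0.0` are assumed, and the key is not validated beyond being non-empty.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// user mirrors the passwd.users entries of an Ignition v3 config.
type user struct {
	Name              string   `json:"name"`
	SSHAuthorizedKeys []string `json:"sshAuthorizedKeys"`
}

// config holds only the fields this sketch needs.
type config struct {
	Ignition struct {
		Version string `json:"version"`
	} `json:"ignition"`
	Passwd struct {
		Users []user `json:"users"`
	} `json:"passwd"`
}

// debugKeyConfig returns a JSON fragment authorizing pubkey for the "core"
// user (an assumption). It returns ok=false when pubkey is empty.
func debugKeyConfig(pubkey string) (string, bool) {
	pubkey = strings.TrimSpace(pubkey)
	if pubkey == "" {
		return "", false
	}
	var c config
	c.Ignition.Version = "3.0.0"
	c.Passwd.Users = []user{{Name: "core", SSHAuthorizedKeys: []string{pubkey}}}
	out, err := json.Marshal(c)
	if err != nil {
		return "", false
	}
	return string(out), true
}

func main() {
	if cfg, ok := debugKeyConfig(os.Getenv("KOLA_LEAK_ON_FAIL")); ok {
		fmt.Println(cfg)
	} else {
		fmt.Println("KOLA_LEAK_ON_FAIL not set; no debug key added")
	}
}
```

With this shape the variable doubles as the on/off switch: an empty value means no leaking and no extra key, and no private key has to be stored anywhere on the kola side.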

jlebon avatar Nov 01 '21 17:11 jlebon

How about we have KOLA_LEAK_ON_FAIL actually be the pubkey to use?

SGTM to start!

cgwalters avatar Nov 01 '21 19:11 cgwalters

OK, updated this!

jlebon avatar Nov 01 '21 19:11 jlebon

Follow-ups to this in https://github.com/coreos/fedora-coreos-pipeline/pull/419 and https://github.com/coreos/coreos-ci-lib/pull/93.

jlebon avatar Nov 01 '21 20:11 jlebon

There is an alternative model here, which is to make it more like KOLA_HANG_ON_FAIL/--hang-on-fail, where kola just stops execution when a test fails so humans can debug the machines. Once they are done debugging, they can resume kola. Kola retains ownership of the cloud resources, so there's no need for humans to manually clean them up afterwards.
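
A rough Go sketch of this hang-then-resume model follows. Everything in it is hypothetical (the `cluster` type, `runWithHang`, and the use of Enter on stdin as the resume signal); `KOLA_HANG_ON_FAIL` is the name floated above and does not exist yet. The key property is that teardown is deferred, so the harness always cleans up once it is resumed.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"os"
)

// errSimulated stands in for a real test failure.
var errSimulated = errors.New("simulated test failure")

// cluster is a hypothetical stand-in for the resources a test run owns.
type cluster struct{ destroyed bool }

// destroy simulates releasing the cloud resources.
func (c *cluster) destroy() { c.destroyed = true }

// runWithHang runs test. If it fails and hang is set, execution blocks until
// resume is closed or receives a value. Teardown always runs afterwards, so
// the harness keeps ownership of the resources.
func runWithHang(c *cluster, test func() error, hang bool, resume <-chan struct{}) error {
	defer c.destroy()
	err := test()
	if err != nil && hang {
		fmt.Fprintf(os.Stderr, "test failed: %v; machines kept until resume\n", err)
		<-resume
	}
	return err
}

func main() {
	resume := make(chan struct{})
	go func() {
		// Resume when the operator presses Enter (or stdin is closed).
		_, _ = bufio.NewReader(os.Stdin).ReadString('\n')
		close(resume)
	}()

	c := &cluster{}
	hang := os.Getenv("KOLA_HANG_ON_FAIL") != ""
	err := runWithHang(c, func() error { return errSimulated }, hang, resume)
	fmt.Println("error:", err, "destroyed:", c.destroyed)
}
```

Compared with leaking, nothing is left behind for manual cleanup, at the cost of the run holding its resources for as long as someone is debugging.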

jlebon avatar Nov 09 '21 15:11 jlebon

@jlebon: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] avatar Dec 22 '21 23:12 openshift-ci[bot]

@jlebon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name      | Commit                                   | Details | Required | Rerun command
ci/prow/images | 19fb5e4928d8006562cfa850621ec2d92ec0e57b | link    | true     | /test images
ci/prow/rhcos  | 19fb5e4928d8006562cfa850621ec2d92ec0e57b | link    | true     | /test rhcos

openshift-ci[bot] avatar Feb 02 '22 22:02 openshift-ci[bot]

I think this could still be useful but will close it for now until we feel the need for similar functionality again.

jlebon avatar Sep 08 '23 15:09 jlebon