Kola tests flake because reboots fail randomly
kola tests frequently cause the nightly build to fail because the test machine goes down for a reboot and doesn't come back up, for reasons unknown. console.txt shows the machine going down for the reboot and then ends.
Example $ tail console.txt output from a failed coreos.locksmith.reboot run:
[ 14.296903] EXT4-fs (vda9): re-mounted. Opts: data=ordered
[ 14.297540] systemd-shutdown[1]: Remounting '/usr' read-only with options 'seclabel,block_validity,delalloc,barrier,user_xattr,acl'.
[ 14.298362] EXT4-fs (dm-0): re-mounted. Opts: block_validity,delalloc,barrier,user_xattr,acl
[ 14.299057] systemd-shutdown[1]: Unmounting /usr.
[ 14.299377] systemd-shutdown[1]: Could not unmount /usr: Device or resource busy
[ 14.299857] systemd-shutdown[1]: Remounting '/' read-only with options 'seclabel,data=ordered'.
[ 14.300472] EXT4-fs (vda9): re-mounted. Opts: data=ordered
[ 14.306517] Unregister pv shared memory for cpu 0
[ 14.307076] reboot: Restarting system
[ 14.307332] reboot: machine restart
I've only observed this with the nightly builds. I'm currently trying to reproduce it locally and will update this bug once I can.
My guess is that qemu itself is dying for some reason.
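As a rough way to reproduce locally, something along these lines should exercise the failing test repeatedly and keep per-run logs (a sketch only; the ./bin/kola path and the exact run flags are assumptions and may differ by mantle version):

set -o pipefail
# Loop the failing test on the qemu platform, saving each run's output.
for i in $(seq 1 50); do
    ./bin/kola run -p qemu coreos.locksmith.reboot 2>&1 | tee "run-$i.log" \
        || echo "run $i failed" >> failures.txt
done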
Is this a qemu-specific problem?
It looks like it - I can't find any failures of this kind on other platforms. qemu_uefi is hard to judge, since those runs time out half the time (woo, another bug).
I'm fairly sure this is the reason we get a lot of weird test flakes, especially the coreos.verity.* ones.
Looking at console output from failed nightly builds now that https://github.com/coreos/mantle/pull/775 has merged, it appears qemu is either exiting with status 0, being killed by mistake, or getting OOM-killed. However, logs from the Jenkins worker do not show any OOM kills.
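For reference, a check along these lines on the worker should surface any OOM kills around the failure window (the commands and time range are assumptions about the worker setup, not a record of what was actually run):

# Kernel messages from the journal and the ring buffer; adjust the window as needed.
journalctl -k --since "2 days ago" | grep -iE 'out of memory|oom-killer'
dmesg | grep -iE 'killed process|out of memory'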
Confirmed that kola does log stderr from qemu to its own stderr as well.
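Given that, redirecting kola's own stderr to a file while reproducing should capture any crash output qemu prints before dying (same assumed binary path and flags as the sketch above):

./bin/kola run -p qemu coreos.locksmith.reboot >kola-stdout.log 2>kola-stderr.log
grep -i qemu kola-stderr.log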