mantle

Kola tests flake because reboots fail randomly

Open ajeddeloh opened this issue 8 years ago • 4 comments

Kola tests frequently cause the nightly build to fail because the machine goes down for a reboot and, for unknown reasons, never comes back up. console.txt shows the machine going down for the reboot and then ends.

Example $ tail console.txt from a failed coreos.locksmith.reboot:

[   14.296903] EXT4-fs (vda9): re-mounted. Opts: data=ordered
[   14.297540] systemd-shutdown[1]: Remounting '/usr' read-only with options 'seclabel,block_validity,delalloc,barrier,user_xattr,acl'.
[   14.298362] EXT4-fs (dm-0): re-mounted. Opts: block_validity,delalloc,barrier,user_xattr,acl
[   14.299057] systemd-shutdown[1]: Unmounting /usr.
[   14.299377] systemd-shutdown[1]: Could not unmount /usr: Device or resource busy
[   14.299857] systemd-shutdown[1]: Remounting '/' read-only with options 'seclabel,data=ordered'.
[   14.300472] EXT4-fs (vda9): re-mounted. Opts: data=ordered
[   14.306517] Unregister pv shared memory for cpu 0
[   14.307076] reboot: Restarting system
[   14.307332] reboot: machine restart

I've only observed this with the nightly builds. I'm currently trying to reproduce it locally and will update this bug once I can.

My guess is that qemu itself is dying for some reason.

ajeddeloh avatar Nov 21 '17 22:11 ajeddeloh

Is this a qemu-specific problem?

bgilbert avatar Nov 28 '17 01:11 bgilbert

It looks like it - I can't find any failures of this kind on other platforms. qemu_uefi is hard to tell since those time out half the time (woo, another bug).

I'm fairly sure this is the reason we get a lot of weird test flakes, especially the coreos.verity.* ones.

ajeddeloh avatar Nov 28 '17 19:11 ajeddeloh

Looking at console output from failed nightly builds now that https://github.com/coreos/mantle/pull/775 has merged, it appears qemu is either exiting with status 0, being killed in error, or getting OOM-killed. Logs from the Jenkins worker do not show it being OOM-killed.

ajeddeloh avatar Dec 05 '17 19:12 ajeddeloh

Confirmed kola does log stderr from qemu to stderr as well.

ajeddeloh avatar Dec 07 '17 00:12 ajeddeloh