roachtest: fail tests if monitor encounters an error
This commit updates the roachprod and roachtest monitors to 1) send an event when the monitor is abruptly terminated (i.e., the reader stream sees an EOF while the associated context is not canceled); and 2) return any errors encountered by the roachprod monitor in roachtest, causing the currently running test to fail. The error has TestEng ownership so that teams are not pinged on these kinds of flakes.
The main purpose of this change is for the monitor to fail in situations where the monitored node is preempted by the cloud provider. Previously, these events were ignored, so the test would run until it timed out, wasting resources and causing confusing test failures to be reported on GitHub.
Fixes: #118563.
Release note: None
Verified that the logic here works by simulating a VM preemption (i.e., running a roachtest and manually deleting one of the VMs on GCE). 0.1 build is also currently in progress.
Do others think this approach is reasonable? I also played with a different approach where roachtest continuously monitors for preempted VMs, but I think it's more general to have the monitor cause the test to fail on errors, which should fix the timeouts we observed with VM preemption.
Let me know!
I like the current approach for its parsimony! It's also more general, i.e., not specific to preemption. Thus, we can now monitor (pun intended) for these types of "infra flake" while at the same time reducing the noise due to unattributed test failures.
TFTR!
bors r=srosenberg