cabal icon indicating copy to clipboard operation
cabal copied to clipboard

`PackageTests/NewBuild/CmdRun/Terminate` brittle, may fail on CI

Open andreasabel opened this issue 3 years ago • 9 comments

The test PackageTests/NewBuild/CmdRun/Terminate has been caught to fail on CI. Maybe it has races? There was a previous attempt to fix this issue:

  • #8080

Ping @robx .

https://github.com/haskell/cabal/runs/7945613288?check_suite_focus=true#step:17:35

...
threads: 2
tests to run: 427
PackageTests/NewBuild/CmdRun/Terminate/cabal.test.hs                                                              FAIL (12.33s)
...
Building executable 'exe' for RunKill-1.0..
...
Linking /home/runner/work/cabal/cabal/cabal-testsuite/PackageTests/NewBuild/CmdRun/Terminate/dist-newstyle/build/x86_64-linux/ghc-8.8.4/RunKill-1.0/x/exe/build/exe/exe ...
+ /home/runner/work/cabal/cabal/dist-newstyle-validate-ghc-8.8.4/build/x86_64-linux/ghc-8.8.4/cabal-install-3.9.0.0/x/cabal/build/cabal/cabal v2-run exe
...
about to sleep
done sleeping
exiting
cabal: waitForProcess: does not exist (No child processes)
terminated

stderr:
*** Exception: unexpected: done sleeping
CallStack (from HasCallStack):
  assertOutputDoesNotContain, called at PackageTests/NewBuild/CmdRun/Terminate/cabal.test.hs:35:3 in main:Main

*** unexpected failure for PackageTests/NewBuild/CmdRun/Terminate/cabal.test.hs

andreasabel avatar Aug 22 '22 07:08 andreasabel

Judging from the ticket and the error line (34 vs. 35), the two runs are slightly different:

Up to date
terminated

vs.

about to sleep
done sleeping
exiting
cabal: waitForProcess: does not exist (No child processes)
terminated

ffaf1 avatar Aug 22 '22 08:08 ffaf1

Thanks for catching this!

Judging from the ticket and the error line (34 vs. 35), the two runs are slightly different:

The test has changed in the meantime, so a changed output is to be expected. I do think the fix was good, and this is separate issue.

It would be good to know how often this happens, has it been observed more often than this once? It might be something as annoying as the write to disk taking more than a couple of seconds, or the controlling process randomly being paused for multiple seconds due to OS scheduling.

I think the best next step to narrow this down would be to instrument the test a bit with timestamped logging, unless this can be reproduced somewhat reliably locally.

robx avatar Aug 22 '22 08:08 robx

@andreasabel Do you have any info on how frequent this is?

robx avatar Aug 22 '22 14:08 robx

@andreasabel Do you have any info on how frequent this is?

I just caught it and reported it, don't know how frequently it occurs.

andreasabel avatar Aug 22 '22 16:08 andreasabel

Ah good then, considering the "high priority" label I was worried that things might be on fire.

robx avatar Aug 22 '22 21:08 robx

I added "priority: high" because CI shouldn't fail (as this will stop the whole merge process...).

andreasabel avatar Aug 23 '22 21:08 andreasabel

I haven't seen this test fail lately. Is there a way to grep CI for this failure?

I think this may no longer be relevant.

alt-romes avatar Feb 16 '24 15:02 alt-romes

Seen it 2024-04-12 in

  • #9886

andreasabel avatar Apr 12 '24 09:04 andreasabel

Hit it in #9949: https://github.com/haskell/cabal/actions/runs/8877985809/job/24372711684?pr=9949

ulysses4ever avatar Apr 29 '24 12:04 ulysses4ever