Aborted test is sometimes reported as a passed test
It seems that tmt-abort sometimes does not work as expected when used with the virtual provision plugin and reports the abort test with a pass. Here's a simple config:
/plan:
provision:
how: virtual
discover:
how: fmf
execute:
how: tmt
/test:
/1:
test: /bin/true
/2:
test: tmt-abort
/3:
test: /bin/true
Report then looks like this:
report
how: display
pass /test/1
pass /test/2
pass /test/3
summary: 3 tests passed
Adding some sleep after the tmt-abort command seems to make it working. But didn't have more time to investiage more properly. @pfdaly, could you possibly have a look?
Interesting. @psss I'll start looking into this issue now.
I performed some investigatory testing of this functionality, using the fmf file layout above for testing.
Successful execution of the tmt-abort per environment was the following:
- local: 10/10
- container: 10/10
- virtual: 5/10
I updated the tmt-abort script to include a 'sleep 15' and also added a print statement into execute/internal.py to ensure that the correct test command was indeed being executed, which based on the output it is indeed.
Test Command: set -eo pipefail; tmt-abort
Upon successful execution abortion, I can see that the total test time is approximately equal to that of the sleep within the tmt-abort script.
Checking for abort file for test: /test/2
00:00:14 errr /test/2 (on default-0) (aborted) [2/3]
warn: Test /test/2 aborted, stopping execution
In instances where the test was not aborted, I can see that the total test time is zero.
Checking for abort file for test: /test/2
00:00:00 pass /test/2 (on default-0) [2/3]
This led me to believe that following the transfer of the scripts to the test environment with rsync that the data had not been written to disk by the time the script was called during test execution. As such, I added an execution of the sync command for each script into provision/init.py:GuestSsh.push() and a sleep of 10
Workdir '/var/tmp/tmt/run-224/plan/execute/default-0' created.
Copy '/usr/lib/python3.11/site-packages/tmt/steps/execute/scripts/tmt-abort' to '/usr/local/bin/tmt-abort' on the guest.
##### About to sync system
##### destination: /usr/local/bin/tmt-abort
Execute command 'export TMT_PLAN_DATA=/var/tmp/tmt/run-224/plan/data; export TMT_TREE=/var/tmp/tmt/run-224/plan/tree; export TMT_VERSION=1.28.dev33+gf108f2b5; sync -d /usr/local/bin/tmt-abort' on guest '127.0.0.1'.
cmd: sync -d /usr/local/bin/tmt-abort
##### Sleep value: 10
However, tmt-abort still occasionally does not abort successfully, and again the total runtime is incorrect.
cmd: tmt-abort
out: Shared connection to 127.0.0.1 closed.
Copy '/var/tmp/tmt/run-224/plan/execute/data/guest/default-0/test/2-2' from the guest to '/'.
Extract results of '/test/2'.
tmt-report-results file '/var/tmp/tmt/run-224/plan/execute/data/guest/default-0/test/2-2/data/restraint-result' does not exist.
Checking for abort file for test: /test/2
00:00:00 pass /test/2 (on default-0) [2/3]
I know the sync command is being executed on the test environment, as I checked that by adding in a call to the 'uname -a' command and confirmed it is indeed different to my localhost.
out: Linux default-0 6.2.9-300.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Mar 30 22:32:58 UTC 2023 x86_64 GNU/Linux
Before this, I had tried to just add a sleep into a bash script which only calls the tmt-abort command and then sleeps for 10, but the tmt-abort once again failed to execute during execution.
cmd: ./test.sh
out: Shared connection to 127.0.0.1 closed.
Copy '/var/tmp/tmt/run-231/plan/execute/data/guest/default-0/test/2-2' from the guest to '/'.
Extract results of '/test/2'.
tmt-report-results file '/var/tmp/tmt/run-231/plan/execute/data/guest/default-0/test/2-2/data/restraint-result' does not exist.
Checking for abort file for test: /test/2
00:00:00 pass /test/2 (on default-0) [2/3]
If the tmt-abort command had of failed then I would have expected the runtime to be 10 on account of the subsequent sleep in the script.
Honestly, I'm unsure as to what is causing this issue.
I've been able to reproduce the behavior. I added a couple of -vvv to involved SSH commands, but so far no sign of the cause.
I finally started to looking into this one
@psss @happz can you still reproduce this problem? I just tried with the plan from the description and:
❯ while true; do tmt run && break; done
And I see 40+ well aborted runs :( not sign of it passing
Hm, interesting, I've tried this three times and I always get:
report
how: display
pass /test/1
pass /test/2
pass /test/3
summary: 3 tests passed
Hm, I was able to reproduce with virtual as well and on the guest the 'abort' file is not created.
However adding '-x' to the shabang in tmt-abort script and changing the reproducer to tmt-abort &> /tmp/ABORT made me unable to reproduce the problem again :/
Hm, could this possibly be related to #2998?
Confirmed, reproduced with provision -h local which runs faster, works nicely with the latest main.