tmt icon indicating copy to clipboard operation
tmt copied to clipboard

Aborted test is sometimes reported as a passed test

Open psss opened this issue 2 years ago • 7 comments

It seems that tmt-abort sometimes does not work as expected when used with the virtual provision plugin and reports the abort test with a pass. Here's a simple config:

/plan:
    provision:
        how: virtual
    discover:
        how: fmf
    execute:
        how: tmt

/test:
    /1:
        test: /bin/true
    /2:
        test: tmt-abort
    /3:
        test: /bin/true

Report then looks like this:

report
    how: display
        pass /test/1
        pass /test/2
        pass /test/3
    summary: 3 tests passed

Adding some sleep after the tmt-abort command seems to make it working. But didn't have more time to investiage more properly. @pfdaly, could you possibly have a look?

psss avatar Sep 29 '23 12:09 psss

Interesting. @psss I'll start looking into this issue now.

pfdaly avatar Oct 02 '23 11:10 pfdaly

I performed some investigatory testing of this functionality, using the fmf file layout above for testing.

Successful execution of the tmt-abort per environment was the following:

  • local: 10/10
  • container: 10/10
  • virtual: 5/10

I updated the tmt-abort script to include a 'sleep 15' and also added a print statement into execute/internal.py to ensure that the correct test command was indeed being executed, which based on the output it is indeed.

Test Command: set -eo pipefail; tmt-abort

Upon successful execution abortion, I can see that the total test time is approximately equal to that of the sleep within the tmt-abort script.

Checking for abort file for test: /test/2
                00:00:14 errr /test/2 (on default-0) (aborted) [2/3]
        warn: Test /test/2 aborted, stopping execution

In instances where the test was not aborted, I can see that the total test time is zero.

Checking for abort file for test: /test/2
                00:00:00 pass /test/2 (on default-0) [2/3]

This led me to believe that following the transfer of the scripts to the test environment with rsync that the data had not been written to disk by the time the script was called during test execution. As such, I added an execution of the sync command for each script into provision/init.py:GuestSsh.push() and a sleep of 10

Workdir '/var/tmp/tmt/run-224/plan/execute/default-0' created.
        Copy '/usr/lib/python3.11/site-packages/tmt/steps/execute/scripts/tmt-abort' to '/usr/local/bin/tmt-abort' on the guest.
        ##### About to sync system
        ##### destination: /usr/local/bin/tmt-abort
        Execute command 'export TMT_PLAN_DATA=/var/tmp/tmt/run-224/plan/data; export TMT_TREE=/var/tmp/tmt/run-224/plan/tree; export TMT_VERSION=1.28.dev33+gf108f2b5; sync -d /usr/local/bin/tmt-abort' on guest '127.0.0.1'.
            cmd: sync -d /usr/local/bin/tmt-abort
        ##### Sleep value: 10

However, tmt-abort still occasionally does not abort successfully, and again the total runtime is incorrect.

cmd: tmt-abort
                out: Shared connection to 127.0.0.1 closed.
        Copy '/var/tmp/tmt/run-224/plan/execute/data/guest/default-0/test/2-2' from the guest to '/'.
        Extract results of '/test/2'.
        tmt-report-results file '/var/tmp/tmt/run-224/plan/execute/data/guest/default-0/test/2-2/data/restraint-result' does not exist.
        Checking for abort file for test: /test/2
                00:00:00 pass /test/2 (on default-0) [2/3]

I know the sync command is being executed on the test environment, as I checked that by adding in a call to the 'uname -a' command and confirmed it is indeed different to my localhost. out: Linux default-0 6.2.9-300.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Mar 30 22:32:58 UTC 2023 x86_64 GNU/Linux

Before this, I had tried to just add a sleep into a bash script which only calls the tmt-abort command and then sleeps for 10, but the tmt-abort once again failed to execute during execution.

cmd: ./test.sh
                out: Shared connection to 127.0.0.1 closed.
        Copy '/var/tmp/tmt/run-231/plan/execute/data/guest/default-0/test/2-2' from the guest to '/'.
        Extract results of '/test/2'.
        tmt-report-results file '/var/tmp/tmt/run-231/plan/execute/data/guest/default-0/test/2-2/data/restraint-result' does not exist.
        Checking for abort file for test: /test/2
                00:00:00 pass /test/2 (on default-0) [2/3]

If the tmt-abort command had of failed then I would have expected the runtime to be 10 on account of the subsequent sleep in the script.

Honestly, I'm unsure as to what is causing this issue.

pfdaly avatar Oct 04 '23 12:10 pfdaly

I've been able to reproduce the behavior. I added a couple of -vvv to involved SSH commands, but so far no sign of the cause.

happz avatar Oct 04 '23 14:10 happz

I finally started to looking into this one

thrix avatar Nov 28 '23 23:11 thrix

@psss @happz can you still reproduce this problem? I just tried with the plan from the description and:

❯ while true; do tmt run && break; done

And I see 40+ well aborted runs :( not sign of it passing

thrix avatar Nov 29 '23 14:11 thrix

Hm, interesting, I've tried this three times and I always get:

report
    how: display
        pass /test/1
        pass /test/2
        pass /test/3
    summary: 3 tests passed

psss avatar Nov 29 '23 15:11 psss

Hm, I was able to reproduce with virtual as well and on the guest the 'abort' file is not created.

However adding '-x' to the shabang in tmt-abort script and changing the reproducer to tmt-abort &> /tmp/ABORT made me unable to reproduce the problem again :/

lukaszachy avatar Jan 16 '24 10:01 lukaszachy

Hm, could this possibly be related to #2998?

psss avatar Jun 11 '24 08:06 psss

Confirmed, reproduced with provision -h local which runs faster, works nicely with the latest main.

psss avatar Jun 11 '24 08:06 psss