test: convert testsuite to be function based

Open hzhou opened this issue 4 years ago • 7 comments

Pull Request Description

The current testsuite consists of thousands of individual MPI test programs. Running the entire testsuite means invoking the process manager to spawn MPI processes for every test, and each process goes through MPI_INIT again and again. Both process spawning and MPI initialization are very slow compared to the MPI operation under test. The current testsuite takes a couple of hours per run, and we run hundreds of them every day.

This PR attempts to convert individual tests into functions, so that multiple tests can run within a single MPI_Init/Finalize window. I believe this can significantly reduce the CI testing time.

Design goals

  • The current testsuite workflow should still work. The make target for each individual test is retained by linking util/run_mpitests.o, which supplies the main function. Individual test targets should work exactly as before.

  • An additional make target -- run_mpitests -- is provided by linking util/run_mpitests.o, all_mpitests.o, and all the individual test objects. run_mpitests can be invoked to run multiple tests in a single mpirun invocation. We'll invoke it separately with different numbers of processes for test coverage.

  • To preserve the current workflow as much as possible, run_mpitests will be driven by the Perl script runtests, and therefore uses the same testlist management. runtests still runs the legacy tests by invoking mpirun individually, but for new tests it will aggregate them and invoke run_mpitests. Each test and its arguments are piped in via stdin, and the results are piped out via stdout.
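
A minimal sketch of the function-based design described in these goals, not the PR's actual code. It assumes each converted test becomes a function registered in a name table, and that the driver reads one "name args" line per test from an input stream and writes one result line per test. The names mpitest_fn, test_table, find_test, run_batch, the stand-in test bodies, and the line format are all hypothetical. MPI calls are left out so the sketch compiles without an MPI installation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical signature: a converted test takes its argument string
 * and returns an error count (0 means pass). */
typedef int (*mpitest_fn)(const char *args);

struct mpitest_entry {
    const char *name;
    mpitest_fn fn;
};

/* Stand-in test bodies; real ones would exercise MPI routines. */
static int attr_attrend(const char *args) { (void) args; return 0; }
static int always_fails(const char *args) { (void) args; return 1; }

/* Hypothetical registry, playing the role of an "all tests" object. */
static const struct mpitest_entry test_table[] = {
    {"attr/attrend", attr_attrend},
    {"demo/always_fails", always_fails},
};

/* Look up a test by name; returns NULL if it is not registered. */
static mpitest_fn find_test(const char *name)
{
    size_t n = sizeof(test_table) / sizeof(test_table[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(test_table[i].name, name) == 0)
            return test_table[i].fn;
    }
    return NULL;
}

/* Read "name [args]" lines from `in`, run each test, and write one
 * result line per test to `out`. Returns the number of tests that
 * did not pass (including unknown names). */
static int run_batch(FILE *in, FILE *out)
{
    char line[1024];
    int failures = 0;

    assert(in != NULL && out != NULL);
    /* MPI_Init would be called here, once for the whole batch. */
    while (fgets(line, sizeof(line), in)) {
        line[strcspn(line, "\r\n")] = '\0';
        if (line[0] == '\0')
            continue;

        /* Split the line at the first space into name and arguments. */
        const char *args = "";
        char *sp = strchr(line, ' ');
        if (sp) {
            *sp = '\0';
            args = sp + 1;
        }

        mpitest_fn fn = find_test(line);
        int errs = fn ? fn(args) : -1;
        if (errs != 0)
            failures++;
        fprintf(out, "%s: %s\n", line,
                errs == 0 ? "ok" : (fn ? "failed" : "unknown test"));
        fflush(out);    /* let the driving script see results promptly */
    }
    /* MPI_Finalize would be called here. */
    return failures;
}
```

A main function would simply call run_batch(stdin, stdout). Flushing after every result line is a deliberate choice in this sketch: if a later test crashes the process, the driving script has already received the results of the earlier ones.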

Challenges

  • [x] TIMEOUT -- use alarm()
  • [x] Crashes -- will restart run_mpitests at the next test index
  • [x] CVARs -- run_mpitests to accept commands to set and reset CVAR
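
The TIMEOUT item can be illustrated with a small sketch built on the POSIX alarm() call. This is an assumption about how a per-test watchdog could look, not the PR's implementation; the names test_fn, run_with_timeout, and quick_test are hypothetical.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical signature for a converted test: returns an error count. */
typedef int (*test_fn)(void);

/* Run one test under a watchdog. If the test is still running when the
 * timer expires, the process receives SIGALRM. With the default signal
 * disposition this terminates the process, so the launcher observes a
 * death by SIGALRM rather than a hang. */
static int run_with_timeout(test_fn fn, unsigned timeout_sec)
{
    assert(fn != NULL);
    alarm(timeout_sec);     /* arm; replaces any previously set alarm */
    int errs = fn();
    alarm(0);               /* disarm before the next test starts */
    return errs;
}

/* Stand-in test that finishes immediately. */
static int quick_test(void) { return 0; }
```

Calling run_with_timeout(quick_test, 10) returns 0 and leaves no alarm pending. Because the timer is re-armed for every test, a single slow test cannot consume the time budget of the tests that follow it in the same batch.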

Status

  • Only one test is converted so far (attr/attrend), but the workflow can be tested now.

Fixes #2142

Author Checklist

  • [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
  • [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

hzhou avatar Dec 09 '21 23:12 hzhou

test:mpich/ch3/tcp ✔️ test:mpich/ch4/ofi ❌

hzhou avatar Dec 22 '21 03:12 hzhou

test:mpich/ch3/tcp test:mpich/ch4/ofi ✔️ Good, except that ubsan catches an issue in the posix release_gather code. Will fix in the next push.

Comparing the timing:

  • old ch4:ofi:
Looking in ./testlist 	[00:00:00]
Processing directory attr
Looking in ./attr/testlist 	[00:00:00]
Looking in ./attr/testlist.dtp 	[00:00:38]
Processing directory coll
Looking in ./coll/testlist 	[00:00:42]
Looking in ./coll/testlist.dtp 	[00:09:04]
Looking in ./coll/testlist.cvar 	[00:12:02]
Processing directory comm
Looking in ./comm/testlist 	[00:44:22]
...
  • new ch4:ofi:
Running tests in ./attr/testlist [20 tests - 00:00:00]
    run_mpitests np=1, 17 tests...
    run_mpitests np=2, 1 tests...
    run_mpitests np=4, 2 tests...
Running tests in ./attr/testlist.dtp [2 tests - 00:00:02]
    run_mpitests np=1, 2 tests...
Running tests in ./coll/testlist [185 tests - 00:00:03]
    run_mpitests np=1, 6 tests...
    run_mpitests np=2, 9 tests...
    run_mpitests np=4, 62 tests...
    run_mpitests np=5, 35 tests...
    run_mpitests np=6, 2 tests...
    run_mpitests np=7, 14 tests...
    run_mpitests np=8, 28 tests...
    run_mpitests np=10, 29 tests...
Running tests in ./coll/testlist.cvar [867 tests - 00:00:57]
    run_mpitests np=1, 6 tests...
    run_mpitests np=2, 6 tests...
    run_mpitests np=4, 243 tests...
    run_mpitests np=5, 138 tests...
    run_mpitests np=6, 10 tests...
    run_mpitests np=7, 67 tests...
    run_mpitests np=8, 97 tests...
    run_mpitests np=10, 300 tests...
Running tests in ./coll/testlist.dtp [12 tests - 00:01:56]
    run_mpitests np=4, 6 tests...
    run_mpitests np=10, 6 tests...
Running tests in ./attr/testlist [2 tests - 00:05:28]
Running tests in ./coll/testlist [4 tests - 00:05:30]
Running tests in ./comm/testlist [45 tests - 00:06:33]

This means we shortened the testing time of attr+coll from 44:22 down to 6:33, nearly 7-fold. The dtp tests still take time; that's unavoidable. But for small tests, e.g. the tests in attr/, it's 38 sec -> 3 sec.
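
The "nearly 7-fold" figure can be checked from the two timestamps. The helper names to_seconds and speedup below are hypothetical and exist only to make the arithmetic explicit.

```c
#include <assert.h>

/* Convert an hh:mm:ss timestamp from the log into seconds. */
static int to_seconds(int h, int m, int s)
{
    assert(h >= 0 && m >= 0 && s >= 0);
    return h * 3600 + m * 60 + s;
}

/* Ratio of old to new elapsed time. */
static double speedup(int old_sec, int new_sec)
{
    assert(new_sec > 0);
    return (double) old_sec / new_sec;
}
```

Here to_seconds(0, 44, 22) is 2662 and to_seconds(0, 6, 33) is 393, so speedup(2662, 393) is about 6.77.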

  • ch4:ofi-centos64: total testing time shortens from 1hr47'16" down to 1hr16'2"
  • ch3:tcp-centos64: total testing time shortens from 51'34" down to 31'4"

On ch3:tcp-solaris, attr+coll shortens from 1hr4'37" down to 9'37"! However, the rest of the tests (which have no changes) took 1 hour longer than in the comparison run. I guess runtime congestion played a significant role on solaris.

Reference for solaris slow init: https://github.com/pmodels/mpich/pull/5645#issuecomment-995039127

hzhou avatar Dec 23 '21 05:12 hzhou

Added a timeout. An experiment with the default timeout set to 10s:

~/work/pull_requests/2111_mpitests/test/mpi/coll$ make MPITEST_TIMEOUT=10 testing
../runtests -srcdir=. -tests=testlist,testlist.dtp,testlist.cvar -testdirs= \
        -mpiexec="/home/hzhou/MPI/bin/mpiexec"  -xmlfile=summary.xml \
        -tapfile=summary.tap -junitfile=summary.junit.xml
Load tests in .
Running tests in ./testlist [185 tests - 00:00:00]
    run_mpitests np=1, 6 tests...
    run_mpitests np=2, 9 tests...
    run_mpitests np=4, 62 tests...
    run_mpitests np=5, 35 tests...
    run_mpitests np=6, 2 tests...
    run_mpitests np=7, 14 tests...
    run_mpitests np=8, 28 tests...
    run_mpitests np=10, 29 tests...
run_mpitests exited unexpectedly [coll/nonblocking3]
    Failed test: coll/nonblocking3

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 834322 RUNNING AT Tiger
=   EXIT CODE: 14
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Alarm clock (signal 14)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
    run_mpitests np=10, 29 tests, continue at 25...
Running tests in ./testlist.cvar [867 tests - 00:00:25]
    run_mpitests np=1, 6 tests...
    run_mpitests np=2, 6 tests...
    run_mpitests np=4, 243 tests...
    run_mpitests np=5, 138 tests...
    run_mpitests np=6, 10 tests...
    run_mpitests np=7, 67 tests...
    run_mpitests np=8, 97 tests...
    run_mpitests np=10, 300 tests...
Running tests in ./testlist.dtp [12 tests - 00:00:39]
    run_mpitests np=4, 6 tests...
    run_mpitests np=10, 6 tests...
Running tests in ./testlist [3 tests - 00:03:04]
1 tests failed out of 1067 (total runtime: 3 min 6 sec)
Details in /home/hzhou/work/pull_requests/2111_mpitests/test/mpi/coll/summary.xml
TAP formatted results in /home/hzhou/work/pull_requests/2111_mpitests/test/mpi/coll/summary.tap
JUNIT formatted results in /home/hzhou/work/pull_requests/2111_mpitests/test/mpi/coll/summary.junit.xml

The dtp tests didn't time out because most of them have a per-test timeout.
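
The "continue at 25" line in the log shows the restart-after-crash behavior. In this PR that logic lives in the Perl script runtests; the sketch below restates the idea in C with fork() and a pipe purely as an illustration. The names run_one_fn, supervise, demo_test, and always_ok, and the record format on the pipe, are hypothetical. The child process stands in for one run_mpitests invocation.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical signature: run test number `index`, return error count. */
typedef int (*run_one_fn)(int index);

/* Run tests 0..ntests-1 in a child process. If the child dies midway,
 * count the test it was running as failed and start a new child at the
 * following index. Returns the number of failed tests, or -1 if pipe()
 * or fork() fails. */
static int supervise(int ntests, run_one_fn run_one)
{
    assert(ntests >= 0 && run_one != NULL);
    int failures = 0;
    int start = 0;

    while (start < ntests) {
        int fds[2];
        if (pipe(fds) != 0)
            return -1;
        pid_t pid = fork();
        if (pid < 0) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
        if (pid == 0) {
            /* Child: report {index, errs} after each finished test. */
            close(fds[0]);
            for (int i = start; i < ntests; i++) {
                int rec[2] = { i, run_one(i) };
                if (write(fds[1], rec, sizeof(rec)) != (ssize_t) sizeof(rec))
                    _exit(2);
            }
            _exit(0);
        }

        /* Parent: collect results until the child closes the pipe. */
        close(fds[1]);
        int next = start;
        int rec[2];
        while (read(fds[0], rec, sizeof(rec)) == (ssize_t) sizeof(rec)) {
            if (rec[1] != 0)
                failures++;
            next = rec[0] + 1;
        }
        close(fds[0]);
        waitpid(pid, NULL, 0);

        if (next < ntests) {
            /* The child ended before reporting test `next`. */
            failures++;
            next++;
        }
        start = next;
    }
    return failures;
}

/* Demo: test 2 is killed by SIGALRM (as a timeout would do),
 * test 4 reports an ordinary failure, the rest pass. */
static int demo_test(int index)
{
    if (index == 2)
        raise(SIGALRM);
    return index == 4 ? 1 : 0;
}

/* Demo: every test passes. */
static int always_ok(int index) { (void) index; return 0; }
```

With the demo function, supervise(6, demo_test) returns 2: the first child dies at index 2, a second child resumes at index 3, and test 4 fails normally.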

hzhou avatar Dec 23 '21 17:12 hzhou

test:mpich/ch3/tcp ❌ test:mpich/ch4/ofi ✔️

With ch3-tcp: 5 failures on freebsd64:

summary_junit_xml.1191 - ./rma/atomic_rmw_gacc 3 | 0.29 sec | 1
summary_junit_xml.1192 - ./rma/atomic_get 3 -pairtype=short | 1.4 sec | 1
summary_junit_xml.1194 - ./rma/atomic_get 3 -pairtype=long | 1.3 sec | 1
summary_junit_xml.1196 - ./rma/atomic_get 3 -pairtype=double | 1.3 sec | 1
summary_junit_xml.1284 - ./rma/mutex_bench 4 -use-alloc-shm -use-contig-rank | 3 min 0 sec | 1

When freebsd frees a window and recreates it, it appears that the interprocess mutex no longer works.

Dramatic speedup on Solaris: from 97 minutes down to 50 minutes. It's 23 minutes on centos.

hzhou avatar Dec 28 '21 01:12 hzhou

Does this change anything for the xfail system?

wesbland avatar Jan 04 '22 16:01 wesbland

No.

hzhou avatar Jan 04 '22 16:01 hzhou

test:mpich/ch3/tcp test:mpich/ch4/ofi ✔️

hzhou avatar Jan 10 '22 21:01 hzhou