test: convert testsuite to be function based
Pull Request Description
The current testsuite consists of thousands of individual MPI test programs. Running the entire testsuite involves invoking the process manager to spawn MPI processes, and each process goes through MPI_INIT again and again. Both process spawning and MPI initialization are very slow compared to the tested MPI operation itself. The current testsuite runs for a couple of hours, and we run hundreds of them every day.
This PR attempts to convert individual tests into functions, so multiple tests can be tested within a single MPI_Init/Finalize window. I believe this can significantly reduce the CI testing time.
Design goals
- The current testsuite workflow should still work. The make target for each individual test is still maintained by linking util/run_mpitests.o, which supplies the main function. Individual test targets should work exactly as before.
- An additional make target, run_mpitests, is provided by linking util/run_mpitests.o, all_mpitests.o, and all the individual test objects. run_mpitests can be invoked to run multiple tests in a single mpirun invocation. We'll invoke it separately with different numbers of processes for test coverage.
- To keep the current workflow as much as possible, run_mpitests will be driven by the Perl script runtests, and thus uses the same testlist management. runtests still runs the legacy tests by invoking mpirun individually; for new tests, it will aggregate them and invoke run_mpitests. Each test and its arguments are piped in using stdin, and the results are piped out using stdout.
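One way the aggregation step could look, sketched in Python instead of the Perl of runtests. The testlist line format assumed here (name, process count, then optional arguments), the test names, and the helpers group_by_np and stdin_payload are all assumptions for illustration; the actual format exchanged between runtests and run_mpitests is not described above.

```python
from collections import defaultdict

def group_by_np(testlist_lines):
    """Group tests by process count so each group needs only one launch."""
    groups = defaultdict(list)
    for line in testlist_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, np, *args = line.split()
        groups[int(np)].append((name, args))
    return dict(sorted(groups.items()))

def stdin_payload(tests):
    """One test per line: the name followed by its arguments (assumed format)."""
    return "".join(" ".join([name, *args]) + "\n" for name, args in tests)

# Made-up testlist entries, not taken from the real testsuite.
lines = ["demo_a 1", "demo_b 2", "# a comment", "demo_c 1 -n=4"]
groups = group_by_np(lines)
for np, tests in groups.items():
    print(f"run_mpitests np={np}, {len(tests)} tests...")
    print(stdin_payload(tests), end="")
```

Grouping by process count is what lets one launch serve many tests: every test in a group can share the same set of processes.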
Challenges
- [x] TIMEOUT -- use alarm()
- [x] Crashes -- will restart run_mpitests at the next test index
- [x] CVARs -- run_mpitests to accept commands to set and reset CVAR
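The timeout item relies on the default action of SIGALRM, which terminates the process. A small Python sketch of that behaviour on a POSIX system follows. The child script and run_child are illustrative only and are not part of the PR: the child arms an alarm and then hangs, and the parent observes that it was killed by the alarm signal.

```python
import signal
import subprocess
import sys

# Child program: arm a 1-second alarm, then hang like a stuck test.
CHILD = """
import signal, time
signal.alarm(1)   # default SIGALRM action terminates the process
time.sleep(30)    # stand-in for a test that never finishes
"""

def run_child():
    """Run the child and return its exit status as seen by the parent."""
    return subprocess.run([sys.executable, "-c", CHILD]).returncode

rc = run_child()
# subprocess reports death-by-signal as the negated signal number.
print("killed by SIGALRM:", rc == -signal.SIGALRM)
```

Because the process is killed rather than asked to stop, no cooperation from the hung test is needed; the cost is that the whole batch process goes down with it, which is why the restart item exists.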
Status
- Only one test converted: attr/attrend, but the workflow can be tested now.
Fixes #2142
Author Checklist
- [x] Provide Description
  Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [x] Commits Follow Good Practice
  Commits are self-contained and do not do two things at once.
  Commit message is of the form: module: short description
  Commit message explains what's in the commit.
- [ ] Passes All Tests
  Whitespace checker. Warnings test. Additional tests via comments.
- [x] Contribution Agreement
  For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
test:mpich/ch3/tcp ✔️ test:mpich/ch4/ofi ❌
test:mpich/ch3/tcp test:mpich/ch4/ofi ✔️
Good, except that ubsan flags something to fix in the posix release_gather code. Will fix in the next push.
Comparing the timing:
- old ch4:ofi:
Looking in ./testlist [00:00:00]
Processing directory attr
Looking in ./attr/testlist [00:00:00]
Looking in ./attr/testlist.dtp [00:00:38]
Processing directory coll
Looking in ./coll/testlist [00:00:42]
Looking in ./coll/testlist.dtp [00:09:04]
Looking in ./coll/testlist.cvar [00:12:02]
Processing directory comm
Looking in ./comm/testlist [00:44:22]
...
- new ch4:ofi:
Running tests in ./attr/testlist [20 tests - 00:00:00]
run_mpitests np=1, 17 tests...
run_mpitests np=2, 1 tests...
run_mpitests np=4, 2 tests...
Running tests in ./attr/testlist.dtp [2 tests - 00:00:02]
run_mpitests np=1, 2 tests...
Running tests in ./coll/testlist [185 tests - 00:00:03]
run_mpitests np=1, 6 tests...
run_mpitests np=2, 9 tests...
run_mpitests np=4, 62 tests...
run_mpitests np=5, 35 tests...
run_mpitests np=6, 2 tests...
run_mpitests np=7, 14 tests...
run_mpitests np=8, 28 tests...
run_mpitests np=10, 29 tests...
Running tests in ./coll/testlist.cvar [867 tests - 00:00:57]
run_mpitests np=1, 6 tests...
run_mpitests np=2, 6 tests...
run_mpitests np=4, 243 tests...
run_mpitests np=5, 138 tests...
run_mpitests np=6, 10 tests...
run_mpitests np=7, 67 tests...
run_mpitests np=8, 97 tests...
run_mpitests np=10, 300 tests...
Running tests in ./coll/testlist.dtp [12 tests - 00:01:56]
run_mpitests np=4, 6 tests...
run_mpitests np=10, 6 tests...
Running tests in ./attr/testlist [2 tests - 00:05:28]
Running tests in ./coll/testlist [4 tests - 00:05:30]
Running tests in ./comm/testlist [45 tests - 00:06:33]
This means we shortened the testing time of (attr+coll) from 44:22 down to 6:33, nearly 7-fold. The dtp tests still take time; that's unavoidable. But for small tests, e.g. the tests in attr/, it's 38 sec -> 3 sec.
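The ratio quoted here follows from the timestamps at which each log reaches comm/testlist. A quick check in Python (the helper name seconds is just for illustration):

```python
def seconds(stamp):
    """Convert an 'MM:SS' or 'HH:MM:SS' stamp to seconds."""
    total = 0
    for part in stamp.split(":"):
        total = total * 60 + int(part)
    return total

old = seconds("00:44:22")  # old run: elapsed time when comm/ starts
new = seconds("00:06:33")  # new run: elapsed time when comm/ starts
speedup = old / new
print(f"attr+coll: {old}s -> {new}s, {speedup:.1f}x faster")
# prints "attr+coll: 2662s -> 393s, 6.8x faster"
```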
ch4:ofi-centos64 total testing time drops from 1hr47'16" to 1hr16'2".
ch3:tcp-centos64 total testing time drops from 51'34" to 31'4".
On ch3:tcp-solaris, attr+coll drops from 1hr4'37" to 9'37"! However, the rest of the tests (with no changes) took 1 hour longer than in the comparison run. I guess runtime congestion played a significant role on Solaris.
Reference for solaris slow init: https://github.com/pmodels/mpich/pull/5645#issuecomment-995039127
Added a timeout. Experimenting with setting the default timeout to 10s:
~/work/pull_requests/2111_mpitests/test/mpi/coll$ make MPITEST_TIMEOUT=10 testing
../runtests -srcdir=. -tests=testlist,testlist.dtp,testlist.cvar -testdirs= \
-mpiexec="/home/hzhou/MPI/bin/mpiexec" -xmlfile=summary.xml \
-tapfile=summary.tap -junitfile=summary.junit.xml
Load tests in .
Running tests in ./testlist [185 tests - 00:00:00]
run_mpitests np=1, 6 tests...
run_mpitests np=2, 9 tests...
run_mpitests np=4, 62 tests...
run_mpitests np=5, 35 tests...
run_mpitests np=6, 2 tests...
run_mpitests np=7, 14 tests...
run_mpitests np=8, 28 tests...
run_mpitests np=10, 29 tests...
run_mpitests exited unexpectedly [coll/nonblocking3]
Failed test: coll/nonblocking3
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 834322 RUNNING AT Tiger
= EXIT CODE: 14
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Alarm clock (signal 14)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
run_mpitests np=10, 29 tests, continue at 25...
Running tests in ./testlist.cvar [867 tests - 00:00:25]
run_mpitests np=1, 6 tests...
run_mpitests np=2, 6 tests...
run_mpitests np=4, 243 tests...
run_mpitests np=5, 138 tests...
run_mpitests np=6, 10 tests...
run_mpitests np=7, 67 tests...
run_mpitests np=8, 97 tests...
run_mpitests np=10, 300 tests...
Running tests in ./testlist.dtp [12 tests - 00:00:39]
run_mpitests np=4, 6 tests...
run_mpitests np=10, 6 tests...
Running tests in ./testlist [3 tests - 00:03:04]
1 tests failed out of 1067 (total runtime: 3 min 6 sec)
Details in /home/hzhou/work/pull_requests/2111_mpitests/test/mpi/coll/summary.xml
TAP formatted results in /home/hzhou/work/pull_requests/2111_mpitests/test/mpi/coll/summary.tap
JUNIT formatted results in /home/hzhou/work/pull_requests/2111_mpitests/test/mpi/coll/summary.junit.xml
The dtp tests didn't time out because most of them have a per-test timeout.
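The "continue at 25" line in the log reflects the restart rule: when a batch dies, the test it was running is marked as failed and a new batch starts at the next index. A hedged Python sketch of that bookkeeping follows. run_with_restart and fake_batch are hypothetical names, and fake_batch only simulates a process that dies; it does not launch anything.

```python
def run_with_restart(tests, run_batch):
    """Run all tests, restarting after the failing index when a batch dies.

    run_batch(tests, start) returns the results of the tests it finished,
    beginning at index start; a short list means the process died on the
    next test.
    """
    results = {}
    start = 0
    while start < len(tests):
        finished = run_batch(tests, start)
        for offset, status in enumerate(finished):
            results[tests[start + offset]] = status
        start += len(finished)
        if start < len(tests):          # batch died on tests[start]
            results[tests[start]] = "fail"
            start += 1                  # continue at the next index
    return results

def fake_batch(tests, start):
    """Simulated batch that dies when it reaches the test named 'hang'."""
    finished = []
    for name in tests[start:]:
        if name == "hang":
            break
        finished.append("pass")
    return finished

outcome = run_with_restart(["a", "hang", "b", "c"], fake_batch)
print(outcome)
```

Every pass through the loop either finishes the list or skips past one failed test, so the driver always terminates, even if several tests in a batch crash.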
test:mpich/ch3/tcp ❌ test:mpich/ch4/ofi ✔️
With ch3-tcp: 5 failures on freebsd64:
summary_junit_xml.1191 - ./rma/atomic_rmw_gacc 3 | 0.29 sec | 1
summary_junit_xml.1192 - ./rma/atomic_get 3 -pairtype=short | 1.4 sec | 1
summary_junit_xml.1194 - ./rma/atomic_get 3 -pairtype=long | 1.3 sec | 1
summary_junit_xml.1196 - ./rma/atomic_get 3 -pairtype=double | 1.3 sec | 1
summary_junit_xml.1284 - ./rma/mutex_bench 4 -use-alloc-shm -use-contig-rank | 3 min 0 sec | 1
When FreeBSD frees a window and recreates it, it appears that the interprocess mutex no longer works.
Dramatic speedup on Solaris: from 97 minutes down to 50 minutes. It's 23 minutes on CentOS.
Does this change anything for the xfail system?
No.
test:mpich/ch3/tcp test:mpich/ch4/ofi ✔️