ompi
Assertion `reserve > 0' failed running collective-big-count tests using v4.1.x branch and --mca coll adapt,basic,sm,self,inter,libnbc option
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI v4.1.x branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from current v4.1.x branch (3/22/22)
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
git submodule status does not display anything.
Please describe the system on which you are running
- Operating system/version: RHEL 8.4
- Computer hardware: Single Power8 node
- Network type: Localhost
Details of the problem
I ran the set of self-checking tests from ompi-tests-public/collective-big-count with collective components specified as --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc
The following environment variables were set for all tests:
BIGCOUNT_HOSTS: -np 3
BIGCOUNT_MEMORY_PERCENT: 70
BIGCOUNT_MEMORY_DIFF: 10
For instance, I ran this command
mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count
The command failed with this assertion and traceback:
test_allgather_uniform_count: ../../../../ompi/mca/coll/base/coll_base_util.h:73: ompi_coll_base_nbc_reserve_tags: Assertion `reserve > 0' failed.
[c656f6n01:2537658] *** Process received signal ***
[c656f6n01:2537658] Signal: Aborted (6)
[c656f6n01:2537658] Signal code: (-6)
[c656f6n01:2537658] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:2537658] [ 1] /lib64/libc.so.6(gsignal+0xd8)[0x2000003c44d8]
[c656f6n01:2537658] [ 2] /lib64/libc.so.6(abort+0x164)[0x2000003a462c]
[c656f6n01:2537658] [ 3] /lib64/libc.so.6(+0x37c70)[0x2000003b7c70]
[c656f6n01:2537658] [ 4] /lib64/libc.so.6(__assert_fail+0x64)[0x2000003b7d14]
[c656f6n01:2537658] [ 5] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x544c)[0x200002ef544c]
[c656f6n01:2537658] [ 6] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x76a4)[0x200002ef76a4]
[c656f6n01:2537658] [ 7] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(ompi_coll_adapt_ibcast+0x12c)[0x200002ef7118]
[c656f6n01:2537658] [ 8] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(ompi_coll_adapt_bcast+0x70)[0x200002ef3a30]
[c656f6n01:2537658] [ 9] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(ompi_coll_base_allgather_intra_basic_linear+0x22c)[0x2000001fc32c]
[c656f6n01:2537658] [10] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(MPI_Allgather+0x3c0)[0x200000129ec4]
[c656f6n01:2537658] [11] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002fc4]
[c656f6n01:2537658] [12] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002814]
[c656f6n01:2537658] [13] /lib64/libc.so.6(+0x24c78)[0x2000003a4c78]
[c656f6n01:2537658] [14] /lib64/libc.so.6(__libc_start_main+0xb4)[0x2000003a4e64]
[c656f6n01:2537658] *** End of error message ***
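Frames 5 and 6 in the traceback show only raw offsets into mca_coll_adapt.so. If that library was built with debug information (an assumption; the report does not say), binutils addr2line may be able to map those offsets to function names and source lines. The sketch below uses a hypothetical helper, frame_to_addr2line, that only prints the addr2line command for a frame of that form; it does not run it.

```shell
#!/bin/sh
# Hypothetical helper: turn a traceback frame of the form
#   ... /path/to/lib.so(+0xOFFSET)[0xADDRESS]
# into an addr2line command line. Frames that already carry a symbol name
# (e.g. "ompi_coll_adapt_ibcast+0x12c") produce no output.
frame_to_addr2line() {
    echo "$1" | sed -n 's|^.*\] \(/[^(]*\)(+\(0x[0-9a-f]*\)).*$|addr2line -f -p -e \1 \2|p'
}

# The two unresolved coll/adapt frames from the traceback above.
frame_to_addr2line '[c656f6n01:2537658] [ 5] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x544c)[0x200002ef544c]'
frame_to_addr2line '[c656f6n01:2537658] [ 6] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x76a4)[0x200002ef76a4]'
```

Running the printed commands on the node where the library is installed would show which functions in coll/adapt sit between ompi_coll_adapt_ibcast and the failing assertion, provided symbols are available.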
The following test cases had this failure:
- test_allgather_uniform_count
- test_allreduce_uniform_count
- test_bcast_uniform_count
- test_reduce_uniform_count
The tests were compiled by running make in the directory containing the source files
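To repeat the run across all four failing tests, a loop along these lines could be used. It is a sketch, not the script used for this report: the helper name bigcount_cmd, the TEST_DIR and RUN variables, and the underscore spelling of the binary names (taken from the mpirun command above) are assumptions. By default it only prints the commands.

```shell
#!/bin/sh
# Values reported in the issue.
BIGCOUNT_HOSTS="-np 3"
BIGCOUNT_MEMORY_PERCENT=70
BIGCOUNT_MEMORY_DIFF=10

# Assumption: directory holding the compiled test binaries. The issue uses
# an absolute path under the reporter's home directory instead.
TEST_DIR="${TEST_DIR:-.}"

# Hypothetical helper: print the mpirun command line for one test binary.
bigcount_cmd() {
    echo "mpirun ${BIGCOUNT_HOSTS}" \
        "-x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT}" \
        "-x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF}" \
        "--mca btl ^openib --mca coll_adapt_priority 100" \
        "--mca coll adapt,basic,sm,self,inter,libnbc" \
        "${TEST_DIR}/$1"
}

# Print the commands by default; set RUN=1 to actually execute them.
for t in test_allgather_uniform_count test_allreduce_uniform_count \
         test_bcast_uniform_count test_reduce_uniform_count; do
    if [ "${RUN:-0}" = "1" ]; then
        # Word splitting is intentional: the string holds separate arguments.
        $(bigcount_cmd "$t") || echo "FAILED: $t"
    else
        bigcount_cmd "$t"
    fi
done
```

Invoking the script with RUN=1 and TEST_DIR pointing at the collective-big-count build directory would execute each test in turn and report any that exit with a non-zero status.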
@drwootton Were any of these issues fixed on main and could be back-ported to the v4.0.x / v4.1.x branches?
@jsquyres I did not see this failure in any tests with the main branch, other than once in test_allgather_uniform_count, so the problem may be fixed in main. I don't see any failures with the main branch for either test_bcast_uniform_count or test_reduce_uniform_count. I see the same (or a very similar) failure for test_reduce_uniform_count in issue #10186. I can't tell whether the problem is fixed for test_allgather_uniform_count, or whether the other failure with that test in main occurs before the code reaches the point where this problem would occur.