Assertion `reserve > 0' failed running collective-big-count tests using v4.1.x branch and --mca coll adapt,basic,sm,self,inter,libnbc option

Open drwootton opened this issue 3 years ago • 2 comments

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

OpenMPI v4.1.x branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from current v4.1.x branch (3/22/22)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

git submodule status does not display anything.

Please describe the system on which you are running

  • Operating system/version: RHEL 8.4
  • Computer hardware: Single Power8 node
  • Network type: Localhost

Details of the problem

I ran the set of self-checking tests from ompi-tests-public/collective-big-count with collective components specified as --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc

The following environment variables were set for all tests:

  • BIGCOUNT_HOSTS: -np 3
  • BIGCOUNT_MEMORY_PERCENT: 70
  • BIGCOUNT_MEMORY_DIFF: 10

For instance, I ran this command:

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count

The command failed with this assertion and traceback:

test_allgather_uniform_count: ../../../../ompi/mca/coll/base/coll_base_util.h:73: ompi_coll_base_nbc_reserve_tags: Assertion `reserve > 0' failed.
[c656f6n01:2537658] *** Process received signal ***
[c656f6n01:2537658] Signal: Aborted (6)
[c656f6n01:2537658] Signal code:  (-6)
[c656f6n01:2537658] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:2537658] [ 1] /lib64/libc.so.6(gsignal+0xd8)[0x2000003c44d8]
[c656f6n01:2537658] [ 2] /lib64/libc.so.6(abort+0x164)[0x2000003a462c]
[c656f6n01:2537658] [ 3] /lib64/libc.so.6(+0x37c70)[0x2000003b7c70]
[c656f6n01:2537658] [ 4] /lib64/libc.so.6(__assert_fail+0x64)[0x2000003b7d14]
[c656f6n01:2537658] [ 5] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x544c)[0x200002ef544c]
[c656f6n01:2537658] [ 6] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x76a4)[0x200002ef76a4]
[c656f6n01:2537658] [ 7] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(ompi_coll_adapt_ibcast+0x12c)[0x200002ef7118]
[c656f6n01:2537658] [ 8] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(ompi_coll_adapt_bcast+0x70)[0x200002ef3a30]
[c656f6n01:2537658] [ 9] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(ompi_coll_base_allgather_intra_basic_linear+0x22c)[0x2000001fc32c]
[c656f6n01:2537658] [10] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(MPI_Allgather+0x3c0)[0x200000129ec4]
[c656f6n01:2537658] [11] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002fc4]
[c656f6n01:2537658] [12] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002814]
[c656f6n01:2537658] [13] /lib64/libc.so.6(+0x24c78)[0x2000003a4c78]
[c656f6n01:2537658] [14] /lib64/libc.so.6(__libc_start_main+0xb4)[0x2000003a4e64]
[c656f6n01:2537658] *** End of error message ***
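
Frames 5 and 6 in the traceback show only raw offsets into mca_coll_adapt.so. One way to map them to source lines is addr2line from binutils. The sketch below only prints the commands to run; it assumes binutils is installed and that the library was built with debug information, neither of which is stated in this report. The `frames_to_addr2line` helper is a hypothetical name used for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read traceback lines on stdin and print one
# addr2line command for every frame that carries only a "+0x..." offset
# into its module (frames that already name a symbol are skipped).
frames_to_addr2line() {
    sed -n 's|^.*\] \(/[^ (]*\)(+\(0x[0-9a-f]*\))\[0x[0-9a-f]*\]$|addr2line -f -C -e \1 \2|p'
}

# Feed it the two unresolved coll/adapt frames from the traceback above.
frames_to_addr2line <<'EOF'
[c656f6n01:2537658] [ 5] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x544c)[0x200002ef544c]
[c656f6n01:2537658] [ 6] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x76a4)[0x200002ef76a4]
EOF
```

Running the printed commands on the node where the library is installed should name the functions behind those two frames, provided the symbols are available.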

The following test cases had this failure:

  • test_allgather_uniform_count
  • test_allreduce_uniform_count
  • test_bcast_uniform_count
  • test_reduce_uniform_count

The tests were compiled by running make in the directory containing the source files.
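
The four failing runs can be repeated in one pass with a small wrapper. This is a minimal sketch using the settings and mpirun options from this report. It assumes the test binaries are in the current directory after make and that all four use the underscore naming seen in the command above. `TEST_DIR`, `DRY_RUN`, and `bigcount_cmd` are hypothetical names introduced here, not part of the test suite.

```shell
#!/bin/sh
# Settings taken from the report above.
BIGCOUNT_HOSTS="-np 3"
BIGCOUNT_MEMORY_PERCENT=70
BIGCOUNT_MEMORY_DIFF=10

# Assumption: the test binaries live here after running make.
TEST_DIR="${TEST_DIR:-.}"

# Hypothetical helper: print the mpirun command line for one test binary.
bigcount_cmd() {
    echo "mpirun ${BIGCOUNT_HOSTS}" \
        "-x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT}" \
        "-x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF}" \
        "--mca btl ^openib" \
        "--mca coll_adapt_priority 100" \
        "--mca coll adapt,basic,sm,self,inter,libnbc" \
        "${TEST_DIR}/$1"
}

for t in test_allgather_uniform_count test_allreduce_uniform_count \
         test_bcast_uniform_count test_reduce_uniform_count; do
    cmd=$(bigcount_cmd "$t")
    echo "+ $cmd"
    # The default is a dry run that only prints; set DRY_RUN=0 to launch.
    if [ "${DRY_RUN:-1}" = "0" ]; then
        # Word splitting is intended: no argument contains spaces or globs.
        $cmd || echo "FAILED: $t"
    fi
done
```

By default the script only prints the four command lines, so it can be checked before anything is launched; running it with `DRY_RUN=0` executes each test and reports any that exit with a non-zero status.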

drwootton, Apr 05 '22 17:04

@drwootton Were any of these issues fixed on main, and could the fixes be back-ported to the v4.0.x / v4.1.x branches?

jsquyres, Apr 11 '22 14:04

@jsquyres I did not see this failure in any tests with the main branch, other than once in test_allgather_uniform_count, so the problem may be fixed in main. I don't see any failures with the main branch for either test_bcast_uniform_count or test_reduce_uniform_count. I see the same (or a very similar) failure for test_reduce_uniform_count in issue #10186. I can't tell whether the problem is fixed for test_allgather_uniform_count, or whether the other failure with that test in main occurs before the code reaches the point where this problem shows up.

drwootton, Apr 11 '22 19:04