sst-elements
Ember: CommSplit event is failing
1 - Detailed description of problem or enhancement
The enQ_commSplit event is not functioning as expected. The subsequent enQ_rank() and enQ_size() calls, used to obtain the rank and size of each process in the new communicator, return stale/wrong values.
2 - Describe how to reproduce
I have included the test motifs used to check the functionality of enQ_commSplit event. The motif puts each process into two different communicators and dumps out the local rank and size of those communicators. I have also included the test configuration script to run this motif. Update the Makefile.am to include the two files ("mpi/motifs/embercommsplittest.h ; mpi/motifs/embercommsplittest.cc" ) before building Ember.
Finally, I have also included a sample MPI program which represents the actual motif for validation. Compiling and running the mpicommsplittest.cc will dump the actual expected output.
Steps to reproduce:
- Update Makefile.am to include embercommsplittest.cc and embercommsplittest.h files.
- make all && make install
- cd tests && sst -v dragon_32_commsplittest.py
- SST/Actual output:
Rank: 31 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1897463408 dense_allreduce_group_size: 1897463472 mp_group_rank: 32592 mp_group_size: 1897463536
Rank: 30 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1718187119 dense_allreduce_group_size: 3440 mp_group_rank: 0 mp_group_size: 80
Rank: 29 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 109 dense_allreduce_group_size: 4048 mp_group_rank: 0 mp_group_size: 80
Rank: 28 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 23 dense_allreduce_group_size: 23 mp_group_rank: 0 mp_group_size: 1953451378
Rank: 27 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1118797312 dense_allreduce_group_size: 0 mp_group_rank: 21958 mp_group_size: 0
Rank: 26 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1107322473 dense_allreduce_group_size: 1109473616 mp_group_rank: 21958 mp_group_size: 81
Rank: 25 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1953451346 dense_allreduce_group_size: 1098640784 mp_group_rank: 26217 mp_group_size: 81
Rank: 24 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1769238349 dense_allreduce_group_size: 1280 mp_group_rank: 102 mp_group_size: 80
Rank: 23 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1711302249 dense_allreduce_group_size: 1110017616 mp_group_rank: 26112 mp_group_size: 81
Rank: 22 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 2077162720 dense_allreduce_group_size: 0 mp_group_rank: 32592 mp_group_size: 0
Rank: 21 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1113782176 dense_allreduce_group_size: 0 mp_group_rank: 21958 mp_group_size: 0
Rank: 20 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1953451380 dense_allreduce_group_size: 1101149040 mp_group_rank: 26217 mp_group_size: 145
Rank: 19 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1797346712 dense_allreduce_group_size: 1797346848 mp_group_rank: 32592 mp_group_size: 1797346984
Rank: 18 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1110162408 dense_allreduce_group_size: 1118800832 mp_group_rank: 21958 mp_group_size: 81
Rank: 17 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 161 dense_allreduce_group_size: 1118800416 mp_group_rank: 0 mp_group_size: 1118805504
Rank: 16 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1718187119 dense_allreduce_group_size: 1099305984 mp_group_rank: 32512 mp_group_size: 81
Rank: 7 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1797259366 dense_allreduce_group_size: 1109398144 mp_group_rank: 32592 mp_group_size: 161
Rank: 6 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 10 dense_allreduce_group_size: 1280524622 mp_group_rank: 0 mp_group_size: 1795188329
Rank: 5 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 161 dense_allreduce_group_size: 1118784752 mp_group_rank: 0 mp_group_size: 2077162720
Rank: 4 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1869182049 dense_allreduce_group_size: 560 mp_group_rank: 29550 mp_group_size: 81
Rank: 3 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 115201 dense_allreduce_group_size: 1696 mp_group_rank: 0 mp_group_size: 80
Rank: 0 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 18 dense_allreduce_group_size: 18 mp_group_rank: 0 mp_group_size: 6711668
Rank: 1 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1118812624 dense_allreduce_group_size: 560 mp_group_rank: 21958 mp_group_size: 80
Rank: 2 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1118784768 dense_allreduce_group_size: 1113049072 mp_group_rank: 21958 mp_group_size: 17
Rank: 8 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1118815344 dense_allreduce_group_size: 1118815312 mp_group_rank: 21958 mp_group_size: 9
Rank: 9 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1797383704 dense_allreduce_group_size: 1797383840 mp_group_rank: 32592 mp_group_size: 1797383976
Rank: 10 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1797420968 dense_allreduce_group_size: 1797421104 mp_group_rank: 32592 mp_group_size: 1797421240
Rank: 11 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1797403560 dense_allreduce_group_size: 560 mp_group_rank: 32592 mp_group_size: 80
Rank: 12 World size: 32 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 0 dense_allreduce_group_size: 1114383536 mp_group_rank: 0 mp_group_size: 20
Rank: 13 World size: 32 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1797458776 dense_allreduce_group_size: 1797458912 mp_group_rank: 32592 mp_group_size: 1797459048
Rank: 14 World size: 32 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1797496040 dense_allreduce_group_size: 1797496176 mp_group_rank: 32592 mp_group_size: 1797496312
Rank: 15 World size: 32 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1797478632 dense_allreduce_group_size: 560 mp_group_rank: 32592 mp_group_size: 80
- Expected output from a real MPI run (obtained by: mpicxx mpicommsplittest.cpp -o commsplittest && mpirun -np 32 commsplittest):
Rank: 2 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 0 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 4 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 1 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 5 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 1 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 12 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 3 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 18 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 4 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 19 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 4 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 21 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 5 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 22 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 5 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 0 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 0 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 1 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 0 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 3 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 0 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 6 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 1 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 7 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 1 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 8 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 2 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 9 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 2 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 10 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 2 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 11 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 2 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 13 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 3 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 14 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 3 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 15 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 3 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 16 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 4 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 17 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 4 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 20 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 5 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 23 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 5 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 24 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 6 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 25 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 6 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 26 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 6 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 27 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 6 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
Rank: 28 dense_allreduce_group_color: 0 dense_allreduce_group_rank: 7 dense_allreduce_group_size: 8 mp_group_rank: 0 mp_group_size: 4
Rank: 29 dense_allreduce_group_color: 1 dense_allreduce_group_rank: 7 dense_allreduce_group_size: 8 mp_group_rank: 1 mp_group_size: 4
Rank: 30 dense_allreduce_group_color: 2 dense_allreduce_group_rank: 7 dense_allreduce_group_size: 8 mp_group_rank: 2 mp_group_size: 4
Rank: 31 dense_allreduce_group_color: 3 dense_allreduce_group_rank: 7 dense_allreduce_group_size: 8 mp_group_rank: 3 mp_group_size: 4
3 - What Operating system(s) and versions
Ubuntu 22.04.3 LTS
dragon_32_commsplittest.txt embercommsplittest.txt embercommsplittest_header.txt mpicommsplittest.txt
Any update on this issue?
I took a quick look at the code. It's been a while since I've worked with Ember and its quirks, but I think what's happening is that the enqueued events haven't actually executed at the time you're printing out the values. The enQ_* functions just add things to a queue of events to execute, and none of those will actually get executed until after the generate function returns. If you move the printout so that it runs the second time generate gets called, all the enqueued events will have run and the values should be correct at that point. You could do the print when m_loopIndex == 1, or you could just put it at the top of the if (m_loopIndex == m_iterations) block, and it will print before the last iteration.