SOS
Portals, CMA, and Remote VA
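This is apparently a toxic combination: SOS built with the Portals 4 transport, CMA (--with-cma), and --enable-remote-virtual-addressing fails at startup with "PtlLEAppend of all memory failed" on both PEs. If you remove --enable-remote-virtual-addressing from the configure arguments, all is well. I'm not sure why this breaks or whether it can actually be fixed.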
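For reference, the workaround would look roughly like the following. This is a sketch, not a verified recipe: the argument list is copied from the "Configure Args" section of the startup output with the one flag dropped, and the `../configure` path is an assumption (an out-of-tree build directory one level below the source root).

```shell
# Workaround sketch: same configure arguments as the failing build,
# minus --enable-remote-virtual-addressing.
# Assumption: run from a build directory one level below the source root.
../configure --prefix=/home/dinanjam/opt/sos-portals \
    --with-portals4=/home/dinanjam/opt/portals4 \
    --enable-pmi-simple --disable-cxx --enable-debug \
    --enable-picky --enable-error-checking \
    --with-cma
```

The log that follows is from the failing build, with the flag still enabled.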
$ mpiexec -np 2 -hosts compute-0-1,compute-0-2 -env SHMEM_DEBUG=1 -env SHMEM_INFO=1 test/unit/hello
Sandia OpenSHMEM 1.4.0rc2
SHMEM_INFO 1 (type: bool, default: 0)
Print library information message at startup
SHMEM_VERSION 0 (type: bool, default: 0)
Print library version at startup
SHMEM_DEBUG 1 (type: bool, default: 0)
Enable debugging messages
SHMEM_SYMMETRIC_SIZE 536870912 (type: size, default: 536870912)
Symmetric heap size
Additional options:
SHMEM_SYMMETRIC_HEAP_USE_HUGE_PAGES 0 (type: bool, default: 0)
Use Linux huge pages for symmetric heap
SHMEM_SYMMETRIC_HEAP_PAGE_SIZE 2097152 (type: size, default: 2097152)
Page size to use for huge pages
SHMEM_SYMMETRIC_HEAP_USE_MALLOC 0 (type: bool, default: 0)
Allocate the symmetric heap using malloc
SHMEM_BOUNCE_SIZE 2048 (type: size, default: 2048)
Maximum message size to bounce buffer
SHMEM_MAX_BOUNCE_BUFFERS 128 (type: long, default: 128)
Maximum number of bounce buffers per context
SHMEM_TRAP_ON_ABORT 0 (type: bool, default: 0)
Generate trap if the program aborts or calls shmem_global_exit
Collectives options:
SHMEM_COLL_CROSSOVER 4 (type: long, default: 4)
Crossover between linear and tree collectives
SHMEM_COLL_RADIX 4 (type: long, default: 4)
Radix for tree-based collectives
SHMEM_BARRIER_ALGORITHM auto (type: string, default: auto)
Algorithm for barrier. Options are auto, linear, tree, dissem
SHMEM_BCAST_ALGORITHM auto (type: string, default: auto)
Algorithm for broadcast. Options are auto, linear, tree
SHMEM_REDUCE_ALGORITHM auto (type: string, default: auto)
Algorithm for reductions. Options are auto, linear, tree, recdbl
SHMEM_COLLECT_ALGORITHM auto (type: string, default: auto)
Algorithm for collect. Options are auto, linear
SHMEM_FCOLLECT_ALGORITHM auto (type: string, default: auto)
Algorithm for fcollect. Options are auto, linear, ring, recdbl
Network transport: Portals 4
On-node transport: Linux CMA
SHMEM_CMA_PUT_MAX 8192 (type: size, default: 8192)
Size below which to use CMA for puts
SHMEM_CMA_GET_MAX 16384 (type: size, default: 16384)
Size below which to use CMA for gets
Build information:
Git Version v1.4.0rc2-10-g0da87837 (pr/fix-cma)
Configure Args '--prefix=/home/dinanjam/opt/sos-portals'
'--with-portals4=/home/dinanjam/opt/portals4'
'--enable-pmi-simple' '--disable-cxx' '--enable-debug'
'--enable-picky' '--enable-error-checking'
'--enable-remote-virtual-addressing' '--with-cma'
Build Date Tue Jan 30 16:13:22 EST 2018
Build CC gcc -std=gnu99
Build CFLAGS -g -O2 -Wall -Wno-long-long -Wmissing-prototypes
-Wstrict-prototypes -Wcomment -pedantic -g
-fvisibility=hidden
[0000] DEBUG: ../../src/init.c:247: shmem_internal_init
[0000] Sym. heap=0x80000000 len=537919488 -- data=0x600da8 len=24
[0001] DEBUG: ../../src/init.c:247: shmem_internal_init
[0001] Sym. heap=0x80000000 len=537919488 -- data=0x600da8 len=24
[0000] ERROR: ../../src/transport_portals4.c:535: shmem_transport_startup
[0000] PtlLEAppend of all memory failed: 1
[0000] ERROR: ../../src/init.c:291: shmem_internal_init
[0000] Transport startup failed (1)
[0000] WARN: ../../src/transport_portals4.c:670: shmem_transport_fini
[0000] put count mismatch: 0, 140733616661616
[0000] WARN: ../../src/transport_portals4.c:672: shmem_transport_fini
[0000] put operations failed: 99
[0000] WARN: ../../src/transport_portals4.c:678: shmem_transport_fini
[0000] get count mismatch: 0, 140733616661616
[0000] WARN: ../../src/transport_portals4.c:680: shmem_transport_fini
[0000] get operations failed: 99
[0001] ERROR: ../../src/transport_portals4.c:535: shmem_transport_startup
[0001] PtlLEAppend of all memory failed: 1
[0001] ERROR: ../../src/init.c:291: shmem_internal_init
[0001] Transport startup failed (1)
[0001] WARN: ../../src/transport_portals4.c:670: shmem_transport_fini
[0001] put count mismatch: 0, 140735686053520
[0001] WARN: ../../src/transport_portals4.c:672: shmem_transport_fini
[0001] put operations failed: 99
[0001] WARN: ../../src/transport_portals4.c:678: shmem_transport_fini
[0001] get count mismatch: 0, 140735686053520
[0001] WARN: ../../src/transport_portals4.c:680: shmem_transport_fini
[0001] get operations failed: 99
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18858 RUNNING AT compute-0-1
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:[email protected]] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:[email protected]] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:[email protected]] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[[email protected]] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[[email protected]] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[[email protected]] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[[email protected]] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
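One more observation on the log: the "count mismatch" values printed by shmem_transport_fini look more like addresses than operation counts. A quick way to check is to print them in hexadecimal; `to_hex` below is just a hypothetical helper name for this sketch.

```shell
# Print a decimal value from the log in hexadecimal.
# to_hex is a hypothetical helper, not part of SOS.
to_hex() {
    printf '0x%x\n' "$1"
}

to_hex 140733616661616   # PE 0 put/get count mismatch value
to_hex 140735686053520   # PE 1 put/get count mismatch value
```

Both values come out starting with 0x7fff, which is where x86-64 Linux typically places the process stack. That may mean the counters were never set up because shmem_transport_startup failed before shmem_transport_fini ran, so the warnings may be a side effect of the failed startup rather than a separate problem. This is only a guess.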