Portals, CMA, and Remote VA

Open jdinan opened this issue 7 years ago • 0 comments

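Building with Portals 4, CMA, and --enable-remote-virtual-addressing is apparently a toxic combination: transport startup fails with "PtlLEAppend of all memory failed: 1" (full output below). If you remove --enable-remote-virtual-addressing, all is well. I'm not sure why this breaks or whether it can actually be fixed.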

$ mpiexec -np 2 -hosts compute-0-1,compute-0-2 -env SHMEM_DEBUG=1 -env SHMEM_INFO=1 test/unit/hello
Sandia OpenSHMEM 1.4.0rc2
  SHMEM_INFO                 1 (type: bool, default: 0)
	Print library information message at startup
  SHMEM_VERSION              0 (type: bool, default: 0)
	Print library version at startup
  SHMEM_DEBUG                1 (type: bool, default: 0)
	Enable debugging messages
  SHMEM_SYMMETRIC_SIZE       536870912 (type: size, default: 536870912)
	Symmetric heap size

Additional options:
  SHMEM_SYMMETRIC_HEAP_USE_HUGE_PAGES 0 (type: bool, default: 0)
	Use Linux huge pages for symmetric heap
  SHMEM_SYMMETRIC_HEAP_PAGE_SIZE 2097152 (type: size, default: 2097152)
	Page size to use for huge pages
  SHMEM_SYMMETRIC_HEAP_USE_MALLOC 0 (type: bool, default: 0)
	Allocate the symmetric heap using malloc
  SHMEM_BOUNCE_SIZE          2048 (type: size, default: 2048)
	Maximum message size to bounce buffer
  SHMEM_MAX_BOUNCE_BUFFERS   128 (type: long, default: 128)
	Maximum number of bounce buffers per context
  SHMEM_TRAP_ON_ABORT        0 (type: bool, default: 0)
	Generate trap if the program aborts or calls shmem_global_exit

Collectives options:
  SHMEM_COLL_CROSSOVER       4 (type: long, default: 4)
	Crossover between linear and tree collectives
  SHMEM_COLL_RADIX           4 (type: long, default: 4)
	Radix for tree-based collectives
  SHMEM_BARRIER_ALGORITHM    auto (type: string, default: auto)
	Algorithm for barrier.  Options are auto, linear, tree, dissem
  SHMEM_BCAST_ALGORITHM      auto (type: string, default: auto)
	Algorithm for broadcast.  Options are auto, linear, tree
  SHMEM_REDUCE_ALGORITHM     auto (type: string, default: auto)
	Algorithm for reductions.  Options are auto, linear, tree, recdbl
  SHMEM_COLLECT_ALGORITHM    auto (type: string, default: auto)
	Algorithm for collect.  Options are auto, linear
  SHMEM_FCOLLECT_ALGORITHM   auto (type: string, default: auto)
	Algorithm for fcollect.  Options are auto, linear, ring, recdbl

Network transport: Portals 4

On-node transport: Linux CMA
  SHMEM_CMA_PUT_MAX          8192 (type: size, default: 8192)
	Size below which to use CMA for puts
  SHMEM_CMA_GET_MAX          16384 (type: size, default: 16384)
	Size below which to use CMA for gets

Build information:
  Git Version           v1.4.0rc2-10-g0da87837 (pr/fix-cma)
  Configure Args        '--prefix=/home/dinanjam/opt/sos-portals'
                        '--with-portals4=/home/dinanjam/opt/portals4'
                        '--enable-pmi-simple' '--disable-cxx' '--enable-debug'
                        '--enable-picky' '--enable-error-checking'
                        '--enable-remote-virtual-addressing' '--with-cma'
  Build Date            Tue Jan 30 16:13:22 EST 2018
  Build CC              gcc -std=gnu99
  Build CFLAGS          -g -O2 -Wall -Wno-long-long -Wmissing-prototypes
                        -Wstrict-prototypes -Wcomment -pedantic -g
                        -fvisibility=hidden

[0000] DEBUG: ../../src/init.c:247: shmem_internal_init
[0000]        Sym. heap=0x80000000 len=537919488 -- data=0x600da8 len=24
[0001] DEBUG: ../../src/init.c:247: shmem_internal_init
[0001]        Sym. heap=0x80000000 len=537919488 -- data=0x600da8 len=24
[0000] ERROR: ../../src/transport_portals4.c:535: shmem_transport_startup
[0000]        PtlLEAppend of all memory failed: 1
[0000] ERROR: ../../src/init.c:291: shmem_internal_init
[0000]        Transport startup failed (1)
[0000] WARN:  ../../src/transport_portals4.c:670: shmem_transport_fini
[0000]        put count mismatch: 0, 140733616661616
[0000] WARN:  ../../src/transport_portals4.c:672: shmem_transport_fini
[0000]        put operations failed: 99
[0000] WARN:  ../../src/transport_portals4.c:678: shmem_transport_fini
[0000]        get count mismatch: 0, 140733616661616
[0000] WARN:  ../../src/transport_portals4.c:680: shmem_transport_fini
[0000]        get operations failed: 99
[0001] ERROR: ../../src/transport_portals4.c:535: shmem_transport_startup
[0001]        PtlLEAppend of all memory failed: 1
[0001] ERROR: ../../src/init.c:291: shmem_internal_init
[0001]        Transport startup failed (1)
[0001] WARN:  ../../src/transport_portals4.c:670: shmem_transport_fini
[0001]        put count mismatch: 0, 140735686053520
[0001] WARN:  ../../src/transport_portals4.c:672: shmem_transport_fini
[0001]        put operations failed: 99
[0001] WARN:  ../../src/transport_portals4.c:678: shmem_transport_fini
[0001]        get count mismatch: 0, 140735686053520
[0001] WARN:  ../../src/transport_portals4.c:680: shmem_transport_fini
[0001]        get operations failed: 99

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 18858 RUNNING AT compute-0-1
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:[email protected]] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:[email protected]] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:[email protected]] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[[email protected]] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[[email protected]] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[[email protected]] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[[email protected]] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
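
For reference, a rough sketch of the workaround build mentioned at the top. The flags are copied from the Configure Args listing in the output above, with --enable-remote-virtual-addressing dropped and nothing else changed. The location of the configure script and the make steps are assumptions about a typical autotools flow, not something taken from the output, and the paths are specific to the machine in the report.

```shell
# Sketch only: reconfigure without remote virtual addressing.
# Flags come from the "Configure Args" lines in the output above;
# --enable-remote-virtual-addressing is the only one removed.
# Assumption: run from the directory that holds the configure script
# (adjust the path if building out of tree).
./configure --prefix=/home/dinanjam/opt/sos-portals \
            --with-portals4=/home/dinanjam/opt/portals4 \
            --enable-pmi-simple --disable-cxx --enable-debug \
            --enable-picky --enable-error-checking \
            --with-cma

# Assumption: standard autotools build and install steps.
make && make install
```

After rebuilding, rerunning the same mpiexec command as above with SHMEM_INFO=1 should show a Configure Args list without --enable-remote-virtual-addressing, which confirms the test is using the new build.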

jdinan, Jan 30 '18 21:01