PU binding is inconsistent

biddisco opened this issue 5 years ago • 5 comments

Expected Behavior

Binding of cores should be identical on successive runs with unchanged parameters.

Actual Behavior

What I actually get appears to be semi-random and changes from run to run:

jb-s76:~/build/hpx-master$ bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1   --hpx:agas=localhost:7910 --hpx:hpx=localhost:7911 --hpx:localities=3 --hpx:worker   -Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric   --hpx:threads=2 --hpx:high-priority-threads=1   --hpx:bind=thread:0-1=numanode:0.core:2.pu:0-1 --hpx:print-bind
*******************************************************************************
locality: 1
   0: PU L#0(P#0), Core L#0(P#0), Socket L#0(P#0), on pool "default"
   1: PU L#1(P#4), Core L#0(P#0), Socket L#0(P#0), on pool "default"
hello world from OS-thread 01 on locality 0001 rank 0001 hostname jb-s76:127.0.0.#1
jb-s76:~/build/hpx-master$ bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1   --hpx:agas=localhost:7910 --hpx:hpx=localhost:7911 --hpx:localities=3 --hpx:worker   -Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric   --hpx:threads=2 --hpx:high-priority-threads=1   --hpx:bind=thread:0-1=numanode:0.core:2.pu:0-1 --hpx:print-bind
*******************************************************************************
locality: 2
   0: PU L#4(P#2), Core L#2(P#2), Socket L#0(P#0), on pool "default"
   1: PU L#5(P#6), Core L#2(P#2), Socket L#0(P#0), on pool "default"
hello world from OS-thread 01 on locality 0002 rank 0002 hostname jb-s76:127.0.0.#2
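
For reference, the OS-level affinity can also be cross-checked from outside HPX while the processes are running; a minimal sketch, assuming Linux and that pgrep -f matches only these network_storage processes (it reports the kernel affinity mask of every thread, i.e. what the scheduler actually enforces):

# dump the affinity of every thread in each running network_storage process
for pid in $(pgrep -f network_storage); do
  echo "=== pid $pid ==="
  for tid in /proc/$pid/task/*; do
    taskset -cp "${tid##*/}"    # prints "pid <tid>'s current affinity list: ..."
  done
done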

biddisco · Mar 26 '19 22:03

@biddisco Did you run those localities on the same physical node?

hkaiser · Mar 26 '19 22:03

I am running 3 localities on the same node as shown below, but each locality should bind to its own core and PUs. What I see is that the processes bind to the same cores (instead of different ones) and do not bind to the same ones on successive runs.

# AGAS server
bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1 \
  --hpx:agas=localhost:7910 --hpx:hpx=localhost:7910 --hpx:localities=3 \
  -Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric \
  --hpx:threads=2 --hpx:high-priority-threads=1 \
  --hpx:bind=thread:0-1=numanode:0.core:2.pu:0-1 --hpx:print-bind

# worker 0
bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1 \
  --hpx:agas=localhost:7910 --hpx:hpx=localhost:7911 --hpx:localities=3 --hpx:worker \
  -Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric \
  --hpx:threads=2 --hpx:high-priority-threads=1 \
  --hpx:bind=thread:0-1=numanode:0.core:0.pu:0-1 --hpx:print-bind
 
# worker 1
bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1 \
  --hpx:agas=localhost:7910 --hpx:hpx=localhost:7912 --hpx:localities=3 --hpx:worker \
  -Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric \
  --hpx:threads=2 --hpx:high-priority-threads=1 \
  --hpx:bind=thread:0-1=numanode:0.core:1.pu:0-1 --hpx:print-bind

Note that the locality number can vary between runs, but that's just a racy manifestation of which process gets assigned which rank at startup.
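
To make run-to-run comparison less fiddly, the three launches above can be wrapped in a small script that just collects the --hpx:print-bind output; a rough sketch reusing the exact binary, ports, and bind specs from above (the log file names are only for illustration):

#!/bin/bash
# start the AGAS server and two workers in the background, capture their binding reports
COMMON="--iterations=20 --localMB=8 --transferKB=1 --no-local=1 \
  -Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric \
  --hpx:threads=2 --hpx:high-priority-threads=1 --hpx:localities=3 --hpx:print-bind"

bin/network_storage $COMMON --hpx:agas=localhost:7910 --hpx:hpx=localhost:7910 \
  --hpx:bind=thread:0-1=numanode:0.core:2.pu:0-1 > bind.agas.log 2>&1 &
bin/network_storage $COMMON --hpx:worker --hpx:agas=localhost:7910 --hpx:hpx=localhost:7911 \
  --hpx:bind=thread:0-1=numanode:0.core:0.pu:0-1 > bind.worker0.log 2>&1 &
bin/network_storage $COMMON --hpx:worker --hpx:agas=localhost:7910 --hpx:hpx=localhost:7912 \
  --hpx:bind=thread:0-1=numanode:0.core:1.pu:0-1 > bind.worker1.log 2>&1 &
wait

# these lines should be identical across repeated invocations of the script
grep -h "PU L#" bind.*.log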

biddisco · Mar 26 '19 22:03

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] · Sep 22 '19 23:09

Keeping this alive as a reminder to myself to look at it again

biddisco · Sep 27 '19 09:09

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] · Mar 25 '20 10:03