PU binding is inconsistent
Expected Behavior
Binding of cores should be the same on successive runs with unchanged parameters.
Actual Behavior
What I actually get appears to be semi-random and changes from run to run:
jb-s76:~/build/hpx-master$ bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1 \
    --hpx:agas=localhost:7910 --hpx:hpx=localhost:7911 --hpx:localities=3 --hpx:worker \
    -Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric \
    --hpx:threads=2 --hpx:high-priority-threads=1 \
    --hpx:bind=thread:0-1=numanode:0.core:2.pu:0-1 --hpx:print-bind
*******************************************************************************
locality: 1
0: PU L#0(P#0), Core L#0(P#0), Socket L#0(P#0), on pool "default"
1: PU L#1(P#4), Core L#0(P#0), Socket L#0(P#0), on pool "default"
hello world from OS-thread 01 on locality 0001 rank 0001 hostname jb-s76:127.0.0.#1
jb-s76:~/build/hpx-master$ bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1 \
    --hpx:agas=localhost:7910 --hpx:hpx=localhost:7911 --hpx:localities=3 --hpx:worker \
    -Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric \
    --hpx:threads=2 --hpx:high-priority-threads=1 \
    --hpx:bind=thread:0-1=numanode:0.core:2.pu:0-1 --hpx:print-bind
*******************************************************************************
locality: 2
0: PU L#4(P#2), Core L#2(P#2), Socket L#0(P#0), on pool "default"
1: PU L#5(P#6), Core L#2(P#2), Socket L#0(P#0), on pool "default"
hello world from OS-thread 01 on locality 0002 rank 0002 hostname jb-s76:127.0.0.#2
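Note that both runs used the same bind spec requesting core 2, yet the first run landed on core 0 and the second on core 2. The kernel-visible affinity can be cross-checked independently of --hpx:print-bind while the localities are running (a sketch using standard hwloc and util-linux tools; the pgrep pattern just matches the binary name above):

# show hwloc bindings of bound processes, including their threads
hwloc-ps -t

# or query the kernel affinity mask of each locality directly
for pid in $(pgrep -f network_storage); do taskset -cp "$pid"; done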
@biddisco Did you run those localities on the same physical node?
I am running 3 localities on the same node as shown below, but each locality should bind to its own core and PUs: a spec such as thread:0-1=numanode:0.core:2.pu:0-1 asks for OS threads 0 and 1 to be bound to PUs 0 and 1 of core 2 on NUMA node 0. What I see instead is that the processes bind to the same cores (rather than different ones) and do not bind to the same ones on successive runs. (A combined launcher sketch follows the three commands.)
# agas server
bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1 \
--hpx:agas=localhost:7910 --hpx:hpx=localhost:7910 --hpx:localities=3 \
-Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric \
--hpx:threads=2 --hpx:high-priority-threads=1 \
--hpx:bind=thread:0-1=numanode:0.core:2.pu:0-1 --hpx:print-bind
# worker 0
bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1 \
--hpx:agas=localhost:7910 --hpx:hpx=localhost:7911 --hpx:localities=3 --hpx:worker \
-Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric \
--hpx:threads=2 --hpx:high-priority-threads=1 \
--hpx:bind=thread:0-1=numanode:0.core:0.pu:0-1 --hpx:print-bind
# worker 1
bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1 \
--hpx:agas=localhost:7910 --hpx:hpx=localhost:7912 --hpx:localities=3 --hpx:worker \
-Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric \
--hpx:threads=2 --hpx:high-priority-threads=1 \
--hpx:bind=thread:0-1=numanode:0.core:1.pu:0-1 --hpx:print-bind
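For reproducibility, the three launches can be wrapped in one loop (a sketch; the binary, ports, and options are copied from the commands above; locality i is pinned to core i, a different but equally disjoint core assignment from the one above, and locality 0 doubles as the AGAS server):

# sketch: launch the AGAS server (i=0) plus two workers, one core per locality
for i in 0 1 2; do
    worker=""
    [ "$i" -ne 0 ] && worker="--hpx:worker"
    bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1 \
        --hpx:agas=localhost:7910 --hpx:hpx=localhost:$((7910 + i)) --hpx:localities=3 $worker \
        -Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric \
        --hpx:threads=2 --hpx:high-priority-threads=1 \
        --hpx:bind=thread:0-1=numanode:0.core:$i.pu:0-1 --hpx:print-bind &
done
wait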
Note that the locality number can vary between runs, but that's just a racy manifestation of which process gets assigned which rank at startup.
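If deterministic locality numbering matters while debugging, HPX's manual startup also accepts --hpx:node to pin the node number explicitly (a sketch, untested with the libfabric bootstrap used here; everything else is copied from worker 0 above):

# sketch: pin this worker to locality/node 1 regardless of startup order
bin/network_storage --iterations=20 --localMB=8 --transferKB=1 --no-local=1 \
    --hpx:agas=localhost:7910 --hpx:hpx=localhost:7911 --hpx:localities=3 --hpx:worker \
    --hpx:node=1 \
    -Ihpx.parcel.libfabric.enable=1 -Ihpx.parcel.bootstrap=libfabric \
    --hpx:threads=2 --hpx:high-priority-threads=1 \
    --hpx:bind=thread:0-1=numanode:0.core:0.pu:0-1 --hpx:print-bind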
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Keeping this alive as a reminder to myself to look at it again