mercury does not appear to understand 'mrail' protocol
Describe the bug
I am unable to request the 'mrail' libfabric provider from Mercury (master).
To Reproduce
I have tried the margo-p2p-bw test with the following network strings:
mpiexec -f hostfile -launcher ssh -ppn 1 -n 2 ./margo-p2p-bw -x 13072 -n 'mrail://' -c 4 -D 10
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
# na_ofi_check_protocol(): Protocol mrail not supported
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
# NA_Initialize_opt(): Specified class name does not support requested protocol
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
# hg_core_init(): Could not initialize NA class
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
# HG_Core_init_opt(): Cannot initialize HG core layer
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
# HG_Init_opt(): Could not create HG core class
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
# na_ofi_check_protocol(): Protocol mrail not supported
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
# NA_Initialize_opt(): Specified class name does not support requested protocol
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
# hg_core_init(): Could not initialize NA class
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
# HG_Core_init_opt(): Cannot initialize HG core layer
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
# HG_Init_opt(): Could not create HG core class
Ok, let's try explicitly requesting OFI:
mpiexec -f hostfile -launcher ssh -ppn 1 -n 2 ./margo-p2p-bw -x 13072 -n 'ofi+mrail://' -c 4 -D 10
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
# na_ofi_check_protocol(): Protocol mrail not supported
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
# NA_Initialize_opt(): Specified class name does not support requested protocol
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
# hg_core_init(): Could not initialize NA class
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
# HG_Core_init_opt(): Cannot initialize HG core layer
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
# HG_Init_opt(): Could not create HG core class
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
# na_ofi_check_protocol(): Protocol mrail not supported
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
# NA_Initialize_opt(): Specified class name does not support requested protocol
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
# hg_core_init(): Could not initialize NA class
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
# HG_Core_init_opt(): Cannot initialize HG core layer
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
# HG_Init_opt(): Could not create HG core class
Expected behavior
Does Mercury need to know about every possible libfabric provider? I see configuration for verbs and gni, but requiring per-provider knowledge seems like a pretty major abstraction violation.
Platform (please complete the following information):
- ORNL Summit
- gcc-9.1
- ofi, attempting to use the 'mrail' provider
- libfabric-1.8.1
Additional context
There is a big x-macro that enumerates all of the OFI providers that Mercury supports in the code here:
https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c#L114
... and yes, as it stands right now Mercury will only run atop things that it can find in the array of config structs that macro generates.
Philosophically it would be nice if Mercury would run atop any provider transparently, but Mercury takes a bunch of different strategies depending on what capabilities are likely to work in each one.
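For anyone unfamiliar with the x-macro pattern, here is a minimal sketch of how one list can expand into both an enum and an array of config structs, and why a name missing from the list fails the lookup. Everything in it (the `demo_` names, the strategy values, and which provider gets which strategy) is made up for illustration and is not copied from `na_ofi.c`.

```c
#include <string.h>

/* Hypothetical per-provider strategies; the real code tracks different things. */
#define DEMO_STRATEGY_DEFAULT 0u
#define DEMO_STRATEGY_ALT     1u

/* One X(...) entry per supported provider. Illustrative values only. */
#define DEMO_PROV_TYPES                                  \
    X(DEMO_PROV_SOCKETS, "sockets", DEMO_STRATEGY_DEFAULT) \
    X(DEMO_PROV_VERBS,   "verbs",   DEMO_STRATEGY_ALT)     \
    X(DEMO_PROV_GNI,     "gni",     DEMO_STRATEGY_ALT)

/* First expansion: an enum of provider ids, terminated by DEMO_PROV_MAX. */
#define X(id, name, strategy) id,
enum demo_prov_type { DEMO_PROV_TYPES DEMO_PROV_MAX };
#undef X

struct demo_prov_config {
    const char *name;
    unsigned int strategy;
};

/* Second expansion: a config array indexed by the enum above. */
#define X(id, name, strategy) { name, strategy },
static const struct demo_prov_config demo_prov_configs[] = { DEMO_PROV_TYPES };
#undef X

/* Return the id for a provider name, or DEMO_PROV_MAX if it is not listed. */
static enum demo_prov_type demo_prov_lookup(const char *name)
{
    for (int i = 0; i < DEMO_PROV_MAX; i++) {
        if (strcmp(demo_prov_configs[i].name, name) == 0)
            return (enum demo_prov_type) i;
    }
    return DEMO_PROV_MAX;
}
```

With a table like this, `demo_prov_lookup("mrail")` returns `DEMO_PROV_MAX`, which is the kind of result that would lead to a "protocol not supported" error.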
Maybe we could have a fallback that just tries its best if it's given an ofi+
I opened this issue for the philosophical point, but in this specific case it looks like mrail requires a lot of legwork to use.
I think it should be feasible to simply default to whatever OFI returns and just have a warning printed in that case with some information so that we have a chance to know what we are using at least :)
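A rough sketch of what that fallback could look like, assuming a simple name lookup: known providers resolve as before, and anything else is accepted with a warning instead of an error. The `demo_` names and the `known` field are hypothetical. A real version would presumably take its defaults from what libfabric reports (for example via `fi_getinfo()`), which this sketch does not attempt.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical config record; "known" marks providers with tuned settings. */
struct demo_config {
    const char *name;
    int known;
};

/* Illustrative list of providers that have explicit configuration. */
static const struct demo_config demo_known[] = {
    { "sockets", 1 },
    { "verbs", 1 },
    { "gni", 1 },
};

/* Fill *out for prov_name. Returns 1 if the provider is in the known list,
 * 0 if the generic fallback was used (after printing a warning). */
static int demo_resolve(const char *prov_name, struct demo_config *out)
{
    size_t n = sizeof(demo_known) / sizeof(demo_known[0]);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(demo_known[i].name, prov_name) == 0) {
            *out = demo_known[i];
            return 1;
        }
    }

    /* Unknown provider: warn so the user can tell what is being used. */
    fprintf(stderr,
        "warning: provider '%s' has no explicit configuration, "
        "falling back to generic defaults\n", prov_name);
    out->name = prov_name;
    out->known = 0;
    return 0;
}
```

The return value lets the caller decide whether to log extra detail about the provider that was actually selected.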