mercury does not appear to understand 'mrail' protocol

Open roblatham00 opened this issue 5 years ago • 3 comments

Describe the bug

I am unable to request the 'mrail' libfabric provider from mercury (master).

To Reproduce

I have tried the margo-p2p-bw test with the following network strings:

    mpiexec -f hostfile -launcher ssh -ppn 1 -n 2 ./margo-p2p-bw -x 13072 -n 'mrail://' -c 4 -D 10
    # NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
     # na_ofi_check_protocol(): Protocol mrail not supported
    # NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
     # NA_Initialize_opt(): Specified class name does not support requested protocol
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
     # hg_core_init(): Could not initialize NA class
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
     # HG_Core_init_opt(): Cannot initialize HG core layer
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
     # HG_Init_opt(): Could not create HG core class
    # NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
     # na_ofi_check_protocol(): Protocol mrail not supported
    # NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
     # NA_Initialize_opt(): Specified class name does not support requested protocol
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
     # hg_core_init(): Could not initialize NA class
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
     # HG_Core_init_opt(): Cannot initialize HG core layer
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
     # HG_Init_opt(): Could not create HG core class

Ok, let's try explicitly requesting OFI:

    mpiexec -f hostfile -launcher ssh -ppn 1 -n 2 ./margo-p2p-bw -x 13072 -n 'ofi+mrail://' -c 4 -D 10
    # NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
     # na_ofi_check_protocol(): Protocol mrail not supported
    # NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
     # NA_Initialize_opt(): Specified class name does not support requested protocol
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
     # hg_core_init(): Could not initialize NA class
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
     # HG_Core_init_opt(): Cannot initialize HG core layer
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
     # HG_Init_opt(): Could not create HG core class
    # NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
     # na_ofi_check_protocol(): Protocol mrail not supported
    # NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
     # NA_Initialize_opt(): Specified class name does not support requested protocol
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
     # hg_core_init(): Could not initialize NA class
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
     # HG_Core_init_opt(): Cannot initialize HG core layer
    # HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
     # HG_Init_opt(): Could not create HG core class

Expected behavior

Does mercury need to know about every possible libfabric provider? I see configuration for verbs and gni, but hard-coding the provider list seems like a pretty major abstraction violation.

Platform (please complete the following information):

  • ORNL Summit
  • gcc-9.1
  • ofi, attempting to use the 'mrail' provider
  • libfabric-1.8.1

roblatham00, Nov 22 '19 21:11

There is a big x-macro that enumerates all of the OFI providers that Mercury supports in the code here:

https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c#L114

... and yes, as it stands right now Mercury will only run atop things that it can find in the array of config structs that macro generates.
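For readers unfamiliar with the pattern, here is a minimal sketch of how an x-macro can generate both an enum and an array of config structs, and why an unlisted name such as `mrail` is then rejected. Everything in it (`PROVIDER_TABLE`, `provider_config`, `provider_lookup`, the `wait_fd_supported` field, and the values in the table) is hypothetical and chosen for illustration; it is not copied from `na_ofi.c`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical provider list. Each row is: enum id, protocol string,
 * one made-up capability flag. Not Mercury's actual table. */
#define PROVIDER_TABLE(X)          \
    X(PROV_SOCKETS, "sockets", 1)  \
    X(PROV_VERBS,   "verbs",   0)  \
    X(PROV_GNI,     "gni",     0)

/* Expansion 1: one enum constant per row, plus a trailing count. */
#define X_ENUM(id, str, cap) id,
enum provider_id { PROVIDER_TABLE(X_ENUM) PROV_COUNT };
#undef X_ENUM

struct provider_config {
    const char *name;        /* string matched against the requested protocol */
    int wait_fd_supported;   /* illustrative per-provider setting */
};

/* Expansion 2: one config struct per row, in the same order as the enum. */
#define X_CONFIG(id, str, cap) { str, cap },
static const struct provider_config provider_configs[] = {
    PROVIDER_TABLE(X_CONFIG)
};
#undef X_CONFIG

/* Returns the matching config, or NULL when the protocol is not in the table. */
const struct provider_config *provider_lookup(const char *protocol)
{
    for (size_t i = 0; i < PROV_COUNT; i++) {
        if (strcmp(provider_configs[i].name, protocol) == 0)
            return &provider_configs[i];
    }
    return NULL;
}
```

With a table like this, `provider_lookup("verbs")` returns a config while `provider_lookup("mrail")` returns `NULL`, which is the kind of miss that would surface as a "Protocol mrail not supported" error.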

Philosophically it would be nice if Mercury would run atop any provider transparently, but Mercury takes a bunch of different strategies depending on what capabilities are likely to work in each one.

Maybe we could have a fall-back that just tries its best if it's given an ofi+ provider that's not in the table? Or maybe there is a more clever way to differentiate settings between providers than a hard-coded table?

carns, Nov 22 '19 21:11

I opened this issue for the philosophical point, but in this specific case it looks like mrail requires a lot of legwork to use.

roblatham00, Nov 26 '19 15:11

I think it should be feasible to simply default to whatever OFI returns and just have a warning printed in that case with some information so that we have a chance to know what we are using at least :)
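A rough sketch of that fallback idea, under stated assumptions: the names `prov_settings`, `known_provs`, and `prov_resolve` are hypothetical, the `tuned` flag stands in for whatever per-provider settings exist, and the code does not call libfabric at all. It only shows the control flow of "use the table if the provider is known, otherwise warn and continue with generic defaults".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical settings record; a real one would carry many more fields. */
struct prov_settings {
    const char *name;
    int tuned;   /* 1 = settings come from the known table, 0 = generic defaults */
};

/* Hypothetical list of providers with hand-tuned settings. */
static const struct prov_settings known_provs[] = {
    { "sockets", 1 },
    { "verbs",   1 },
    { "gni",     1 },
};

/* Resolve settings for a provider name. Unknown providers are accepted
 * with generic defaults, and a warning is printed so the user can tell
 * which provider is actually in use. */
struct prov_settings prov_resolve(const char *prov_name)
{
    size_t n = sizeof(known_provs) / sizeof(known_provs[0]);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(known_provs[i].name, prov_name) == 0)
            return known_provs[i];
    }

    fprintf(stderr,
            "# warning: provider \"%s\" is not in the known table; "
            "using generic defaults\n", prov_name);

    struct prov_settings fallback = { prov_name, 0 };
    return fallback;
}
```

Here `prov_resolve("mrail")` would print the warning and return a record with `tuned == 0` instead of failing. Whether generic defaults are actually safe for a given provider is a separate question that this sketch does not address.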

soumagne, Nov 26 '19 22:11