NA OFI: hostname not working on debian loopback interface

Open utkarshayachit opened this issue 3 years ago • 9 comments

My objective is to make the server run on a specific port. Explicitly specifying the address should make that work. However, it does not work if I use the hostname instead of an IP address or interface name when specifying the address for the server.

#include <unistd.h>
#include <thallium.hpp>
#include <iostream> // std::cout
#include <sstream>  // std::ostringstream
#include <string>

namespace tl = thallium;

/**
 * This is the ParaT server executable.
 */
int main(int argc, char* argv[])
{
  char buffer[256];
  gethostname(buffer, 256);

  std::ostringstream str;
  str << "tcp://" << buffer << ":11111";
  tl::engine myEngine(str.str(), THALLIUM_SERVER_MODE);
  std::cout << "requested address: " << str.str() << std::endl;
  std::cout << "server running at address: " << myEngine.self() << std::endl;
  return 0;
}

The output from this is as follows:

requested address: tcp://miron:11111
server running at address: ofi+tcp;ofi_rxm://192.168.1.73:33499

Platform (please complete the following information):

  • System description: Ubuntu 20.04.3 LTS
  • Compiler version: GCC 9.3.0
  • Plugin and protocol used: ofi+tcp (reported as ofi+tcp;ofi_rxm in the output above)
  • Dependency version: libfabric-1.13.0

utkarshayachit avatar Dec 17 '21 18:12 utkarshayachit

@utkarshayachit PR #537 should fix your issue; I just merged it to master. Would you be able to try it out?

soumagne avatar Dec 22 '21 22:12 soumagne

Thanks for the fix. It still fails for me, although now I get an error message rather than it just picking a random port.

Here's the output from the same test code from the issue.

# [31877.645652] mercury->fatal: [error] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na_ofi.c:2034
 # na_ofi_domain_open(): No provider found for "tcp;ofi_rxm" provider on domain "miron"
[error] Could not initialize hg_class
terminate called after throwing an instance of 'thallium::margo_exception'
  what():  [/home/utkarsh/Kitware/Mochi/spack/opt/spack/linux-ubuntu20.04-broadwell/gcc-9.3.0/mochi-thallium-develop-yufqz3w5lpooqe2rmafnxkynt7mr4kyc/include/thallium/engine.hpp:180][margo_init_ext] Could not initialize Margo

utkarshayachit avatar Dec 23 '21 21:12 utkarshayachit

Thanks. I think we'll get to the bottom of it. Can you please export HG_LOG_LEVEL=warning and HG_LOG_SUBSYS=na and rerun your test? That will give us more details. The output of fi_info on your system would also be helpful. I can't be sure until I see the logs, but it looks like your hostname cannot be resolved for some reason.

soumagne avatar Dec 24 '21 10:12 soumagne

Thinking more about it, you might also want to double-check that your /etc/hosts does not associate your hostname with an interface that is down or something like that. I can't really think of anything else that would be significantly different on your system compared to the ones we use.

soumagne avatar Dec 30 '21 19:12 soumagne

here are the results

# [481.582259] mercury->na: [error] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na_ip.c:212
 # na_ip_check_interface(): No ifa_name match found for IP
# [481.582289] mercury->cls: [warning] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na_ofi.c:3762
 # na_ofi_initialize(): Could not find matching interface for miron, attempting to use it as domain name
# [481.583238] mercury->fatal: [error] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na_ofi.c:2034
 # na_ofi_domain_open(): No provider found for "tcp;ofi_rxm" provider on domain "miron"
# [481.583258] mercury->cls: [error] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na_ofi.c:3834
 # na_ofi_initialize(): Could not open domain for tcp;ofi_rxm, miron
# [481.583270] mercury->cls: [error] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na.c:339
 # NA_Initialize_opt(): Could not initialize plugin
[error] Could not initialize hg_class
terminate called after throwing an instance of 'thallium::margo_exception'
  what():  [/home/utkarsh/Kitware/Mochi/spack/opt/spack/linux-ubuntu20.04-broadwell/gcc-9.3.0/mochi-thallium-develop-yufqz3w5lpooqe2rmafnxkynt7mr4kyc/include/thallium/engine.hpp:180][margo_init_ext] Could not initialize Margo
fish: “env HG_LOG_LEVEL=warning HG_LOG…” terminated by signal SIGABRT (Abort)

$ cat /etc/hosts
127.0.0.1 view-localhost
127.0.0.1       localhost
127.0.1.1       miron

utkarshayachit avatar Jan 03 '22 12:01 utkarshayachit

Thanks, I believe the third line in your /etc/hosts file is causing the issues you're having. You should either remove it or assign your permanent IP address to the hostname instead, as documented here: https://www.debian.org/doc/manuals/debian-reference/ch05.en.html#_the_hostname_resolution

soumagne avatar Jan 03 '22 13:01 soumagne

Having said that, we should probably also be able to support this type of loopback IP. I'll have a look to see if we can do that. In that case I'd expect miron:11111 to be resolved as 127.0.1.1:11111, which is probably not the IP you'd want anyway, but we should somehow support it.

soumagne avatar Jan 03 '22 13:01 soumagne
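
For anyone wanting to check this on their own machine, here is a minimal sketch (not part of Mercury or Thallium) that resolves a hostname with getaddrinfo() and tests whether an address falls in 127.0.0.0/8, which is what the 127.0.1.1 entry in /etc/hosts produces. The helper names is_loopback_ipv4 and resolve_ipv4 are hypothetical and only serve the illustration; the sketch assumes a POSIX system and IPv4.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <string>
#include <vector>

// Hypothetical helper: true if `ip` is an IPv4 literal inside 127.0.0.0/8.
bool is_loopback_ipv4(const std::string& ip)
{
  in_addr addr{};
  if (inet_pton(AF_INET, ip.c_str(), &addr) != 1)
  {
    return false; // not a valid IPv4 literal (e.g. a hostname)
  }
  // s_addr is in network byte order; after ntohl the first octet is the top byte.
  return (ntohl(addr.s_addr) >> 24) == 127;
}

// Hypothetical helper: IPv4 addresses that getaddrinfo() returns for `host`.
// On a typical Linux setup /etc/hosts is consulted before DNS.
std::vector<std::string> resolve_ipv4(const std::string& host)
{
  std::vector<std::string> result;

  addrinfo hints{};
  hints.ai_family = AF_INET;       // IPv4 only, to keep the sketch short
  hints.ai_socktype = SOCK_STREAM; // avoid one entry per socket type

  addrinfo* info = nullptr;
  if (getaddrinfo(host.c_str(), nullptr, &hints, &info) != 0)
  {
    return result; // resolution failed, return an empty list
  }
  for (addrinfo* p = info; p != nullptr; p = p->ai_next)
  {
    char text[INET_ADDRSTRLEN];
    const auto* sin = reinterpret_cast<const sockaddr_in*>(p->ai_addr);
    if (inet_ntop(AF_INET, &sin->sin_addr, text, sizeof(text)) != nullptr)
    {
      result.emplace_back(text);
    }
  }
  freeaddrinfo(info);
  return result;
}
```

Calling resolve_ipv4 with the string returned by gethostname() and passing each result to is_loopback_ipv4 would show whether the hostname maps to a loopback address, as it does with the /etc/hosts file posted above.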

FWIW, removing it does indeed seem to solve the issue

utkarshayachit avatar Jan 03 '22 13:01 utkarshayachit

Great, thanks for confirming.

soumagne avatar Jan 03 '22 13:01 soumagne
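
If editing /etc/hosts is not an option, the original report suggests that passing an IP address instead of the hostname works. A rough sketch of that workaround, assuming a POSIX system with getifaddrs(): pick the first IPv4 interface that is up and not a loopback, then build the same tcp://host:port string used in the test program. The helper names make_tcp_address and first_non_loopback_ipv4 are hypothetical, and which interface is the "right" one depends on your setup.

```cpp
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <cstdint>
#include <string>

// Hypothetical helper: formats "tcp://<ip>:<port>" as in the test program.
std::string make_tcp_address(const std::string& ip, std::uint16_t port)
{
  return "tcp://" + ip + ":" + std::to_string(port);
}

// Hypothetical helper: IPv4 address of the first interface that is up and
// not a loopback, or an empty string if there is none.
std::string first_non_loopback_ipv4()
{
  ifaddrs* list = nullptr;
  if (getifaddrs(&list) != 0)
  {
    return {};
  }

  std::string found;
  for (ifaddrs* p = list; p != nullptr; p = p->ifa_next)
  {
    if (p->ifa_addr == nullptr || p->ifa_addr->sa_family != AF_INET)
    {
      continue; // skip non-IPv4 entries
    }
    if (!(p->ifa_flags & IFF_UP) || (p->ifa_flags & IFF_LOOPBACK))
    {
      continue; // skip interfaces that are down or loopback
    }
    char text[INET_ADDRSTRLEN];
    const auto* sin = reinterpret_cast<const sockaddr_in*>(p->ifa_addr);
    if (inet_ntop(AF_INET, &sin->sin_addr, text, sizeof(text)) != nullptr)
    {
      found = text;
      break;
    }
  }
  freeifaddrs(list);
  return found;
}
```

The resulting string can then be passed to the engine in place of the hostname-based one, for example tl::engine myEngine(make_tcp_address(ip, 11111), THALLIUM_SERVER_MODE), after checking that ip is not empty.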