
Engine Threads >1 "Failed to connect to remote host"

Open GeorgKreuzmayr opened this issue 1 year ago • 2 comments

Hello everyone,

When I set engine_threads to any value greater than 1, I get an error when connecting from a client to a server.

With engine_threads=2, the error occurs only occasionally, i.e. not for every port combination. With engine_threads=16, it fails more often: on every port combination I tried.
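
A rough way to see why the failure rate could grow with the number of engines, offered only as an illustration. The sketch below assumes (this is a hypothesis, not something confirmed for Machnet or for the AWS network stack) that each engine polls its own NIC receive queue, that the NIC picks the receive queue for the handshake reply independently of which engine started the handshake, and that the connection succeeds only when the two coincide. The function name `connect_success_rate` and the uniform-random model are hypothetical.

```python
import random


def connect_success_rate(num_engines, trials=100_000, seed=0):
    """Estimate how often a handshake reply reaches the engine that sent it.

    Toy model (assumption): the initiating engine and the receive queue
    chosen by the NIC are independent and uniformly distributed.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        owner = rng.randrange(num_engines)     # engine that started the handshake
        rx_queue = rng.randrange(num_engines)  # queue on which the reply arrives
        if owner == rx_queue:
            successes += 1
    return successes / trials


if __name__ == "__main__":
    for n in (1, 2, 16):
        print(f"engine_threads={n}: success rate ~ {connect_success_rate(n):.3f}")
```

Under these assumptions the success rate is about 1/N for N engines, which would roughly fit the pattern of intermittent failures with 2 engines and near-constant failures with 16.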

The error message on the client is:

ubuntu@ip-172-31-32-21:~$ ${MSG_GEN} --local_ip 172.31.32.121 --remote_ip 172.31.32.120 --msg_size 64 --msg_window 32
I20241103 18:50:38.682282     1 main.cc:332] Starting in client mode, request size 64
Checking for file descriptor...
Got a file descriptor!
ERROR: Failed to dequeue response from control queue.
F20241103 18:50:49.975369     1 main.cc:346] Check failed: ret == 0 Failed to connect to remote host. machnet_connect() error: Unknown error -1
*** Check failure stack trace: ***
    @     0x7fa3d8ce3f03  google::LogMessage::Fail()
    @     0x7fa3d8ce793c  google::LogMessage::SendToLog()
    @     0x7fa3d8ce39e7  google::LogMessage::Flush()
    @     0x7fa3d8ce509f  google::LogMessageFatal::~LogMessageFatal()
    @     0x562d0c932a28  main
    @     0x7fa3d8866d90  (unknown)

The server is running on another EC2 instance, started with this command:

ubuntu@ip-172-31-32-20:~$ ${MSG_GEN} --local_ip 172.31.32.120 --msg_size 64 

On the other hand, with engine_threads=1 the connection succeeds:

ubuntu@ip-172-31-32-21:~$ ${MSG_GEN} --local_ip 172.31.32.121 --remote_ip 172.31.32.120 --msg_size 64 --msg_window 32
I20241103 18:06:00.837787     1 main.cc:332] Starting in client mode, request size 64
Checking for file descriptor...
Got a file descriptor!
I20241103 18:06:03.949545     1 main.cc:350] [CONNECTED] [172.31.32.121:1024 <-> 172.31.32.120:888]
I20241103 18:06:03.972815     7 main.cc:294] Client Loop: Starting.
TX/RX (msg/sec, Gbps): (0.0K/0.0K, 0.000/0.000). RTT (p50/99/99.9 us): 144/144/144
TX/RX (msg/sec, Gbps): (220.0K/220.0K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/195
TX/RX (msg/sec, Gbps): (220.0K/220.0K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/194
TX/RX (msg/sec, Gbps): (217.4K/217.4K, 0.111/0.111). RTT (p50/99/99.9 us): 143/179/543
TX/RX (msg/sec, Gbps): (220.0K/220.0K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/193
TX/RX (msg/sec, Gbps): (220.2K/220.2K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/190
TX/RX (msg/sec, Gbps): (220.1K/220.1K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/191
TX/RX (msg/sec, Gbps): (220.1K/220.1K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/189

In the commands above, MSG_GEN is defined as:

MSG_GEN="docker run -v /var/run/machnet:/var/run/machnet ghcr.io/microsoft/machnet/machnet:latest release_build/src/apps/msg_gen/msg_gen"

Setup: two EC2 instances of type c5n.18xlarge running kernel 6.5.0-1014-aws on Ubuntu 23.10.

GeorgKreuzmayr, Nov 14 '24 07:11

Hi Georg,

Thanks for using Machnet.

This behavior is expected. The master branch does not support running with an arbitrary number of engines on Amazon VMs.

Could you try the rss_blast branch (https://github.com/microsoft/machnet/tree/rss_blast) to see whether it eliminates the issue?

Thanks, Alireza

sarsanaee, Nov 15 '24 13:11

Hi @GeorgKreuzmayr,

As @sarsanaee pointed out, we have an experimental branch that aims to achieve connectivity when using multiple engines. We have not yet tried it on AWS; if you could give it a spin and let us know how it goes, that would be helpful!

Ilias

marinosi, Nov 15 '24 13:11