arachne icon indicating copy to clipboard operation
arachne copied to clipboard

many timeout: yes spot out

Open rmrf opened this issue 7 years ago • 4 comments

within our cloud deployment, for internal, with nearly 10+ hosts inside the array .

I can see many "timed_out":"yes" lines from the log output, and from statsd there is did no metrics between these 2 hosts. the communication seems working well under tcpdump view.

any other reason can makeing this timeout happen? will this caused by host_name instead of ip inside internal array?

rmrf avatar Jun 13 '17 09:06 rmrf

I tried using ip instead of host_name, still no help. even increase timeout to "timeout": "5s",, still showing

{"level":"info","ts":1497349039.773023,"msg":"Result","PID":25342,"hostname":"host1,"service_name":"arachne","host":"host4.","source_port":32001,"2-way":"-","1-way":"-","timed_out":"yes"}

rmrf avatar Jun 13 '17 10:06 rmrf

{"level":"error","ts":1497349506.3300288,"msg":"host exists in resultStore, but not in remoteStore","PID":16496,
"hostname":"host4","service_name":"arachne","host":"100.100.19.42",

btw, when I narrow down the communication between host1 and host4, I catch this kind of error log as above, this strange ip 100.100.19.42, I can find it nowhere! not sure how it came into the log.

not sure this related or not.

rmrf avatar Jun 13 '17 10:06 rmrf

Hi, @rmrf ! Are the timeouts happening just for source port 32001? If you try a range of ephemeral source ports do all timeout? You could troubleshoot that easily if you check the table generated in stdin when running with -foreground mode enabled. Could the listen port on host4 be firewalled?

Vasilis avatar Jun 14 '17 18:06 Vasilis

@Vasilis I'm not sure if the firewall working there which out of my control (we are using arachne inside a public cloud env), but I do tcpdump for the target port , I can see the network packet:

10:29:17.255501 IP 192.168.79.1.41013 > 192.168.64.249.54111: Flags [S], seq 4188118887, win 43690, length 0
10:29:17.255522 IP 192.168.79.1.41014 > 192.168.64.249.54111: Flags [S], seq 4188118887, win 43690, length 0
10:29:17.255531 IP 192.168.79.1.41015 > 192.168.64.249.54111: Flags [S], seq 4188118887, win 43690, length 0
10:29:17.255667 IP 192.168.64.249.54111 > 192.168.79.1.41000: Flags [R.], seq 0, ack 4188118888, win 0, length 0
10:29:17.255710 IP 192.168.64.249.54111 > 192.168.79.1.41001: Flags [R.], seq 0, ack 4188118888, win 0, length 0
10:29:17.255761 IP 192.168.64.249.54111 > 192.168.79.1.41002: Flags [R.], seq 0, ack 4188118888, win 0, length 0
10:29:17.255804 IP 192.168.64.249.54111 > 192.168.79.1.41003: Flags [R.], seq 0, ack 4188118888, win 0, length 0
10:29:17.255846 IP 192.168.64.249.54111 > 192.168.79.1.41004: Flags [R.], seq 0, ack 4188118888, win 0, length 0
10:29:17.255854 IP 192.168.64.249.54111 > 192.168.79.1.41005: Flags [R.], seq 0, ack 4188118888, win 0, length 0

here I'm run tcpdump on 192.168.79.1 host, with "base_src_tcp_port": 41000, and "target_tcp_port": 54111, configuration. the length 0 is the main different compare with those working nodes. which making the timeout happen.

Weird thing is: if I only run arachne between these 2 hosts above, everything working fine. then can see each other, but when I put internal configuration with 13 hosts, these 2 hosts can't see each other.

I tried to increase "num_src_tcp_ports": 16, to 160, but this didn't help, and I can notice latency increase like x5 when do this increase.

not sure what happen.

rmrf avatar Mar 22 '18 02:03 rmrf