arachne
arachne copied to clipboard
many timeout: yes spot out
within our cloud deployment, for internal
, with nearly 10+ hosts inside the array .
I can see many "timed_out":"yes"
lines from the log output, and from statsd there is did no metrics between these 2 hosts. the communication seems working well under tcpdump view.
any other reason can makeing this timeout happen? will this caused by host_name
instead of ip
inside internal array?
I tried using ip
instead of host_name
, still no help. even increase timeout to "timeout": "5s",
, still showing
{"level":"info","ts":1497349039.773023,"msg":"Result","PID":25342,"hostname":"host1,"service_name":"arachne","host":"host4.","source_port":32001,"2-way":"-","1-way":"-","timed_out":"yes"}
{"level":"error","ts":1497349506.3300288,"msg":"host exists in resultStore, but not in remoteStore","PID":16496,
"hostname":"host4","service_name":"arachne","host":"100.100.19.42",
btw, when I narrow down the communication between host1
and host4
, I catch this kind of error log as above, this strange ip 100.100.19.42
, I can find it nowhere! not sure how it came into the log.
not sure this related or not.
Hi, @rmrf !
Are the timeouts happening just for source port 32001
?
If you try a range of ephemeral source ports do all timeout? You could troubleshoot that easily if you check the table generated in stdin when running with -foreground
mode enabled.
Could the listen port on host4 be firewalled?
@Vasilis I'm not sure if the firewall working there which out of my control (we are using arachne inside a public cloud env), but I do tcpdump for the target port , I can see the network packet:
10:29:17.255501 IP 192.168.79.1.41013 > 192.168.64.249.54111: Flags [S], seq 4188118887, win 43690, length 0
10:29:17.255522 IP 192.168.79.1.41014 > 192.168.64.249.54111: Flags [S], seq 4188118887, win 43690, length 0
10:29:17.255531 IP 192.168.79.1.41015 > 192.168.64.249.54111: Flags [S], seq 4188118887, win 43690, length 0
10:29:17.255667 IP 192.168.64.249.54111 > 192.168.79.1.41000: Flags [R.], seq 0, ack 4188118888, win 0, length 0
10:29:17.255710 IP 192.168.64.249.54111 > 192.168.79.1.41001: Flags [R.], seq 0, ack 4188118888, win 0, length 0
10:29:17.255761 IP 192.168.64.249.54111 > 192.168.79.1.41002: Flags [R.], seq 0, ack 4188118888, win 0, length 0
10:29:17.255804 IP 192.168.64.249.54111 > 192.168.79.1.41003: Flags [R.], seq 0, ack 4188118888, win 0, length 0
10:29:17.255846 IP 192.168.64.249.54111 > 192.168.79.1.41004: Flags [R.], seq 0, ack 4188118888, win 0, length 0
10:29:17.255854 IP 192.168.64.249.54111 > 192.168.79.1.41005: Flags [R.], seq 0, ack 4188118888, win 0, length 0
here I'm run tcpdump on 192.168.79.1
host, with "base_src_tcp_port": 41000,
and "target_tcp_port": 54111,
configuration.
the length 0
is the main different compare with those working nodes. which making the timeout
happen.
Weird thing is:
if I only run arachne between these 2 hosts above, everything working fine. then can see each other, but when I put internal
configuration with 13 hosts, these 2 hosts can't see each other.
I tried to increase "num_src_tcp_ports": 16,
to 160, but this didn't help, and I can notice latency increase like x5 when do this increase.
not sure what happen.