Slow/stuck file transfers in the Nebula network between America/Europe/Asia nodes.
Hello, I can't figure out what's wrong and why the transfer rate drops significantly to a few KB/s and almost freezes. I tried setting a lower MTU on two nodes (but not on all network nodes), but this didn't help. I also tried disabling the Europe lighthouse nodes across the whole network, with the same result. I had this problem on Nebula 1.5.2 as well. Can anyone advise or check? Thank you.
Asia / USA through Nebula:
root@sg:/dev# rsync -av --progress america.vpn.ip:/tmp/50M.file /tmp/
[email protected]'s password:
receiving incremental file list
50M.file
7,634,944 14% 67.03kB/s 0:11:08
1,179,648 2% 71.06kB/s 0:12:01
1,277,952 2% 58.15kB/s 0:14:39
1,310,720 2% 9.79kB/s 1:27:00
1,966,080 3% 45.57kB/s 0:18:27
1,998,848 3% 39.42kB/s 0:21:19
2,064,384 3% 46.14kB/s 0:18:11
2,097,152 4% 49.54kB/s 0:16:56
Asia / USA through the Internet:
root@sg:/dev# rsync -av --progress america.real.ip:/tmp/50M.file /tmp/
[email protected]'s password:
receiving incremental file list
50M.file
31,817,728 60% 6.58MB/s 0:00:03
With other file transfers, from any continent to any continent in any direction, I have the same problems.
I have 14 lighthouse nodes: 4 in Europe, 10 in America
Lighthouse configuration:
static_host_map:
  "10.10.0.1": ["1.2.3.1:12345"] #Europe
  "10.10.0.2": ["1.2.3.2:12345"] #Europe
  "10.10.0.3": ["1.2.3.3:12345"] #Europe
  "10.10.0.4": ["1.2.3.4:12345"] #Europe
  "10.10.0.5": ["1.2.3.5:12345"] #America
  "10.10.0.6": ["1.2.3.6:12345"] #America
  "10.10.0.7": ["1.2.3.7:12345"] #America
  "10.10.0.8": ["1.2.3.8:12345"] #America
  "10.10.0.9": ["1.2.3.9:12345"] #America
  "10.10.0.10": ["1.2.3.10:12345"] #America
  "10.10.0.11": ["1.2.3.11:12345"] #America
  "10.10.0.12": ["1.2.3.12:12345"] #America
  "10.10.0.13": ["1.2.3.13:12345"] #America
  "10.10.0.14": ["1.2.3.14:12345"] #America

lighthouse:
  am_lighthouse: true
  interval: 30
  hosts:

listen:
  host: 0.0.0.0
  port: 12345

punchy:
  punch: true

relay:
  am_relay: true
  use_relays: false

tun:
  disabled: false
  dev: n2n0
  drop_local_broadcast: true
  drop_multicast: true
  tx_queue: 500
  mtu: 1290
  routes:
  unsafe_routes: #Additionally I have some unsafe routes
    - route: 192.168.24.0/24
      via: 10.10.0.79
      mtu: 1290
      metric: 100
    - route: 192.168.32.0/24
      via: 10.10.0.87
      mtu: 1290
      metric: 100

logging:
  level: warning
  format: text

firewall:
  conntrack:
    tcp_timeout: 15m
    udp_timeout: 3m
    default_timeout: 10m
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: any
      group: any
Other nodes' configuration:
static_host_map:
  "10.10.0.1": ["1.2.3.1:12345"] #Europe
  "10.10.0.2": ["1.2.3.2:12345"] #Europe
  "10.10.0.3": ["1.2.3.3:12345"] #Europe
  "10.10.0.4": ["1.2.3.4:12345"] #Europe
  "10.10.0.5": ["1.2.3.5:12345"] #America
  "10.10.0.6": ["1.2.3.6:12345"] #America
  "10.10.0.7": ["1.2.3.7:12345"] #America
  "10.10.0.8": ["1.2.3.8:12345"] #America
  "10.10.0.9": ["1.2.3.9:12345"] #America
  "10.10.0.10": ["1.2.3.10:12345"] #America
  "10.10.0.11": ["1.2.3.11:12345"] #America
  "10.10.0.12": ["1.2.3.12:12345"] #America
  "10.10.0.13": ["1.2.3.13:12345"] #America
  "10.10.0.14": ["1.2.3.14:12345"] #America

lighthouse:
  am_lighthouse: false
  interval: 30
  hosts:
    - "10.10.0.1"
    - "10.10.0.2"
    - "10.10.0.3"
    - "10.10.0.4"
    - "10.10.0.5"
    - "10.10.0.6"
    - "10.10.0.7"
    - "10.10.0.8"
    - "10.10.0.9"
    - "10.10.0.10"
    - "10.10.0.11"
    - "10.10.0.12"
    - "10.10.0.13"
    - "10.10.0.14"

listen:
  host: 0.0.0.0
  port: 12345

punchy:
  punch: true

relay:
  relays:
    - "10.10.0.1"
    - "10.10.0.2"
    - "10.10.0.3"
    - "10.10.0.4"
    - "10.10.0.5"
    - "10.10.0.6"
    - "10.10.0.7"
    - "10.10.0.8"
    - "10.10.0.9"
    - "10.10.0.10"
    - "10.10.0.11"
    - "10.10.0.12"
    - "10.10.0.13"
    - "10.10.0.14"
  am_relay: false
  use_relays: true

tun:
  disabled: false
  dev: n2n0
  drop_local_broadcast: true
  drop_multicast: true
  tx_queue: 500
  mtu: 1290
  routes:
  unsafe_routes: #Additionally I have some unsafe routes
    - route: 192.168.24.0/24
      via: 10.10.0.79
      mtu: 1290
      metric: 100
    - route: 192.168.32.0/24
      via: 10.10.0.87
      mtu: 1290
      metric: 100

logging:
  level: warning
  format: text

firewall:
  conntrack:
    tcp_timeout: 15m
    udp_timeout: 3m
    default_timeout: 10m
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: any
      group: any
That's a lot of lighthouses. Why does your network have so many?
A few quick thoughts that might help:

(1) In your hosts' relay.relays section, only list relays that are close to that host in terms of ping time. Meaning, I expect European hosts would only list the European relays, and American hosts would only list American relays. Your American relays could even be further segmented, if they're in different geographic regions - so American east-coast hosts would only list relays on the east coast, and vice versa for west-coast hosts. I expect those geographic realities to result in lower latency, and therefore faster ping times.

(2) In each host's config, specify
listen:
  read_buffer: 10485760
  write_buffer: 10485760
(these values come out of the commented-out values in the example Nebula config file here: https://github.com/slackhq/nebula/blob/master/examples/config.yml#L106)
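To make point (1) concrete, here is a minimal sketch of what a European host's relay section could look like, assuming 10.10.0.1-10.10.0.4 are the European relays as in the static_host_map above (pick whichever relays are actually closest by ping):

relay:
  relays:
    - "10.10.0.1" #Europe
    - "10.10.0.2" #Europe
    - "10.10.0.3" #Europe
    - "10.10.0.4" #Europe
  am_relay: false
  use_relays: true

American hosts would then list only relays from the 10.10.0.5-10.10.0.14 group, or a regional subset of them.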
If you hop into the OSS Nebula slack channel, you can get support there, too.
Hello, thanks for the reply.
I uncommented the read/write buffers on all network hosts, but this didn't help. The result is the same: the file transfer sometimes starts fast, then gets stuck and crawls at 56k-modem speed, then can sometimes speed up again.
I additionally tried setting routines: 8. This didn't help either.
I tried leaving only 4 lighthouse nodes (in the USA) in the configuration of all network hosts; this didn't help. And on the previous version of Nebula, 1.5.2, when I had no relays, the result was the same.
Is there any way to find out what the problem could be? Maybe change the mtu from 1290 to 1127, or even lower? Or increase tx_queue from 500 to 3000?
What I know for sure is that the internet is fast between hosts in Asia and the US, or Europe and the US.
Thanks.
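To be clear about where I would put those experimental values, here is a sketch of the relevant parts of my config (the values are only the ones I mentioned above, not settings I know to be correct, and as far as I understand routines is a top-level option in the example config and is Linux-only):

routines: 8 # top-level key, per examples/config.yml (Linux only)

listen:
  host: 0.0.0.0
  port: 12345
  read_buffer: 10485760 # already uncommented as suggested above
  write_buffer: 10485760

tun:
  tx_queue: 3000 # up from 500
  mtu: 1127 # down from 1290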
50M.file
950,272 1% 791.81kB/s 0:01:05
983,040 1% 133.50kB/s 0:06:25
1,015,808 1% 73.94kB/s 0:11:35
1,048,576 2% 52.77kB/s 0:16:13
1,081,344 2% 6.10kB/s 2:20:24
1,343,488 2% 17.97kB/s 0:47:23
1,376,256 2% 18.18kB/s 0:46:48
3,080,192 5% 104.83kB/s 0:07:50
3,309,568 6% 104.76kB/s 0:07:48
3,342,336 6% 87.21kB/s 0:09:22
3,375,104 6% 86.32kB/s 0:09:28
3,407,872 6% 13.74kB/s 0:59:26
3,440,640 6% 5.14kB/s 2:38:47
3,473,408 6% 5.14kB/s 2:38:42
3,506,176 6% 5.14kB/s 2:38:34
3,538,944 6% 5.14kB/s 2:38:29
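Is there something like the following I should run to narrow this down, for example checking the underlay path MTU and raw UDP throughput (using america.real.ip / america.vpn.ip as in my tests above)?

# Find the largest payload that passes without fragmentation toward the real IP.
# 1472 = 1500 - 20 (IP header) - 8 (ICMP header); lower -s step by step if it fails.
ping -M do -s 1472 -c 4 america.real.ip

# Compare UDP throughput over the internet vs. over the Nebula tunnel
# (run "iperf3 -s" on the American host first).
iperf3 -c america.real.ip -u -b 50M -t 15
iperf3 -c america.vpn.ip -u -b 50M -t 15

If the DF ping already fails well below 1472, the underlay path MTU is smaller than I assumed and lowering tun.mtu further would make sense; if plain UDP over the internet is also slow or lossy, the problem is below Nebula.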
Additionally, I attach my sysctl.conf (the same on most servers in the network). Maybe something in it interferes with the normal operation of the tunnels? Although there are no such problems between servers where the ping is good, so I'm not sure whether something is interfering.
#IP Forward
net/ipv4/ip_forward=1
#High Load Systems
net/ipv4/tcp_tw_reuse=1
#Disable ipv6
net/ipv6/conf/all/disable_ipv6=1
net/ipv6/conf/default/disable_ipv6=1
net/ipv6/conf/lo/disable_ipv6=1
#Max Concurrent Connections
net/core/somaxconn=262144
#Disable Accept Source Routing
net/ipv4/conf/all/accept_source_route=0
#Disable Accept Redirects
net/ipv4/conf/all/accept_redirects=0
#Enable Anti Spoofing
net/ipv4/conf/all/rp_filter=1
#Enable Ignore Broadcast Packets
net/ipv4/icmp_echo_ignore_broadcasts=1
#Enable Logging Bad Error Message Protection
net/ipv4/icmp_ignore_bogus_error_responses=1
#Disable Logging of Spoofed Packets, Source Routed Packets, Redirect Packets
net/ipv4/conf/all/log_martians=0
#Optimal Network Parameters
net/ipv4/tcp_congestion_control=yeah
net/core/netdev_max_backlog=262144
net/ipv4/tcp_no_metrics_save=1
net/ipv4/tcp_low_latency=1
net/ipv4/tcp_max_syn_backlog=262144
net/ipv4/tcp_mtu_probing=1
net/core/optmem_max=67108864
net/core/rmem_default=212992
net/core/wmem_default=212992
net/core/rmem_max=67108864
net/core/wmem_max=67108864
net/ipv4/tcp_rmem=4096 87380 33554432
net/ipv4/tcp_wmem=4096 65536 33554432
#Decrease TCP FIN TimeOut
net/ipv4/tcp_fin_timeout=3
#Decrease TCP KeepAlive Connections Interval
net/ipv4/tcp_keepalive_time=300
#Decrease TCP KeepAlive Probes
net/ipv4/tcp_keepalive_probes=3
#Disable SACK
net/ipv4/tcp_sack=0
#Time Orphan Retries
net/ipv4/tcp_orphan_retries=1
#Swap On 10% of Memory
vm/swappiness=10
#Core Pids
kernel/core_uses_pid=1
#Increase Inotify Settings
fs/inotify/max_user_watches=524288
fs/inotify/max_queued_events=65536
#Virtual Memory Settings
vm/overcommit_memory=1
vm/max_map_count=262144
#Auto-Reboot on Kernel Panic
kernel/panic=60
#Auto-Log on Kernel Panic
kernel/panic_on_oops=1
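One thing I will double-check, since the read_buffer / write_buffer suggestion depends on it: a normal SO_RCVBUF / SO_SNDBUF request is capped by the kernel at net.core.rmem_max / net.core.wmem_max, so the 67108864 limits from this sysctl.conf need to actually be applied on every host for the 10485760 buffers to take effect. For example:

# Confirm the per-socket buffer ceilings on each host
sysctl net.core.rmem_max net.core.wmem_max

# Inspect the actual buffer sizes on Nebula's UDP socket (listen port 12345 here)
ss -uamn 'sport = :12345'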