UDP send buffer errors over nebula interface
I'm proxying an application via an nginx proxy server to an app server over nebula, and despite a lot of tweaking to sysctl.conf settings and the read/write buffer sizes in the nebula configuration files, I keep getting regular UDP send buffer errors under even the lowest of loads. There are no buffer errors when not proxying over nebula.
I've tried increasing the following sysctl values. Upping net.ipv4.udp_wmem_min and udp_rmem_min in particular seems to have helped, but only up to a point, beyond which any increases still result in regular send buffer errors.
relevant section of sysctl.conf:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
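For reference, this is a quick way to confirm the values actually took effect after editing sysctl.conf (sketch, assuming the settings above live in /etc/sysctl.conf):
# reload /etc/sysctl.conf and print the resulting values
sudo sysctl -p
sysctl net.core.rmem_max net.core.wmem_max net.ipv4.udp_rmem_min net.ipv4.udp_wmem_min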
relevant section of nebula config:
listen:
  host: 0.0.0.0
  port: 4242
  read_buffer: 33554432
  write_buffer: 33554432
I've tried upping the write buffer in the nebula conf file but increased values seem to have no effect. Any idea what I'm missing?
Is it the nebula buffers or the nginx buffers that are having trouble?
Good question, how can I tell? I'm looking at the Udp section output of netstat -su on the application server.
Edit: assuming it's the nebula buffers as there are no buffer issues over a non-nebula connection?
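For reference, a simple way to keep an eye on just the Udp section of netstat -su while generating load (sketch; the -A line count may need adjusting on other kernels):
watch -n 1 'netstat -su | grep -A 7 "^Udp:"'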
I use ss -numpile; it will output a line starting with skmem that shows the memory stats specific to a socket instead of a global protocol stat. Another thing to look at is the tun tx queue via ifconfig; you may need to raise tun.tx_queue in your config if drops are occurring there.
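Concretely, something along these lines (sketch, assuming nebula listens on 4242 and the tun device is named nebula1; ip -s link works too if you prefer iproute2 over ifconfig):
# per-socket memory/drop stats for the nebula UDP socket
ss -numpile 'sport = :4242'
# queue length and drop counters on the tun device
ip -s link show dev nebula1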
Interesting, thanks. Looking at the ss -numpile output I'm seeing the expected wmem/rmem values being allocated (double what's specified in the nebula config), and the sendq usage doesn't seem to come close to that in use, so it seems unlikely that wmem is being maxed out. I'm a bit at a loss here.
I also don't see any errors or drops listed on the interface via ifconfig, but raising tun.tx_queue to 1000 from 500 and observing the results for a bit doesn't seem to have made any difference, either.
Usually we see rmem being the culprit; you can generally make your listen.write_buffer lower than listen.read_buffer. You also don't need to adjust sysctl for these changes, since nebula uses a special syscall in Linux to force the size of the buffer that you configure.
For ss -numpile you will get output similar to:
users:(("nebula",pid=83197,fd=7)) ino:439906 sk:3 cgroup:/system.slice/nebula.service v6only:0 <->
skmem:(r0,rb212992,t0,tb212992,f4096,w0,o0,bl0,d0)
If d (the last item in the list) is above 0 then you have drops on this socket. You can look at r and rb, or t and tb. If r is approaching rb then raise listen.read_buffer and ensure d is no longer increasing. If t is approaching tb then raise listen.write_buffer until d stops increasing.
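A rough way to keep an eye on those fields over time (sketch, assuming nebula is listening on 4242 as in your config):
while sleep 1; do ss -numpile 'sport = :4242' | grep -o 'skmem:([^)]*)'; done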
If the nebula socket(s) aren't experiencing buffer problems then you can use ss -numpile to find the socket that is increasing the global udp error counter that you referenced earlier, and refer to that program's documentation to hopefully clear the issue.
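For example, you can line the global counter up against the per-socket skmem lines (sketch; nstat is part of iproute2 and should expose the same counter that netstat -su is summarizing):
nstat -az UdpSndbufErrors UdpRcvbufErrors
ss -numpile | grep -B1 skmem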
Aside from the udp global stats, are you noticing any issues with nginx over nebula?
Thanks very much for this explanation. I started changing sysctl settings before realizing that the nebula config had its own buffer size configuration.
Looking through the ss -numpile output, the odd thing is that I am not seeing drops on the nebula socket or any other socket. They are all d0. The only place I am seeing errors is in the netstat -su output:
Udp:
3091100 packets received
39 packets to unknown port received
0 packet receive errors
5557302 packets sent
0 receive buffer errors
3696 send buffer errors
So I'm not exactly sure what I'm looking for, and googling isn't much help. Is it possible to see send buffer errors that aren't drops?
I'm not seeing any issues with nginx over nebula in practice, so I may be chasing ghosts, but I want to figure out what's going on before switching to nebula in production, in case it causes problems under more traffic.
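A way to watch the raw counter next to the tun interface's detailed TX stats while pushing traffic through the proxy (sketch, assuming iproute2 and the nebula1 device):
watch -n 1 'nstat -az UdpSndbufErrors; ip -s -s link show dev nebula1'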
What is your tun.mtu setting? Any interesting logs from nebula?
Nothing interesting in the nebula service logs, just handshakes. Nothing in syslog either that I can see. tun.mtu is set to 1000, was 500 earlier, both resulting in a similar amount of send buffer errors.
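As an aside, a quick way to check whether a given tun.mtu actually fits through the tunnel is a don't-fragment ping to the other side's nebula IP (sketch; for an MTU of 1300 the ICMP payload would be 1300 - 28 = 1272, adjust for your value):
ping -M do -s 1272 -c 3 192.168.110.1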
And this only occurs when you use nebula? Grasping for straws at the moment, any chance this host doesn't have ipv6 enabled and the other side of the tunnel you are using to test with does?
Yes, only when using nebula. Works fine with no udp buffer errors otherwise. Both hosts have ipv6 enabled, but I did notice something in the logs that puzzled me:
Jun 22 17:48:55 HOSTNAME nebula[237081]: time="2021-06-22T17:48:55-07:00" level=error msg="Failed to send handshake message" error="sendto: address family not supported by protocol" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=4075839950 udpAddr="[XXXX:XXX:XXXX:XXXX::1]:4242" vpnIp=XXX.XXX.XXX.XXX
Not sure why a valid IPv6 address wouldn't be supported by the UDP protocol? Could that be related?
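A few quick checks for whether this host can actually send to a v6 address (sketch; the last address is a placeholder for the redacted udpAddr in the log):
sysctl net.ipv6.conf.all.disable_ipv6    # 1 means IPv6 is disabled system-wide
ip -6 addr show scope global             # is there a global v6 address at all?
ip -6 route get 2001:db8::1              # replace with the udpAddr from the log line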
Is your listen.host set to [::] or 0.0.0.0?
It was 0.0.0.0; changed it to [::] but still getting send buffer errors.
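To confirm what the nebula socket is actually bound to after the change (sketch):
ss -nulp 'sport = :4242'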
Any chance you can jump into the NebulaOSS Slack? It would be easier there. If not, assuming you've done a full nebula restart after that config change, are you seeing any errors in the logs?
Yup, restarted after making the config change, just a few handshake messages and four identical "Refusing to handshake with myself" error messages for the host in question.
"Refusing to handshake with myself" is an interesting one, can you provide the config on this machine? Should be entirely unrelated, hopefully a configuration issue.
Any of those log messages an error? Is the udp error counter increasing?
"Refusing to handshake with myself" are all level=error
. Error count still increasing, the amount is just dependent on traffic load (usually by between 30-80 at once). There are no further nebula log messages after the initial startup/handshakes.
Full config:
pki:
  ca: /etc/nebula/certs/ca.crt
  cert: /etc/nebula/certs/goblin.crt
  key: /etc/nebula/certs/goblin.key
static_host_map:
  "192.168.110.1": ["XXX.XXX.XXX.XXX:4242"]
lighthouse:
  hosts:
    - "192.168.110.1"
  # Ignore docker interfaces and bridge network
  local_allow_list:
    interfaces:
      'br-*': false
      'docker.*': false
listen:
  host: "[::]"
  port: 4242
  read_buffer: 16777216
  write_buffer: 16777216
punchy:
  punch: true
tun:
  dev: nebula1
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 1000
  mtu: 1300
  routes:
  unsafe_routes:
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    # Allow all outbound traffic from this node
    - port: any
      proto: any
      host: any
  inbound:
    # Allow all between any nebula hosts
    - port: any
      proto: any
      host: any
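(For what it's worth, nebula can parse and print the effective config without bringing the tunnel up; if I remember the flag correctly, and assuming the file lives at /etc/nebula/config.yml:)
nebula -test -config /etc/nebula/config.yml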
What is the nebula ip of this machine? Are you running v1.4 or v1.3 here?
My assumption currently is that another machine is advertising an unroutable ip address that nebula is trying to hole punch to, causing the udp error counter to increase. You can verify by enabling the sshd and running list-hostmap and/or list-lighthouse-addrmap and checking to see if there are any unroutable addresses in the list.
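Roughly, assuming you enable the debug sshd in your config on 127.0.0.1:2222 with an authorized user (both of those are placeholders for whatever you configure):
ssh -p 2222 youruser@127.0.0.1
# then at the nebula prompt:
#   list-hostmap
#   list-lighthouse-addrmap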
I also see your firewall.conntrack.tcp_timeout is set to 120h; the default is 12m. Did you mean to raise it?
Running v1.4. Nebula IP is 192.168.110.2 (lighthouse is 192.168.110.1).
To simplify the troubleshooting I changed the configs for the lighthouse/hosts to only listen on the public ipv4 address for each machine. No more error messages in the nebula logs, but still seeing send buffer errors at the same rate...
Circling back around, were you able to determine if nebula was the source of the udp send buffer errors?
Sadly I wasn't able to resolve this. I kept seeing udp send buffer errors on the nebula interface despite many settings tweaks, and ultimately moved reverse proxy traffic out of the tunnel.
Would you happen to be running Docker, LXC, libvirt, etc. that manages interfaces? I'm struggling with a few nodes that are seeing a lot of "Refusing to handshake with myself" messages on both CentOS and Debian boxes that are running Docker and/or LXC (via libvirt); udpAddr may or may not change between handshake attempts.
I'd wager there are some remote_allow_list/local_allow_list settings that could deal with at least a bit of that, but I don't know them off the top of my head. Or I might not know what I'm talking about, that's also always a possibility.
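In case it helps, a quick way to see which local addresses (docker0, br-*, etc.) a node could be advertising to the lighthouse before fiddling with the allow lists (sketch):
ip -o -4 addr show | awk '{print $2, $4}'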
It's unfortunate that we weren't able to resolve this, but since it's been over a year since the last comment by the ticket author I'm going to close this issue out as stale. Please file another issue or ping me to reopen this one if you're continuing to investigate / have issues!