weave-tc
No AAAA packet delay observed in tcpdump
Hi, thank you so much for the blog post! We've been seeing the intermittent DNS issue for quite some time now, and your workaround seems to be the most promising so far.
I've installed it on one of our Kubernetes clusters, but I'm still seeing 2.5s DNS queries at about the same rate as before installation. As you suggested in #3, I tried playing around with the delay. Even when I set it to 50ms, I couldn't see any delay on the AAAA packets; the A and AAAA packets are within a millisecond of each other.
My questions are:
- How do I check if the tc rules are working as they should? Where can I observe the delay?
- I guess I don't really understand the packet flow of this workaround in general. Does the marked AAAA packet enter the netem queue in POSTROUTING on the client host, or does the mark persist across hosts so that it enters the netem queue on the server?
Things I have tried:
- Running tcpdump -i weave 'udp port 53' | grep some-random-query on the client/server hosts and in a CoreDNS pod: no delay observed.
- Adding -j LOG to the iptables rule: packets seem to get marked just fine. Below is an example line from the output of iptables -L POSTROUTING -v -n -t mangle --line-numbers | grep 00001c0001:
1 43M 3407M MARK udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:53 STRING match "|00001c0001|" ALGO name bm FROM 40 TO 65535 u32 "0x1c&0xf8=0x0" MARK or 0x100
- Watching watch tc -d -s qdisc show dev weave: the netem queue's packet counter didn't increase when I was making queries. I've tried this on both the client and server hosts.
Every 2.0s: tc -s qdisc show dev weave Fri Nov 16 22:24:46 2018
qdisc prio 1: root refcnt 2 bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Sent 15579969405 bytes 19324241 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc fq_codel 12: parent 1:2 limit 10240p flows 1024 quantum 8206 target 5.0ms interval 100.0ms ecn
Sent 15579857578 bytes 19323180 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
maxpacket 65970 drop_overlimit 0 new_flow_count 11744479 ecn_mark 0
new_flows_len 1 old_flows_len 33
qdisc netem 11: parent 1:1 limit 1000 delay 50.0ms 1.0ms
Sent 111827 bytes 1061 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
- I also have two Node.js processes (containerized, one on Alpine and another on Debian) that do 64 DNS lookups every second and send Datadog metrics of the lookup time.
Hi there,
Sorry, this issue slipped through the cracks of my flooded mailbox.
One thing I am wondering about: you mention typically seeing a 2.5s delay, whereas the default DNS timeout is 5s, which is the signature of lost DNS packets. Have you checked whether the insert_failed counter increases? If not, then it may not be the same bug.
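For reference, one quick way to watch that counter on a node (a sketch, assuming the conntrack-tools package is installed):

# Per-CPU conntrack statistics; insert_failed is the counter associated with this race.
$ sudo conntrack -S
# The same counters are also exposed in /proc (values are in hexadecimal):
$ cat /proc/net/stat/nf_conntrack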
The filtering happens here, matching the destination port, the DNS query bits, and the AAAA bits. It leaves a mark on the packet, which should be visible in a tcpdump (or in the iptables counters, as you checked).
The latency is introduced when the packet hits the ${NET_OVERLAY_IF} interface, as long as the packet is marked. Typically, the ${NET_OVERLAY_IF} (e.g. weave, flannel...) will then do its own logic to route the packets to the right destination node - with an artificial delay on select packets. This is because the bug is triggered when the packet lands on the destination node, as it is being handled by the kernel.
It looks like the netem qdisc did manage some packets though? The only reason it would is if the filtering was working.
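For anyone following along, here is roughly the plumbing weave-tc sets up, reconstructed as a sketch from the qdisc, filter, and iptables outputs pasted in this thread (the weave interface name and the 4ms delay are assumptions; the actual script is the source of truth):

# Mark outgoing AAAA queries: UDP dport 53 with the AAAA qtype/class bytes |00001c0001| in the payload.
$ iptables -t mangle -A POSTROUTING -p udp --dport 53 \
    -m string --algo bm --from 40 --to 65535 --hex-string "|00001c0001|" \
    -m u32 --u32 "0x1c&0xf8=0x0" \
    -j MARK --or-mark 0x100

# Two-band prio qdisc on the overlay interface: everything defaults to band 1:2 (fq_codel),
# while band 1:1 carries the netem delay.
$ tc qdisc add dev weave root handle 1: prio bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
$ tc qdisc add dev weave parent 1:1 handle 11: netem delay 4ms 1ms
$ tc qdisc add dev weave parent 1:2 handle 12: fq_codel

# Steer packets carrying the 0x100 mark into the delayed band.
$ tc filter add dev weave parent 1: protocol all prio 1 handle 0x100/0x100 fw classid 1:1

So the delay is applied on egress from the source node's overlay interface: if it is working, the netem "Sent" counter should tick up for every AAAA query, and a tcpdump on the destination node should show the AAAA query arriving a few milliseconds after its A counterpart.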
Hi, no worries at all. I haven't looked into this for a while either. Thanks for getting back to me.
> One thing I am wondering about: you mention typically seeing a 2.5s delay, whereas the default DNS timeout is 5s, which is the signature of lost DNS packets. Have you checked whether the insert_failed counter increases? If not, then it may not be the same bug.
I'm using Alpine and, if I read the musl libc code correctly, it polls for the DNS response with a total timeout of 5s and a maximum of two attempts, so each poll has a 2.5s timeout, as the code referenced below shows. Polling logic: http://git.musl-libc.org/cgit/musl/tree/src/network/res_msend.c#n111 Timeout and attempts config: https://git.musl-libc.org/cgit/musl/tree/src/network/resolvconf.c#n17
I usually observed a 2.5s delay because the lookup mostly got a response on the second try, but I've also seen 5s delays in the mix. I definitely saw insert_failed increase.
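To make the arithmetic behind those numbers explicit (a sketch based on the musl defaults referenced above, where the retransmit interval is the total timeout divided by the number of attempts):

# timeout:5 / attempts:2  =>  retransmit after 2.5s
#   t=0.0s  A + AAAA queries sent
#   t=2.5s  retransmit (first response lost)  -> lookups answered here show ~2.5s
#   t=5.0s  give up                           -> lookups where both attempts are lost show ~5s
$ awk 'BEGIN { printf "per-attempt timeout: %.1fs\n", 5 / 2 }'
per-attempt timeout: 2.5s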
> The latency is introduced when the packet hits the ${NET_OVERLAY_IF} interface, as long as the packet is marked. Typically, the ${NET_OVERLAY_IF} (e.g. weave, flannel...) will then do its own logic to route the packets to the right destination node - with an artificial delay on select packets. This is because the bug is triggered when the packet lands on the destination node, as it is being handled by the kernel.
So the delay would happen on the source node, right? The weird thing is that the netem packet counter didn't increase around the time I made the queries. As I mentioned, I didn't see the number of filtered packets increase on either the source or the destination node, and I didn't see any delay in tcpdump either. (Our cluster is getting hit badly by a "blocked by weave-npc" problem right now; I will double-check these once we mitigate or fix that issue.)
Hi @chotiwat and @Quentin-M - I'm also noticing no packets being delayed:
$ sudo tc -s qdisc show dev weave
qdisc prio 1: root refcnt 2 bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Sent 9827532 bytes 47527 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc fq_codel 12: parent 1:2 limit 10240p flows 1024 quantum 8926 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
Sent 9827532 bytes 47527 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
maxpacket 36635 drop_overlimit 0 new_flow_count 12403 ecn_mark 0
new_flows_len 1 old_flows_len 3
qdisc netem 11: parent 1:1 limit 1000 delay 6.0ms 1.0ms
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Filtering seems OK (maybe?):
$ sudo tc filter show dev weave
filter parent 1: protocol all pref 1 fw
filter parent 1: protocol all pref 1 fw handle 0x100/0x100 classid 1:1
And there is an appropriate iptables mangle rule which seems to be working:
$ sudo iptables -L -n -v -t mangle
Chain PREROUTING (policy ACCEPT 529K packets, 547M bytes)
pkts bytes target prot opt in out source destination
Chain INPUT (policy ACCEPT 269K packets, 276M bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 300K packets, 279M bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 223K packets, 344M bytes)
pkts bytes target prot opt in out source destination
Chain POSTROUTING (policy ACCEPT 502K packets, 619M bytes)
pkts bytes target prot opt in out source destination
10472 1056K MARK udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:53 STRING match "|00001c0001|" ALGO name bm FROM 40 TO 65535 u32 "0x1c&0xf8=0x0" MARK or 0x100
Any ideas? I'm running the script slightly modified (6ms of delay instead of 4ms).
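(For what it's worth, on a node that is already configured, the delay can also be adjusted in place instead of re-running the script; a sketch assuming the qdisc handles shown above:)

# Change only the netem delay on the existing qdisc (6ms mean, 1ms jitter).
$ sudo tc qdisc change dev weave parent 1:1 handle 11: netem delay 6ms 1ms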
Hi @hairyhenderson,
It does seem like the POSTROUTING rule works, as the packet count is non-zero, so packets should get marked by the rule. Yet the netem qdisc is not receiving (and thus not delaying) any of the marked packets.
Could you verify whether the packets actually carry the 0x100/0x100 mark? My first thought is that something else may be rewriting the mark before it hits the qdisc.
@Quentin-M ok, I've added a log rule:
$ sudo iptables -t mangle -A POSTROUTING --match mark --mark 0x100/0x100 -j LOG
Which took effect:
Chain POSTROUTING (policy ACCEPT 48130 packets, 33M bytes)
pkts bytes target prot opt in out source destination
111K 11M MARK udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:53 STRING match "|00001c0001|" ALGO name bm FROM 40 TO 65535 u32 "0x1c&0xf8=0x0" MARK or 0x100
2619 247K LOG all -- * * 0.0.0.0/0 0.0.0.0/0 mark match 0x100/0x100 LOG flags 0 level 4
And I see marks on the logged traffic:
Jun 13 23:18:24 ip-10-20-93-83 kernel: [624200.081609] IN= OUT=weave PHYSIN=vethweplf67ca20 PHYSOUT=vethwe-bridge SRC=100.96.128.2 DST=100.110.0.2 LEN=101 TOS=0x00 PREC=0x00 TTL=64 ID=10197 DF PROTO=UDP SPT=55570 DPT=53 LEN=81 MARK=0x40100
Jun 13 23:18:24 ip-10-20-93-83 kernel: [624200.084066] IN= OUT=weave PHYSIN=vethweplf67ca20 PHYSOUT=vethwe-bridge SRC=100.96.128.2 DST=100.110.0.2 LEN=88 TOS=0x00 PREC=0x00 TTL=64 ID=10200 DF PROTO=UDP SPT=59214 DPT=53 LEN=68 MARK=0x40100
Interesting that the mark is 0x40100, though since it's masked, that should be OK, right?
Also curious: in the 3 hours since I posted that, I now see 2 packets made it to the netem qdisc:
$ sudo tc -s qdisc show dev weave
qdisc prio 1: root refcnt 2 bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Sent 332687884 bytes 274188 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc fq_codel 12: parent 1:2 limit 10240p flows 1024 quantum 8926 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
Sent 332413819 bytes 272917 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
maxpacket 62482 drop_overlimit 0 new_flow_count 52538 ecn_mark 0
new_flows_len 1 old_flows_len 5
qdisc netem 11: parent 1:1 limit 1000 delay 4.0ms 1.0ms
Sent 224 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
I'll dig into where the 0x40100 came from... 🤔
The nat table has a couple of chains that mark packets, but they use 0x4000 and 0x8000:
Chain KUBE-MARK-DROP (0 references)
pkts bytes target prot opt in out source destination
0 0 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 MARK or 0x8000
Chain KUBE-MARK-MASQ (617 references)
pkts bytes target prot opt in out source destination
17 1020 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000
Either way, should the 0x100 used in tc filter not match 0x40100? It should be masked, right? It's been 15 years since I've worked regularly with netfilter, so I'm feeling quite out of my depth here 😅
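(As a sanity check of the masking arithmetic: the fw classifier compares only the bits covered by the mask against the filter handle, so the extra bits in 0x40100 should not prevent a match:)

# With mask 0x100, only bit 8 of the mark is compared against the handle:
$ printf '0x%x\n' $(( 0x40100 & 0x100 ))
0x100
$ printf '0x%x\n' $(( 0x100 & 0x100 ))
0x100
# Both marks reduce to 0x100, so both should be steered into class 1:1 by the fw filter.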
Update: I do actually also see traffic marked with 0x100 in the logs:
Jun 14 00:00:59 ip-10-20-93-83 kernel: [626755.244006] IN= OUT=weave PHYSIN=vethwepl1285445 PHYSOUT=vethwe-bridge SRC=100.96.128.6 DST=100.110.0.2 LEN=104 TOS=0x00 PREC=0x00 TTL=64 ID=20324 DF PROTO=UDP SPT=34501 DPT=53 LEN=84 MARK=0x100
Jun 14 00:01:00 ip-10-20-93-83 kernel: [626756.099632] IN= OUT=weave PHYSIN=vethwepl1285445 PHYSOUT=vethwe-bridge SRC=100.96.128.6 DST=100.120.64.10 LEN=92 TOS=0x00 PREC=0x00 TTL=64 ID=20530 DF PROTO=UDP SPT=42855 DPT=53 LEN=72 MARK=0x100
Looks mostly the same. Over the period where I was logging, there were significantly more 0x40100s (35487) than 0x100s (1501).
Right, as long as our mark does not get overwritten by others, we are in the clear. Two packets in three hours... that seems crazy low, even for a pretty inactive cluster.
Indeed. It was certainly missing some.
Since then, I've upgraded the kernel to 4.19.x (from 4.9) and disabled IPv6 in the kernel. After that, I stopped observing the high DNS latencies. I've not yet been able to narrow down which piece fixed it, unfortunately.
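(For reference, disabling IPv6 in the kernel is typically done in one of the following ways; which one was used here isn't specified:)

# At runtime, via sysctl:
$ sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
$ sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
# Or at boot, by adding ipv6.disable=1 to the kernel command line (e.g. in the GRUB config),
# which prevents the IPv6 stack from being initialized at all.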