weave-tc
No AAAA packet delay observed in tcpdump
Hi, thank you so much for the blog post! We've been seeing the intermittent DNS issue for quite some time now, and your workaround seems to be the most promising so far.
I've installed it on one of our Kubernetes clusters, but I'm still seeing 2.5s DNS queries at about the same rate as before installation. As you suggested in #3, I tried playing around with the delay. Even when I set it to 50ms, I couldn't see any delay on the AAAA packets; the A and AAAA packets are within a millisecond of each other.
My questions are:
- How do I check if the tc rules are working as they should? Where can I observe the delay?
- I guess I don't really understand the packet flow of this workaround in general. Does the marked AAAA packet enter the netem queue in POSTROUTING on the client host, or does the mark persist across hosts so that it enters the netem queue on the server?
Things I have tried:
- Running tcpdump -i weave 'udp port 53' | grep some-random-query on the client/server hosts and in a CoreDNS pod: no delay observed.
- Adding -j LOG to the iptables rule: packets seem to get marked just fine. Below is an example line from the output of iptables -L POSTROUTING -v -n -t mangle --line-numbers | grep 00001c0001:
1 43M 3407M MARK udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:53 STRING match "|00001c0001|" ALGO name bm FROM 40 TO 65535 u32 "0x1c&0xf8=0x0" MARK or 0x100
- Watching watch tc -d -s qdisc show dev weave: the netem queue's packet counter didn't increase when I was making queries. I've tried this on both the client and server hosts.
Every 2.0s: tc -s qdisc show dev weave Fri Nov 16 22:24:46 2018
qdisc prio 1: root refcnt 2 bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Sent 15579969405 bytes 19324241 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc fq_codel 12: parent 1:2 limit 10240p flows 1024 quantum 8206 target 5.0ms interval 100.0ms ecn
Sent 15579857578 bytes 19323180 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
maxpacket 65970 drop_overlimit 0 new_flow_count 11744479 ecn_mark 0
new_flows_len 1 old_flows_len 33
qdisc netem 11: parent 1:1 limit 1000 delay 50.0ms 1.0ms
Sent 111827 bytes 1061 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
- I also have two Node.js processes (containerized, one on Alpine and another on Debian) that do 64 DNS lookups every second and send Datadog metrics of the lookup time.
Hi there,
Sorry, this issue slipped through the cracks of my flooded mailbox.
One thing I am wondering about: you mention typically seeing a 2.5s delay, whereas the default DNS timeout is 5s, which is the signature of lost DNS packets. Have you checked whether the insert_failed counter increases? If not, then it may not be the same bug.
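For reference, one quick way to watch that counter on a node (a sketch, assuming the conntrack-tools package is installed):

# Per-CPU conntrack statistics; insert_failed is the counter associated with this race.
$ sudo conntrack -S
# The same counters are also exposed in /proc (values are in hexadecimal):
$ cat /proc/net/stat/nf_conntrack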
The filtering happens here, matching the destination port, the DNS query bits, and the AAAA bits. It leaves a mark on the packet, which should be visible in a tcpdump (or in the iptables counters, as you checked).
The latency is introduced when the packet hits the ${NET_OVERLAY_IF} interface, as long as the packet is marked. Typically, the ${NET_OVERLAY_IF} (e.g. weave, flannel...) will then do its own logic to route the packets to the right destination node - with an artificial delay on select packets. This is because the bug is triggered when the packet lands on the destination node, as it is being handled by the kernel.
It looks like the netem qdisc did manage some packets though? The only reason it would is if the filtering was working.
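For anyone following along, here is roughly the plumbing weave-tc sets up, reconstructed as a sketch from the qdisc, filter, and iptables outputs pasted in this thread (the weave interface name and the 4ms delay are assumptions; the actual script is the source of truth):

# Mark outgoing AAAA queries: UDP dport 53 with the AAAA qtype/class bytes |00001c0001| in the payload.
$ iptables -t mangle -A POSTROUTING -p udp --dport 53 \
    -m string --algo bm --from 40 --to 65535 --hex-string "|00001c0001|" \
    -m u32 --u32 "0x1c&0xf8=0x0" \
    -j MARK --or-mark 0x100

# Two-band prio qdisc on the overlay interface: everything defaults to band 1:2 (fq_codel),
# while band 1:1 carries the netem delay.
$ tc qdisc add dev weave root handle 1: prio bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
$ tc qdisc add dev weave parent 1:1 handle 11: netem delay 4ms 1ms
$ tc qdisc add dev weave parent 1:2 handle 12: fq_codel

# Steer packets carrying the 0x100 mark into the delayed band.
$ tc filter add dev weave parent 1: protocol all prio 1 handle 0x100/0x100 fw classid 1:1

So the delay is applied on egress from the source node's overlay interface: if it is working, the netem "Sent" counter should tick up for every AAAA query, and a tcpdump on the destination node should show the AAAA query arriving a few milliseconds after its A counterpart.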
Hi, no worries at all. I haven't looked into this for a while either. Thanks for getting back to me.
> One thing I am wondering about: you mention typically seeing a 2.5s delay, whereas the default DNS timeout is 5s, which is the signature of lost DNS packets. Have you checked whether the insert_failed counter increases? If not, then it may not be the same bug.
I'm using Alpine and, if I read the musl libc code correctly, it polls for the DNS response with a total timeout of 5s and a maximum of two attempts, so each poll has a 2.5s timeout, as the code referenced below shows. Polling logic: http://git.musl-libc.org/cgit/musl/tree/src/network/res_msend.c#n111 Timeout and attempts config: https://git.musl-libc.org/cgit/musl/tree/src/network/resolvconf.c#n17
I usually observed a 2.5s delay because the lookup mostly got a response on the second try, but I've also seen 5s delays in the mix. I definitely saw insert_failed increase.
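To make the arithmetic behind those numbers explicit (a sketch based on the musl defaults referenced above, where the retransmit interval is the total timeout divided by the number of attempts):

# timeout:5 / attempts:2  =>  retransmit after 2.5s
#   t=0.0s  A + AAAA queries sent
#   t=2.5s  retransmit (first response lost)  -> lookups answered here show ~2.5s
#   t=5.0s  give up                           -> lookups where both attempts are lost show ~5s
$ awk 'BEGIN { printf "per-attempt timeout: %.1fs\n", 5 / 2 }'
per-attempt timeout: 2.5s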
> The latency is introduced when the packet hits the ${NET_OVERLAY_IF} interface, as long as the packet is marked. Typically, the ${NET_OVERLAY_IF} (e.g. weave, flannel...) will then do its own logic to route the packets to the right destination node - with an artificial delay on select packets. This is because the bug is triggered when the packet lands on the destination node, as it is being handled by the kernel.
So the delay would happen on the source node, right? The weird thing is that the netem packet counter didn't increase around the time I made the queries. As I mentioned, I didn't see the number of filtered packets increase on either the source or the destination node, and I didn't see any delay in tcpdump either. (Our cluster is getting hit badly by a "blocked by weave-npc" problem right now; I will double-check these once we mitigate or fix that issue.)
Hi @chotiwat and @Quentin-M - I'm also noticing no packets being delayed:
$ sudo tc -s qdisc show dev weave
qdisc prio 1: root refcnt 2 bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Sent 9827532 bytes 47527 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc fq_codel 12: parent 1:2 limit 10240p flows 1024 quantum 8926 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
Sent 9827532 bytes 47527 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
maxpacket 36635 drop_overlimit 0 new_flow_count 12403 ecn_mark 0
new_flows_len 1 old_flows_len 3
qdisc netem 11: parent 1:1 limit 1000 delay 6.0ms 1.0ms
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Filtering seems OK (maybe?):
$ sudo tc filter show dev weave
filter parent 1: protocol all pref 1 fw
filter parent 1: protocol all pref 1 fw handle 0x100/0x100 classid 1:1
And there is an appropriate iptables mangle rule which seems to be working:
$ sudo iptables -L -n -v -t mangle
Chain PREROUTING (policy ACCEPT 529K packets, 547M bytes)
pkts bytes target prot opt in out source destination
Chain INPUT (policy ACCEPT 269K packets, 276M bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 300K packets, 279M bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 223K packets, 344M bytes)
pkts bytes target prot opt in out source destination
Chain POSTROUTING (policy ACCEPT 502K packets, 619M bytes)
pkts bytes target prot opt in out source destination
10472 1056K MARK udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:53 STRING match "|00001c0001|" ALGO name bm FROM 40 TO 65535 u32 "0x1c&0xf8=0x0" MARK or 0x100
Any ideas? I'm running the script slightly modified (6ms of delay instead of 4ms).
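(For what it's worth, on a node that is already configured, the delay can also be adjusted in place instead of re-running the script; a sketch assuming the qdisc handles shown above:)

# Change only the netem delay on the existing qdisc (6ms mean, 1ms jitter).
$ sudo tc qdisc change dev weave parent 1:1 handle 11: netem delay 6ms 1ms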
Hi @hairyhenderson,
It does seem like the POSTROUTING rule works, as the packet count is non-zero, so packets should get marked by the rule. Yet the netem qdisc is not receiving (and thus not delaying) any of the marked packets.
Could you verify whether the packets actually carry the 0x100/0x100 mark? My first thought is that something else may be rewriting the mark before it hits the qdisc.
@Quentin-M ok, I've added a log rule:
$ sudo iptables -t mangle -A POSTROUTING --match mark --mark 0x100/0x100 -j LOG
Which took effect:
Chain POSTROUTING (policy ACCEPT 48130 packets, 33M bytes)
pkts bytes target prot opt in out source destination
111K 11M MARK udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:53 STRING match "|00001c0001|" ALGO name bm FROM 40 TO 65535 u32 "0x1c&0xf8=0x0" MARK or 0x100
2619 247K LOG all -- * * 0.0.0.0/0 0.0.0.0/0 mark match 0x100/0x100 LOG flags 0 level 4
And I see marks on the logged traffic:
Jun 13 23:18:24 ip-10-20-93-83 kernel: [624200.081609] IN= OUT=weave PHYSIN=vethweplf67ca20 PHYSOUT=vethwe-bridge SRC=100.96.128.2 DST=100.110.0.2 LEN=101 TOS=0x00 PREC=0x00 TTL=64 ID=10197 DF PROTO=UDP SPT=55570 DPT=53 LEN=81 MARK=0x40100
Jun 13 23:18:24 ip-10-20-93-83 kernel: [624200.084066] IN= OUT=weave PHYSIN=vethweplf67ca20 PHYSOUT=vethwe-bridge SRC=100.96.128.2 DST=100.110.0.2 LEN=88 TOS=0x00 PREC=0x00 TTL=64 ID=10200 DF PROTO=UDP SPT=59214 DPT=53 LEN=68 MARK=0x40100
Interesting that the mark is 0x40100, though since it's masked, that should be OK, right?
Also curious: in the 3 hours since I posted that, I now see 2 packets made it to the netem qdisc:
$ sudo tc -s qdisc show dev weave
qdisc prio 1: root refcnt 2 bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Sent 332687884 bytes 274188 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc fq_codel 12: parent 1:2 limit 10240p flows 1024 quantum 8926 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
Sent 332413819 bytes 272917 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
maxpacket 62482 drop_overlimit 0 new_flow_count 52538 ecn_mark 0
new_flows_len 1 old_flows_len 5
qdisc netem 11: parent 1:1 limit 1000 delay 4.0ms 1.0ms
Sent 224 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
I'll dig into where the 0x40100 came from... 🤔
The nat table has a couple of chains that mark packets, but they use 0x4000 and 0x8000:
Chain KUBE-MARK-DROP (0 references)
pkts bytes target prot opt in out source destination
0 0 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 MARK or 0x8000
Chain KUBE-MARK-MASQ (617 references)
pkts bytes target prot opt in out source destination
17 1020 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000
Either way, should the 0x100 used in tc filter not match 0x40100? It should be masked, right? It's been 15 years since I've worked regularly with netfilter, so I'm feeling quite out of my depth here 😅
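(As a sanity check of the masking arithmetic: the fw classifier compares only the bits covered by the mask against the filter handle, so the extra bits in 0x40100 should not prevent a match:)

# With mask 0x100, only bit 8 of the mark is compared against the handle:
$ printf '0x%x\n' $(( 0x40100 & 0x100 ))
0x100
$ printf '0x%x\n' $(( 0x100 & 0x100 ))
0x100
# Both marks reduce to 0x100, so both should be steered into class 1:1 by the fw filter.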
Update: I do actually also see traffic marked with 0x100 in the logs:
Jun 14 00:00:59 ip-10-20-93-83 kernel: [626755.244006] IN= OUT=weave PHYSIN=vethwepl1285445 PHYSOUT=vethwe-bridge SRC=100.96.128.6 DST=100.110.0.2 LEN=104 TOS=0x00 PREC=0x00 TTL=64 ID=20324 DF PROTO=UDP SPT=34501 DPT=53 LEN=84 MARK=0x100
Jun 14 00:01:00 ip-10-20-93-83 kernel: [626756.099632] IN= OUT=weave PHYSIN=vethwepl1285445 PHYSOUT=vethwe-bridge SRC=100.96.128.6 DST=100.120.64.10 LEN=92 TOS=0x00 PREC=0x00 TTL=64 ID=20530 DF PROTO=UDP SPT=42855 DPT=53 LEN=72 MARK=0x100
Looks mostly the same. Over the period where I was logging, there were significantly more 0x40100s (35487) than 0x100s (1501).
Right, as long as our mark does not get overwritten by others, we are in the clear. Two packets in three hours... that seems crazy low, even for a pretty inactive cluster.
Indeed. It was certainly missing some.
Since then, I've upgraded the kernel to 4.19.x (from 4.9) and disabled IPv6 in the kernel. After that, I stopped observing the high DNS latencies. I've not yet been able to narrow down which piece fixed it, unfortunately.
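(For reference, disabling IPv6 in the kernel is typically done in one of the following ways; which one was used here isn't specified:)

# At runtime, via sysctl:
$ sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
$ sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
# Or at boot, by adding ipv6.disable=1 to the kernel command line (e.g. in the GRUB config),
# which prevents the IPv6 stack from being initialized at all.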