
Has anyone tried this on Azure....?

Open rlees85 opened this issue 5 years ago • 3 comments

Well I have.... and although everything "looks" like it's working, it is not: DNS resolution is still very poor.

So this is a long post, apologies. I don't really have a suitable daemonset to "tack" this onto, so I am using a dedicated daemonset for it:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: weave-tc
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: weave-tc
  template:
    metadata:
      labels:
        name: weave-tc
    spec:
      hostNetwork: true
      containers:
      - name: weave-tc
        image: rlees85/summit-weave-tc-temp:latest
        env:
          - name: DNSMASQ_PORT
            value: "53"
          - name: NET_OVERLAY_IF
            value: "azure0"
          - name: TARGET_DELAY_MS
            value: "10"
          - name: TARGET_SKEW_MS
            value: "1"
        securityContext:
          privileged: true
        volumeMounts:
        - name: xtables-lock
          mountPath: /run/xtables.lock
        - name: lib-tc
          mountPath: /lib/tc
          readOnly: true
      volumes:
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
      - name: lib-tc
        hostPath:
          path: /usr/lib/tc

The image rlees85/summit-weave-tc-temp:latest is pretty much the same as yours, except that the 4ms and 1ms parameters are exposed as environment variables for easier tweaking without rebuilding the image.
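
For reference, this is roughly what the parameterised part of the entrypoint boils down to, reconstructed from the log output below (a sketch; the actual script in the image may differ slightly):

# Reconstructed from the trace below, with the DaemonSet env vars substituted in
tc qdisc add dev "${NET_OVERLAY_IF}" parent 1:1 handle 11: \
  netem delay "${TARGET_DELAY_MS}ms" "${TARGET_SKEW_MS}ms" distribution pareto
iptables -A POSTROUTING -t mangle -p udp --dport "${DNSMASQ_PORT}" \
  -m string -m u32 --u32 '28 & 0xF8 = 0' --hex-string '|00001C0001|' --algo bm --from 40 \
  -j MARK --set-mark 0x100/0x100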

This is what the logs look like:

+ DNSMASQ_PORT=53
+ NET_OVERLAY_IF=azure0
+ sysctl -w 'net.core.default_qdisc=fq_codel'
net.core.default_qdisc = fq_codel
+ route
+ grep ^default
+ grep -o '[^ ]*$'
+ tc qdisc del dev azure0 root
+ grep -o '[^ ]*$'
+ grep ^default
+ route
+ tc qdisc add dev azure0 root handle 0: mq
RTNETLINK answers: Not supported
+ true
+ iptables -F POSTROUTING -t mangle
+ ip link
+ grep azure0
+ tc qdisc del dev azure0 root
+ true
+ tc qdisc add dev azure0 root handle 1: prio bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+ tc qdisc add dev azure0 parent 1:2 handle 12: fq_codel
+ tc qdisc add dev azure0 parent 1:1 handle 11: netem delay 10ms 1ms distribution pareto
+ tc filter add dev azure0 protocol all parent 1: prio 1 handle 0x100/0x100 fw flowid 1:1
+ iptables -C POSTROUTING -t mangle -p udp --dport 53 -m string -m u32 --u32 '28 & 0xF8 = 0' --hex-string '|00001C0001|' --algo bm --from 40 -j MARK --set-mark 0x100/0x100
iptables: No chain/target/match by that name.
+ iptables -A POSTROUTING -t mangle -p udp --dport 53 -m string -m u32 --u32 '28 & 0xF8 = 0' --hex-string '|00001C0001|' --algo bm --from 40 -j MARK --set-mark 0x100/0x100
+ sleep 3600

I am getting packets through the filter, which proves the marking works, I guess. But the count is quite low... this has been running for an hour or so on a quiet cluster.

/ # tc -d -s qdisc show dev azure0
qdisc prio 1: root refcnt 2 bands 2 priomap  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 Sent 69961726 bytes 117085 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc fq_codel 12: parent 1:2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn 
 Sent 69934283 bytes 116854 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
  maxpacket 44885 drop_overlimit 0 new_flow_count 18727 ecn_mark 0
  new_flows_len 1 old_flows_len 13
qdisc netem 11: parent 1:1 limit 1000 delay 10.0ms  1.0ms
 Sent 26866 bytes 229 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0

If I enable logging I can see packets are marked:

[979508.507677] IN= OUT=azure0 PHYSIN=azv977e954f886 PHYSOUT=eth0 SRC=10.102.6.136 DST=10.102.10.171 LEN=122 TOS=0x00 PREC=0x00 TTL=64 ID=9933 DF PROTO=UDP SPT=32785 DPT=53 LEN=102 MARK=0x100
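
For context, that kernel log line comes from an extra logging rule; one way such a rule could look, reusing the same match as the script and appended after the MARK rule so the MARK=0x100 field shows up (a sketch, not necessarily the exact rule I used):

# Log marked AAAA queries; must come after the MARK rule for MARK=0x100 to appear in the entry
iptables -A POSTROUTING -t mangle -p udp --dport 53 \
  -m string -m u32 --u32 '28 & 0xF8 = 0' --hex-string '|00001C0001|' --algo bm --from 40 \
  -j LOG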

But.... performance is still garbage.

url='ifconfig.co'
if [ -f /tmp/log.txt ]; then rm /tmp/log.txt; fi
for i in `seq 1 20`; do curl -w '%{time_namelookup}\n' -o /dev/null -s $url >> /tmp/log.txt; echo "${i}"; done
sort /tmp/log.txt | uniq | tail -10
0.016456
0.017804
0.023226
0.028758
0.029971
0.037986
0.039217
0.041410
2.659334
5.059728
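
To separate the DNS server itself from the parallel A+AAAA behaviour, a check along these lines can help (a sketch; dig being available in the test container is an assumption):

# Single-question lookups: dig sends one query per invocation, so they should never hit the race
time dig +short A ifconfig.co
time dig +short AAAA ifconfig.co
# The libc resolver sends A and AAAA in parallel from the same socket, which is what triggers it
time getent hosts ifconfig.co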

I've tried messing with the 4ms/1ms values right up to 100/?ms and nothing seems to improve it.

We don't have access to other options like running DNS as a daemonset. These are Azure AKS-based clusters, and unfortunately moving out of Azure or going self-managed on Azure is not an option. I just wondered if there are any more debug paths to get this workaround working?

rlees85 avatar May 13 '20 15:05 rlees85

Dear @rlees85,

Can you see insert_failed increment every time you get a 5 sec DNS resolution? The kernel race condition can't explain 2.66 seconds; it should either be really low, or 5 sec plus the base response time.

docker run --net=host --privileged --rm -it --entrypoint=watch cap10morgan/conntrack -n1 conntrack -S

Quentin-M avatar May 18 '20 19:05 Quentin-M

Thanks for replying. Interesting that this might not be my problem after all....

As I am running on Kubernetes and the weave-tc daemonset already has privileged access with host networking, I just exec'd into one of those pods and installed conntrack. I don't really get how conntrack works, but there is a long list of CPUs and only the first few had a lot of insert_failed. Because it is a very busy cluster, it was hard to determine for sure whether they were rising as I was doing DNS tests; I would need to build a quieter cluster to be sure of that.

The DNS lag was always a multiple of 2.5 seconds though: either 2.5, 5, or 7.5 seconds. So the base lag seems to be 2.5 seconds, but it can stack up higher because of the Kubernetes ndots:5 setting.
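
For reference, here is roughly why ndots:5 can stack the lag (a sketch; the nameserver IP and search domains below are typical defaults, not taken from this cluster):

cat /etc/resolv.conf
# nameserver 10.0.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5
# "ifconfig.co" has fewer than 5 dots, so it is first tried against every search domain
# (ifconfig.co.default.svc.cluster.local, ifconfig.co.svc.cluster.local, ...); each attempt
# is its own parallel A+AAAA pair that can hit the race and burn a full timeout.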

Interestingly, if I set the timeout in resolv.conf to 1 second, I get lag in multiples of 1 second.

Forcing the Alpine container to do only IPv4 or only IPv6 lookups (ping -4 ..., etc.), resolutions were fine with no lag.

Also, when running Debian containers with single-request-reopen, the problem does not appear.
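
For anyone else comparing, the glibc behaviour can be toggled from inside a running Debian-based pod (a sketch; editing resolv.conf in a live container is assumed and is lost on restart):

# Enable single-request-reopen for glibc, then repeat the timing test from above
echo "options single-request-reopen" >> /etc/resolv.conf
for i in $(seq 1 20); do curl -w '%{time_namelookup}\n' -o /dev/null -s ifconfig.co; done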

rlees85 avatar May 19 '20 20:05 rlees85

@rlees85

> As I am running on Kubernetes and the weave-tc daemonset already has privileged access with host networking, I just exec'd into one of those pods and installed conntrack. I don't really get how conntrack works, but there is a long list of CPUs and only the first few had a lot of insert_failed. Because it is a very busy cluster, it was hard to determine for sure whether they were rising as I was doing DNS tests; I would need to build a quieter cluster to be sure of that.

You should run the tool on the host itself, not inside a specific pod. Then you should see insert_failed increase as you get DNS timeouts. That's how you know the problem is related to the kernel race condition (and not to something else, like another network issue or another DNS server issue).
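
Something along these lines on the node should do it (a sketch; it assumes conntrack-tools is installed on the host, otherwise use the containerised command above):

# Watch the per-CPU stats, or sum insert_failed across CPUs, while reproducing a slow lookup
watch -n1 'conntrack -S'
conntrack -S | grep -o 'insert_failed=[0-9]*' | cut -d= -f2 | awk '{s += $1} END {print s}'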

> The DNS lag was always a multiple of 2.5 seconds though: either 2.5, 5, or 7.5 seconds. So the base lag seems to be 2.5 seconds, but it can stack up higher because of the Kubernetes ndots:5 setting. Interestingly, if I set the timeout in resolv.conf to 1 second, I get lag in multiples of 1 second.

I was going to say: the timeout is configurable, as per https://www.man7.org/linux/man-pages/man5/resolv.conf.5.html. Maybe your setup defaults to 2.5 sec timeouts.

> Forcing the Alpine container to do only IPv4 or only IPv6 lookups (ping -4 ..., etc.), resolutions were fine with no lag. Also, when running Debian containers with single-request-reopen, the problem does not appear.

Right, so it does sound like you're definitely having the problem.

> qdisc prio 1: root refcnt 2 bands 2 priomap  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>  Sent 69961726 bytes 117085 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>
> qdisc fq_codel 12: parent 1:2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
>  Sent 69934283 bytes 116854 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>   maxpacket 44885 drop_overlimit 0 new_flow_count 18727 ecn_mark 0
>   new_flows_len 1 old_flows_len 13
>
> qdisc netem 11: parent 1:1 limit 1000 delay 10.0ms  1.0ms
>  Sent 26866 bytes 229 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0

Given this output, it would seem that only 229 packets were marked as IPv6 DNS packets, whereas 117,085 packets flowed in total. Does that sound like what you'd have expected? I am just not familiar with azure0 and wonder if that's the right interface to use. Every time you do a DNS resolution via musl/glibc, you should see that number increase, meaning that it caught the packet properly.
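
One way to check that (a sketch; run the first command on the node and the second from a pod scheduled on that same node):

# On the node: watch the netem counters on the suspect interface
watch -n1 "tc -s qdisc show dev azure0 | grep -A2 'qdisc netem'"
# In a pod on that node: trigger a libc lookup (A + AAAA); the netem "Sent ... pkt" count should tick up
getent hosts kubernetes.default.svc.cluster.local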

Quentin-M avatar May 26 '20 23:05 Quentin-M