
In a VMware VM environment with the vmxnet3 virtual NIC, packets forwarded to the RS in full-NAT mode are dropped and never reach the application

Open flywalker opened this issue 4 years ago • 10 comments

Symptom: as in the title. After capturing packets, I found that the IP-header checksum field of the packets arriving at the RS is 0 (screenshot: QQ截图20200529152430).

Root cause: vmxnet3 only supports TCP/UDP checksum offload, not IP checksum offload. Looking at the DPDK code, in dpdk-18.11.2 the hardware capabilities reported by vmxnet3 include the IP checksum offload flag. Tracing the DPVS transmit path further, the code only checks the TCP offload capability and then assumes IP offload must also be present, so it zeroes the IP checksum field; the virtual hardware has no IP offload capability, so the packet is sent with the checksum never filled in. The RS validates the checksum, treats the packet as corrupt and drops it, so the application never receives the forwarded packets.

This pitfall is a combined bug of dpdk-18.11.2 and DPVS. dpdk-18.11.7 has already fixed the vmxnet3 side by removing the IP checksum offload capability flag.
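For anyone who wants to double-check what their PMD claims, here is a minimal sketch (assuming the DPDK 18.11 API, where rte_eth_dev_info_get() still returns void) that dumps a port's TX checksum offload capabilities. On the buggy dpdk-18.11.2 vmxnet3 PMD the IPv4 bit shows up even though the virtual NIC cannot do it; on 18.11.7+ it is gone:

```c
/* Minimal sketch (not DPVS code): after rte_eal_init() and before
 * configuring a port, print what the PMD actually claims to support. */
#include <stdio.h>
#include <rte_ethdev.h>

static void dump_tx_csum_caps(uint16_t port_id)
{
    struct rte_eth_dev_info info;

    rte_eth_dev_info_get(port_id, &info);   /* void return in DPDK 18.11 */

    printf("port %u: IPv4 csum offload: %s, TCP csum offload: %s\n",
           port_id,
           (info.tx_offload_capa & DEV_TX_OFFLOAD_IPV4_CKSUM) ? "yes" : "no",
           (info.tx_offload_capa & DEV_TX_OFFLOAD_TCP_CKSUM)  ? "yes" : "no");
}
```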

flywalker avatar May 29 '20 07:05 flywalker

1. In ip_vs_proto_tcp.c, locate the call to ip4_phdr_cksum. After the pseudo-header checksum is computed, strip the IP checksum flag when the device has no IP checksum offload, so the later code does not zero the field:

```c
mbuf->ol_flags |= (PKT_TX_TCP_CKSUM | PKT_TX_IP_CKSUM | PKT_TX_IPV4);
th->check = ip4_phdr_cksum(iph, mbuf->ol_flags);
/* add the following two lines */
if (unlikely((dev->flag & NETIF_PORT_FLAG_TX_IP_CSUM_OFFLOAD) !=
             NETIF_PORT_FLAG_TX_IP_CSUM_OFFLOAD))
    mbuf->ol_flags &= ~PKT_TX_IP_CKSUM;
```

2. In ip_vs_proto_udp.c, locate the call to ip4_phdr_cksum and do the same: after the pseudo-header checksum is computed, strip the IP checksum flag so the later code does not zero the field:

```c
mbuf->ol_flags |= (PKT_TX_UDP_CKSUM | PKT_TX_IP_CKSUM | PKT_TX_IPV4);
uh->dgram_cksum = ip4_phdr_cksum(iph, mbuf->ol_flags);
/* add the following two lines */
if (unlikely((dev->flag & NETIF_PORT_FLAG_TX_IP_CSUM_OFFLOAD) !=
             NETIF_PORT_FLAG_TX_IP_CSUM_OFFLOAD))
    mbuf->ol_flags &= ~PKT_TX_IP_CKSUM;
```

3. In ip_vs_synproxy.c, search for NETIF_PORT_FLAG_TX_ and add a check: only set the IP checksum flag when the device actually reports IP checksum offload.

```c
/* comment out the original line:
 * mbuf->ol_flags |= (PKT_TX_TCP_CKSUM | PKT_TX_IP_CKSUM | PKT_TX_IPV4);
 * and add the following three lines */
mbuf->ol_flags |= (PKT_TX_TCP_CKSUM | PKT_TX_IPV4);
if (likely(dev->flag & NETIF_PORT_FLAG_TX_IP_CSUM_OFFLOAD))
    mbuf->ol_flags |= PKT_TX_IP_CKSUM;
```

4. The DPDK vmxnet3 driver can be fixed by comparing against 18.11.7: simply remove the IPV4_CKSUM flag from the RX/TX capabilities it reports.
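As a side note, here is a rough sketch of a more defensive variant (not part of the changes above; it assumes the same dev/mbuf/iph variables as the snippets and the DPDK 18.11 struct ipv4_hdr API): when the device lacks IP checksum offload, recompute the IPv4 header checksum in software so the packet always leaves with a valid value:

```c
#include <rte_ip.h>
#include <rte_mbuf.h>
#include "netif.h"   /* DPVS: struct netif_port, NETIF_PORT_FLAG_TX_IP_CSUM_OFFLOAD */

static inline void fix_ip_cksum(struct netif_port *dev, struct rte_mbuf *mbuf,
                                struct ipv4_hdr *iph)
{
    if (!(dev->flag & NETIF_PORT_FLAG_TX_IP_CSUM_OFFLOAD)) {
        mbuf->ol_flags &= ~PKT_TX_IP_CKSUM;   /* no HW offload available */
        iph->hdr_checksum = 0;                /* must be zero before computing */
        iph->hdr_checksum = rte_ipv4_cksum(iph);
    }
}
```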

This is only the result of my own preliminary investigation and tracing; corrections from the maintainers are welcome.

flywalker avatar May 29 '20 07:05 flywalker

Nice, verified working. dpdk-18.11.11 already fixes this issue in the vmxnet3 driver, but dpvs version 1.8-10 still has the problem; with the changes above it works. Without them even the TCP handshake fails and the RS simply drops the SYN packets.

YiminXia avatar Jan 12 '22 11:01 YiminXia

@flywalker Were you able to get full-NAT working in your vmxnet3 deployment tests? I haven't succeeded here on the latest dpdk-20.11.1; forwarding shows no activity at all.

liuflylove666 avatar Jan 26 '22 02:01 liuflylove666

> @flywalker Were you able to get full-NAT working in your vmxnet3 deployment tests? I haven't succeeded here on the latest dpdk-20.11.1; forwarding shows no activity at all.

That is exactly the problem I hit while verifying full-NAT in the VMware VM environment. Check the vmxnet3 code shipped with dpdk-20.11.1, or try a newer minor release.

flywalker avatar Feb 08 '22 00:02 flywalker

@flywalker I checked the vmxnet3 code in dpdk-20.11.1 and the checksum issue is already fixed there. In dpvs version v1.8.12 I no longer see the ip4_phdr_cksum calls, only rte_ipv4_udptcp_cksum. What changes are needed now? One-arm full-NAT in the VMware VM environment still does not work for me.

liuflylove666 avatar Feb 10 '22 07:02 liuflylove666

@flywalker Do you have code for a recent version that runs successfully on VMware?

liuflylove666 avatar Feb 11 '22 01:02 liuflylove666

```
IPVS: conn lookup: [2] TCP 10.129.39.104/80 -> 10.129.39.16/1029 hit
NEIGHBOUR: [02] neighbor (ip=10.129.39.104, mac=00:50:56:b6:ff:a3, dpdk0, que_num=0, state=2, ts=0, flag=0x1) trans state: DELAY -> REACHABLE, idx:2.
NEIGHBOUR: [03] neighbor (ip=10.129.39.104, mac=00:50:56:b6:ff:a3, dpdk0, que_num=0, state=2, ts=0, flag=0x1) trans state: PROBE -> REACHABLE, idx:1.
NEIGHBOUR: [04] neighbor (ip=10.129.39.104, mac=00:50:56:b6:ff:a3, dpdk0, que_num=0, state=2, ts=0, flag=0x1) trans state: PROBE -> REACHABLE, idx:1.
NEIGHBOUR: [01] neighbor (ip=10.129.39.104, mac=00:50:56:b6:ff:a3, dpdk0, que_num=0, state=2, ts=0, flag=0x1) trans state: PROBE -> REACHABLE, idx:1.
IPVS: conn lookup: [2] TCP 10.129.39.79/25220 -> 10.129.39.84/80 hit
IPVS: state trans: TCP in [..A.] 10.129.39.79:25220->10.129.39.104:80 state SYN_RECV->ESTABLISHED conn.refcnt 2
IPVS: conn lookup: [2] TCP 10.129.39.79/25220 -> 10.129.39.84/80 hit
IPVS: conn lookup: [2] TCP 10.129.39.104/80 -> 10.129.39.16/1029 hit
IPVS: conn lookup: [2] TCP 10.129.39.104/80 -> 10.129.39.16/1029 hit
IPVS: conn lookup: [2] TCP 10.129.39.104/80 -> 10.129.39.16/1029 hit
IPVS: conn lookup: [2] TCP 10.129.39.104/80 -> 10.129.39.16/1029 hit
IPVS: conn lookup: [2] TCP 10.129.39.104/80 -> 10.129.39.16/1029 hit
IPVS: conn lookup: [2] TCP 10.129.39.79/25220 -> 10.129.39.84/80 hit
IPVS: conn lookup: [2] TCP 10.129.39.79/25220 -> 10.129.39.84/80 hit
IPVS: conn lookup: [2] TCP 10.129.39.79/25220 -> 10.129.39.84/80 hit
IPVS: conn lookup: [2] TCP 10.129.39.79/25220 -> 10.129.39.84/80 hit
IPVS: conn lookup: [2] TCP 10.129.39.79/25220 -> 10.129.39.84/80 hit
IPVS: state trans: TCP in [.FA.] 10.129.39.79:25220->10.129.39.104:80 state ESTABLISHED->CLOSE_WAIT conn.refcnt 2
IPVS: conn lookup: [2] TCP 10.129.39.104/80 -> 10.129.39.16/1029 hit
IPVS: state trans: TCP out [.FA.] 10.129.39.79:25220->10.129.39.104:80 state CLOSE_WAIT->TIME_WAIT conn.refcnt 2
IPVS: conn lookup: [2] TCP 10.129.39.79/25220 -> 10.129.39.84/80 hit
IPVS: conn lookup: [2] TCP 10.129.39.79/25222 -> 10.129.39.84/80 miss
IPVS: new conn: [2] TCP 10.129.39.79/25222 10.129.39.84/80 10.129.39.16/1033 10.129.39.104/80 refs 2
IPVS: state trans: TCP in [S...] 10.129.39.79:25222->10.129.39.104:80 state NONE->SYN_RECV conn.refcnt 2
NEIGHBOUR: neigh_output: [2] port dpdk0, nexthop 10.129.39.104
IPVS: conn lookup: [1] TCP 10.129.39.104/80 -> 10.129.39.16/1033 miss
IPVS: tcp_conn_sched: [1] try sched non-SYN packet: [S.A.] 10.129.39.104/80->10.129.39.16/1033
IPVS: conn lookup: [2] TCP 10.129.39.79/25222 -> 10.129.39.84/80 hit
NEIGHBOUR: neigh_output: [2] port dpdk0, nexthop 10.129.39.104
IPVS: conn lookup: [1] TCP 10.129.39.104/80 -> 10.129.39.16/1033 miss
IPVS: tcp_conn_sched: [1] try sched non-SYN packet: [S.A.] 10.129.39.104/80->10.129.39.16/1033
IPVS: conn lookup: [1] TCP 10.129.39.104/80 -> 10.129.39.16/1033 miss
IPVS: tcp_conn_sched: [1] try sched non-SYN packet: [S.A.] 10.129.39.104/80->10.129.39.16/1033
IPVS: conn lookup: [2] TCP 10.129.39.79/25222 -> 10.129.39.84/80 hit
NEIGHBOUR: neigh_output: [2] port dpdk0, nexthop 10.129.39.104
IPVS: conn lookup: [1] TCP 10.129.39.104/80 -> 10.129.39.16/1033 miss
IPVS: tcp_conn_sched: [1] try sched non-SYN packet: [S.A.] 10.129.39.104/80->10.129.39.16/1033
IPVS: conn lookup: [1] TCP 10.129.39.104/80 -> 10.129.39.16/1033 miss
IPVS: tcp_conn_sched: [1] try sched non-SYN packet: [S.A.] 10.129.39.104/80->10.129.39.16/1033
IPVS: del conn: [2] TCP 10.129.39.79/25220 10.129.39.84/80 10.129.39.16/1029 10.129.39.104/80 refs 0
IPVS: del conn: [3] TCP 10.129.39.79/25200 10.129.39.84/80 10.129.39.16/1030 10.129.39.104/80 refs 0
```

Connections that hit in the lookup work; the misses do not, and most lookups miss. What could be the problem?

liuflylove666 avatar Feb 11 '22 10:02 liuflylove666

The machine's NIC does not support the flow (fdir) type DPVS needs. Either use a single forwarding worker, or turn on the conn redirect option in the configuration.
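The switch referred to here should be the redirect option under ipvs_defs/conn in dpvs.conf; roughly like the fragment below (layout follows the sample config of recent DPVS releases, so double-check against your own dpvs.conf):

```
ipvs_defs {
    conn {
        ! ... other conn options ...
        <init> redirect     on    ! redirect packets that miss locally to the owner lcore
    }
}
```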

ywc689 avatar Feb 11 '22 10:02 ywc689

@ywc689 That fixed it, thanks.

liuflylove666 avatar Feb 11 '22 10:02 liuflylove666

@ywc689 I deployed DPVS with dpdk 18.11.11 in a cloud environment where the NIC is a virtio-net-pci device. According to the DPDK patch "[v2,3/4] net/virtio: reject unsupported Rx multi queue modes", this kind of virtio-net-pci virtual device does not support multiple queue modes either, presumably because virtio does not implement the various fdir algorithms. DPDK therefore provides no filter_ctrl method for this device. After full-NAT is deployed, DPVS calls netif_fdir_filter_set, which calls dpdk_set_fdir_filt, which calls rte_eth_dev_filter_ctrl, which calls filter_ctrl; that is the call chain through which DPVS installs its lport & mask == port_base fdir rule.
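A quick way to confirm that the PMD really lacks filter_ctrl is the rte_eth_dev_filter_supported() call from the same DPDK 18.11 API (the whole filter_ctrl interface was deprecated in later releases); a minimal sketch, assuming an already-initialized port:

```c
/* Sketch (DPDK 18.11 API): check whether the port's PMD implements
 * filter_ctrl for FDIR at all.  For a PMD without filter_ctrl (as
 * described above for virtio-net-pci) this returns -ENOTSUP, so the
 * fdir rule DPVS tries to install cannot take effect. */
#include <rte_ethdev.h>
#include <rte_eth_ctrl.h>

static int fdir_supported(uint16_t port_id)
{
    int ret = rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_FDIR);

    return ret == 0;   /* 0: supported, -ENOTSUP: no filter_ctrl in the PMD */
}
```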

By rights, configuring fdir after deploying DPVS should have no effect, since virtio-net-pci has no filter_ctrl at all. But the experimental result surprised me: the return packets land exactly on the correct DPVS queue, with not a single miss. Below is the log output for the return packets.

```
IPV4: ipv4_rcv: [5] port 0 ipv4 hl 5 tos 0 tot 52 id 0 ttl 64 prot 6 src 172.31.5.103 dst 172.31.0.217
IPVS: conn lookup: [5] TCP 172.31.5.103/80 -> 172.31.0.217/1028 hit
IPV4: ipv4_rcv: [3] port 0 ipv4 hl 5 tos 0 tot 52 id 0 ttl 64 prot 6 src 172.31.5.102 dst 172.31.0.217
IPVS: conn lookup: [3] TCP 172.31.5.102/80 -> 172.31.0.217/1026 hit
IPV4: ipv4_rcv: [2] port 0 ipv4 hl 5 tos 0 tot 52 id 0 ttl 64 prot 6 src 172.31.5.100 dst 172.31.0.217
IPVS: conn lookup: [2] TCP 172.31.5.100/80 -> 172.31.0.217/1025 hit
```

mask: 0x7. lcore_id 5 maps to port_base 4; check: 1028 & 0x7 = 4, matching port_base 4. lcore_id 3 maps to port_base 2; check: 1026 & 0x7 = 2, matching port_base 2. lcore_id 2 maps to port_base 1; check: 1025 & 0x7 = 1, matching port_base 1.
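Just to make the arithmetic above concrete, here is a tiny stand-alone check (illustration only; the real mapping is done inside DPVS):

```c
/* With mask 0x7, the low three bits of the local port select the
 * owning worker's port_base. */
#include <assert.h>
#include <stdio.h>

int main(void)
{
    const unsigned mask = 0x7;

    assert((1028u & mask) == 4);  /* lcore 5 -> port_base 4 */
    assert((1026u & mask) == 2);  /* lcore 3 -> port_base 2 */
    assert((1025u & mask) == 1);  /* lcore 2 -> port_base 1 */

    printf("lport & 0x7 matches the expected port_base for all samples\n");
    return 0;
}
```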

What explains this? For a virtio virtual device like this, does it make no difference whether fdir is configured or not?

YiminXia avatar Feb 12 '22 03:02 YiminXia