amzn-drivers
AF_XDP zero-copy mode causes driver reset
We are trying to compare the performance of an AF_XDP socket with a regular (AF_INET) socket on AWS EC2 instances. With ena driver version 2.7.1, a driver reset occurs when load is applied in AF_XDP zero-copy mode. The dmesg command shows the following message:
[Fri May 20 15:22:26 2022] ena 0000:00:06.0: Device reset completed successfully, Driver info: Elastic Network Adapter (ENA) v2.7.1g
I have attached a sample program for reproduction and the full output of dmesg.
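For anyone checking their own instance for the same symptom, the following is a minimal sketch that counts reset messages in kernel log text. The match pattern is taken from the dmesg line quoted above; the function name `count_ena_resets` is made up for illustration, and other driver versions may word the message differently.

```shell
# Count ENA "Device reset completed" lines in kernel log text read from stdin.
# The pattern mirrors the dmesg line quoted in this issue (assumption: other
# driver versions may use different wording).
count_ena_resets() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps callers running under "set -e" from aborting.
    grep -c 'ena .*Device reset completed' || true
}

# Usage (not executed here):
#   dmesg | count_ena_resets
```

A count that grows while the load is running suggests the reset is tied to the traffic rather than to driver initialization.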
You can reproduce this behavior by following the steps below:
Server instance
- build AF_XDP sample program
$ sudo yum install libbpf libbpf-devel clang llvm gcc bpftool kernel-headers make
$ tar xf process-udp-packet.tar.gz
$ cd process-udp-packet
$ make
- config NIC
We are using eth1 for test.
$ sudo ethtool -L eth1 combined 1
$ sudo ip link set dev eth1 mtu 3498
- run AF_XDP sample program
This program listens on port 13333/udp.
$ sudo ./af_xdp_user -d eth1 --filename ./af_xdp_kern.o -N -z -p
Client instance
- install iperf package
$ wget https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/i/iperf-2.1.6-2.el8.x86_64.rpm
$ sudo yum localinstall iperf-2.1.6-2.el8.x86_64.rpm
- send UDP packets to server by iperf in 128 threads for 180 seconds
$ iperf -c <Server IP address> -p 13333 -u -l 512 -t 180 -P 128
Our environment is:
- AMI: Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type
- Instance type: c6i.2xlarge
- ena driver version: 2.7.1g (build from source)
- Add eth1 to each instance for testing, and connect to the same private network.
Both the server and client instances have the same specifications.
Thanks,
Hi @akhota
Thank you for raising this issue and sharing the logs, we'll look into it and provide feedback.
Hi @davidarinzon
We have updated ena driver to 2.7.2 on our instances and retried AF_XDP zero-copy mode, but device reset still occurred.
According to the sar command, the number of packets received by the iperf instance is smaller than the number of packets sent by the AF_XDP server instance. Packets seem to be lost somewhere, and the device reset appears to occur after the packet loss starts.
- the AF_XDP server instance
$ sar -n DEV 1
(snip)
09:32:41 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
09:32:42 lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:32:42 eth0 18.00 11.00 1.16 2.62 0.00 0.00 0.00
09:32:42 eth1 25601.00 25634.00 13850.11 13868.42 0.00 0.00 0.00
(snip)
- the iperf instance
$ sar -n DEV 1
(snip)
09:32:41 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
09:32:42 lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:32:42 eth0 4.00 4.00 0.23 0.80 0.00 0.00 0.00
09:32:42 eth1 16835.00 25602.00 8877.43 13851.08 0.00 0.00 0.00
(snip)
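To put a number on the gap between the two sar samples, here is a small sketch that computes the loss percentage from a sent and a received packet rate. The helper name `loss_pct` is hypothetical, and the two samples come from different hosts whose one-second windows are not synchronized, so the result is only an estimate.

```shell
# loss_pct SENT RECEIVED -> percentage of sent packets that did not arrive
loss_pct() {
    awk -v sent="$1" -v recv="$2" 'BEGIN {
        if (sent <= 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.2f\n", (sent - recv) * 100 / sent
    }'
}

# Server eth1 txpck/s vs. iperf instance eth1 rxpck/s at 09:32:42
loss_pct 25634 16835
```

With the values above this prints 34.33, i.e. roughly a third of the packets sent by the server did not show up on the iperf instance in that interval.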
Thanks,
@akhota thanks a lot for testing our AF XDP support implementation.
After doing some additional tests we see that the issue indeed reproduces on our machines as well and we're actively working on root-causing and solving the issue.
Will update this ticket soon with a possible solution, sorry for the inconvenience
Hi @ShayAgros, Is there something that we can help you with? For example, additional testing in our environment.
Hi, I think I've identified all the issues with it and I'm currently testing a fix for them internally. If the current AF XDP issues block your progress, please write me at [email protected] and I can provide you with a tentative fix for the issues at hand.
Once the testing phase ends I'll post the fix on this thread. We also hope that the next driver release will include a new version of AF XDP support that fixes some of the wrong design assumptions made in this version. I'm sorry for the inconvenience caused by this buggy experience.
Hi @ShayAgros,
OK, we are looking forward to the next version. Thanks,
Hi, The AF XDP design would change by the next version (2.8) since some assumptions made with the current design were discovered to be incorrect. We're sorry for the great inconvenience we caused by introducing this incomplete implementation.
If you'd still like to test the AF XDP implementation, you can use the patch 0001-linux-ena-Fix-some-bugs-in-AF-XDP-support.patch.txt on top of the latest current version (2.7.3) (e.g. using git am ./0001-linux-ena-Fix-some-bugs-in-AF-XDP-support.patch.txt). I modified the driver version to 2.7.4 to allow distinguishing the modified driver.
By default the driver would compile without native (zero-copy) AF XDP support. To enable it please specify TEST_AF_XDP envar when compiling the driver, e.g. TEST_AF_XDP=1 make.
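Put together, the patch and build steps might look like the sketch below. It rests on several assumptions that are not stated in this thread: the repository is cloned into ./amzn-drivers and checked out at the 2.7.3 release, the Linux driver sources live under kernel/linux/ena, and the patch file sits in the current directory. The function name `build_patched_ena` is made up for illustration.

```shell
# Sketch only: apply the test patch and build ena with zero-copy AF XDP enabled.
# Assumed layout: ./amzn-drivers (at 2.7.3) with sources in kernel/linux/ena,
# and the patch file in the directory this function is called from.
build_patched_ena() {
    patch_file=$(pwd)/0001-linux-ena-Fix-some-bugs-in-AF-XDP-support.patch.txt
    (
        cd amzn-drivers &&
        git am "$patch_file" &&
        cd kernel/linux/ena &&
        TEST_AF_XDP=1 make &&
        # The patched module should report version 2.7.4
        modinfo ./ena.ko | grep '^version'
    )
}

# Usage (not executed here):
#   build_patched_ena
```

The function only builds the module; replacing the running ena driver can drop network connectivity on the instance, so that step is deliberately left out.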
Please note that the AF XDP is currently in testing phase. We tested it thoroughly with this patch, but if still some issues are discovered or if you have a question then feel free to comment on this thread or write me to my email (listed above)
Hi @ShayAgros,
Thank you for the patch. We will apply your patch to our instances until the next version is released, and retry the AF XDP zero-copy performance test.
> Please note that the AF XDP is currently in testing phase. We tested it thoroughly with this patch, but if still some issues are discovered or if you have a question then feel free to comment on this thread or write me to my email (listed above)
OK, I understand.
Thanks,
Hi, Now we can develop and test our AF XDP programs in the AWS environment. Thank you very much!
I have updated the ena driver and reran the performance test in AF XDP native zero-copy mode. No device reset occurred with version 2.7.4. On the other hand, the performance of version 2.7.4 in native zero-copy mode appears to be lower than that of version 2.7.2 in native copy mode.
The results are summarized as follows:
- ena version 2.7.4 native zero-copy mode
| packet size | Tx rate (Gbps) | total pps (Mpps) |
|---|---|---|
| 64 | 0.45 | 0.67 |
| 128 | 0.78 | 0.66 |
| 256 | 1.45 | 0.66 |
| 512 | 2.81 | 0.66 |
| 1024 | 3.26 | 0.39 |
| 2048 | 4.75 | 0.29 |
| 3498 | 7.05 | 0.25 |
- ena version 2.7.2 native copy mode
| packet size | Tx rate (Gbps) | total pps (Mpps) |
|---|---|---|
| 64 | 0.85 | 1.26 |
| 128 | 1.50 | 1.27 |
| 256 | 2.80 | 1.27 |
| 512 | 5.31 | 1.25 |
| 1024 | 9.69 | 1.16 |
| 2048 | 12.61 | 0.76 |
| 3498 | 12.58 | 0.45 |
(We used TRex for measurement and packet generation.)
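To make the gap easier to read, the following sketch divides the zero-copy packet rate by the copy-mode packet rate for each packet size, using the Mpps columns of the two tables. The function name `zc_vs_copy` is hypothetical, and the two runs differ in driver version as well as mode, so the ratio reflects both changes together.

```shell
# Columns: packet size, zero-copy Mpps (2.7.4), copy Mpps (2.7.2),
# copied from the two result tables in this comment.
zc_vs_copy() {
    awk '{ printf "%s %.2f\n", $1, $2 / $3 }' <<'EOF'
64 0.67 1.26
128 0.66 1.27
256 0.66 1.27
512 0.66 1.25
1024 0.39 1.16
2048 0.29 0.76
3498 0.25 0.45
EOF
}

# Print "packet-size ratio" pairs, one per line
zc_vs_copy
```

The ratios come out between roughly 0.34 (1024-byte packets) and 0.56 (3498-byte packets), so zero-copy mode reached about a third to a bit over half of the copy-mode packet rate in these runs.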
We suppose the patch 0001-linux-ena-Fix-some-bugs-in-AF-XDP-support.patch.txt prevents device resets, but reduces performance of the AF XDP native zero-copy mode.
Will the next version 2.8 improve performance?
Hi @akhota
Thank you for performing the checks on the provided patch and summarizing the results. We will analyze them and provide additional information.
Hi, we were seeing the same issue (device reset), but this issue hasn't been updated for a while. Is there any new AF_XDP patch that provides better performance than copy mode?
Hi @Li-Xiaoyun,
(Also answering @akhota) We're in the process of investigating the AF_XDP performance.
If needed, the patch discussed in this comment was adjusted for 2.8.0 release and is available here.
Thanks
Thanks for the info. @davidarinzon
Hi @davidarinzon, thanks for the patch. Just for your reference, I've tested it in our environment. It's a simple setup with one instance running TRex and another running the DPDK AF_XDP PMD. Both instances are c5n.4xlarge. The AF_XDP PMD uses 3 queues in non-busy-polling mode. Zero-copy works stably with no device resets, but unfortunately performance is still worse than copy mode: with a packet size of 1420, copy mode reaches ~15 Gbps while zero-copy reaches ~11 Gbps. (2.8.0+patch)
@Li-Xiaoyun @akhota just wondering if either of you also measured latency (as opposed to throughput) of the zero-copy mode patch vs copy mode?
@davidarinzon have you had any success investigating the mentioned performance issue?
I didn't measure latency.
Any update on the AF_XDP zero copy changes? Looks like they've not yet been merged
Hi @pstavirs,
You are correct they have not been merged. We don't currently have a specific timeline that we can share with you. Once they are released we will update this ticket.
Thanks, Arthur
Hi, any updates?
Hi @oicnysa
Thanks for reaching out. We are working on this, but have no specific timeline to share at the moment. Please stay tuned for updates.
@akiyano maybe you have some specific branch which we could test?
@oicnysa I would be interested in that as well! Would be great to test out with openonload
@oicnysa and @moscovium115, It's great to know there is interest in AF_XDP and we are taking this into account. However we can't currently share anything new. As @davidarinzon already said, please stay tuned for updates.
For those who want to experiment with AF_XDP, the original patch posted in this comment was developed on top of 2.8.0 release. An updated patch on top of the latest driver release (2.12.0) is available here, please apply it when using AF_XDP. (Please note that the patch has been updated on 04/16/24).
@davidarinzon Amazing thank you
Hi,
Official AF_XDP support was released with 2.13.0g. You no longer need to use the patches.
@akhota and others, please let us know if you face any issues.
Resolving this ticket, please re-open it in case you face any new issues with AF_XDP