amzn-drivers

AF_XDP zero-copy makes driver reset

Open akhota opened this issue 2 years ago • 26 comments

We are trying to compare the performance of AF_XDP sockets with regular (AF_INET) sockets on AWS EC2 instances. With ena driver version 2.7.1, a driver reset occurs when load is applied in AF_XDP zero-copy mode. The dmesg command shows the following message:

[Fri May 20 15:22:26 2022] ena 0000:00:06.0: Device reset completed successfully, Driver info: Elastic Network Adapter (ENA) v2.7.1g
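
For anyone reproducing this, a minimal sketch of how such reset lines could be picked out of captured dmesg text. The pattern is modeled only on the single line quoted above, so treating it as the general message format is an assumption, and `find_ena_resets` is a hypothetical helper name, not part of the driver or any tool.

```python
import re

# Pattern modeled on the one log line quoted in this issue; other driver
# versions may word the message differently (assumption).
RESET_RE = re.compile(
    r"ena (?P<pci>[0-9a-fA-F:.]+): Device reset completed successfully, "
    r"Driver info: Elastic Network Adapter \(ENA\) v(?P<version>\S+)"
)


def find_ena_resets(dmesg_text):
    """Return a (pci_address, driver_version) tuple for each reset line found."""
    return [(m.group("pci"), m.group("version"))
            for m in RESET_RE.finditer(dmesg_text)]


if __name__ == "__main__":
    sample = ("[Fri May 20 15:22:26 2022] ena 0000:00:06.0: Device reset "
              "completed successfully, Driver info: Elastic Network Adapter "
              "(ENA) v2.7.1g")
    print(find_ena_resets(sample))
```

The text passed in could be the saved output of dmesg; counting the returned entries before and after a load test gives a quick check of whether a reset happened during the run.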

I have attached a sample program for reproduction and the full output of dmesg.

You can reproduce this behavior by following the steps below:

Server instance

  1. Build the AF_XDP sample program
$ sudo yum install libbpf libbpf-devel clang llvm gcc bpftool kernel-headers make

$ tar xf process-udp-packet.tar.gz
$ cd process-udp-packet
$ make
  2. Configure the NIC

We are using eth1 for the test.

$ sudo ethtool -L eth1 combined 1
$ sudo ip link set dev eth1 mtu 3498
  3. Run the AF_XDP sample program

This program listens on port 13333/udp.

$ sudo ./af_xdp_user -d eth1 --filename ./af_xdp_kern.o -N -z -p

Client instance

  1. Install the iperf package
$ wget https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/i/iperf-2.1.6-2.el8.x86_64.rpm
$ sudo yum localinstall iperf-2.1.6-2.el8.x86_64.rpm
  2. Send UDP packets to the server with iperf, using 128 threads for 180 seconds
$ iperf -c <Server IP address> -p 13333 -u -l 512 -t 180 -P 128

Our environment is:

  • AMI: Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type
  • Instance type: c6i.2xlarge
  • ena driver version: 2.7.1g (built from source)
  • eth1 was added to each instance for testing, and both are connected to the same private network.

Both the server and client instances have the same specifications.

Thanks,

akhota avatar May 20 '22 08:05 akhota

Hi @akhota

Thank you for raising this issue and sharing the logs, we'll look into it and provide feedback.

davidarinzon avatar May 22 '22 10:05 davidarinzon

Hi @davidarinzon

We have updated the ena driver to 2.7.2 on our instances and retried AF_XDP zero-copy mode, but the device reset still occurred.

According to the sar command, the iperf instance receives fewer packets than the AF_XDP server instance sends (16835 vs. 25634 packets/s on eth1 in the samples below). Packets may be getting lost somewhere, and the device reset seems to occur after the packet loss.

  • the AF_XDP server instance
$ sar -n DEV 1
(snip)
09:32:41        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
09:32:42           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:32:42         eth0     18.00     11.00      1.16      2.62      0.00      0.00      0.00
09:32:42         eth1  25601.00  25634.00  13850.11  13868.42      0.00      0.00      0.00
(snip)
  • the iperf instance
$ sar -n DEV 1
(snip)
09:32:41        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
09:32:42           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:32:42         eth0      4.00      4.00      0.23      0.80      0.00      0.00      0.00
09:32:42         eth1  16835.00  25602.00   8877.43  13851.08      0.00      0.00      0.00
(snip)
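
To put a number on the gap, here is a rough sketch that computes the loss fraction from the two eth1 samples above. It assumes the two one-second sar samples cover the same interval on both instances, which the matching timestamps suggest but do not guarantee; `loss_fraction` is a hypothetical helper name.

```python
def loss_fraction(sent_pps, received_pps):
    """Fraction of packets sent by one side that the peer did not receive."""
    if sent_pps <= 0:
        raise ValueError("sent_pps must be positive")
    # Clamp at zero in case the receiver's sample window is slightly ahead.
    return max(0.0, (sent_pps - received_pps) / sent_pps)


# eth1 values taken from the two sar samples above.
server_tx_pps = 25634.00  # txpck/s on the AF_XDP server instance
client_rx_pps = 16835.00  # rxpck/s on the iperf instance

if __name__ == "__main__":
    loss = loss_fraction(server_tx_pps, client_rx_pps)
    print(f"about {loss:.1%} of the server's packets did not arrive")
```

With these two samples, roughly a third of the packets sent by the server do not show up on the iperf side.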

Thanks,

akhota avatar Jun 07 '22 09:06 akhota

@akhota thanks a lot for testing our AF XDP support implementation.

After doing some additional tests we see that the issue indeed reproduces on our machines as well and we're actively working on root-causing and solving the issue.

Will update this ticket soon with a possible solution, sorry for the inconvenience

ShayAgros avatar Jun 08 '22 12:06 ShayAgros

Hi @ShayAgros, is there anything we can help with, for example additional testing in our environment?

akhota avatar Jun 23 '22 05:06 akhota

Hi, I think I've identified all the issues with it and I'm currently testing a fix for them internally. If the current AF XDP issues block your progress, please write to me at [email protected] and I can provide you with a tentative fix.

Once the testing phase ends I'll post the fix on this thread. We also hope that the next driver release will include a new version of AF XDP support that fixes some of the wrong design assumptions made in this version. I'm sorry for the inconvenience caused by this buggy experience.

ShayAgros avatar Jun 25 '22 06:06 ShayAgros

Hi @ShayAgros,

OK, we are looking forward to the next version. Thanks,

akhota avatar Jun 27 '22 01:06 akhota

Hi, the AF XDP design will change in the next version (2.8), since some assumptions made in the current design turned out to be incorrect. We're sorry for the great inconvenience we caused by introducing this incomplete implementation.

If you'd still like to test the AF XDP implementation, you can apply the patch 0001-linux-ena-Fix-some-bugs-in-AF-XDP-support.patch.txt on top of the latest version (2.7.3), e.g. using git am ./0001-linux-ena-Fix-some-bugs-in-AF-XDP-support.patch.txt. I changed the driver version to 2.7.4 so that the modified driver can be distinguished.

By default the driver compiles without native (zero-copy) AF XDP support. To enable it, set the TEST_AF_XDP environment variable when compiling the driver, e.g. TEST_AF_XDP=1 make.

Please note that AF XDP is currently in the testing phase. We tested it thoroughly with this patch, but if you still discover issues or have a question, feel free to comment on this thread or write to my email (listed above).

ShayAgros avatar Jun 29 '22 11:06 ShayAgros

Hi @ShayAgros,

Thank you for the patch. We will apply your patch to our instances until the next version is released, and retry the AF XDP zero-copy performance test.

> Please note that AF XDP is currently in the testing phase. We tested it thoroughly with this patch, but if you still discover issues or have a question, feel free to comment on this thread or write to my email (listed above).

OK, I understand.

Thanks,

akhota avatar Jul 01 '22 07:07 akhota

Hi, Now we can develop and test our AF XDP programs in the AWS environment. Thank you very much!

I have updated the ena driver and rerun the performance test in AF XDP native zero-copy mode. The device reset did not occur with version 2.7.4. On the other hand, version 2.7.4 in native zero-copy mode appears to perform worse than version 2.7.2 in native copy mode.

The results are summarized as follows:

  • ena version 2.7.4 native zero-copy mode
packet size (bytes)  Tx rate (Gbps)  total pps (Mpps)
                 64            0.45              0.67
                128            0.78              0.66
                256            1.45              0.66
                512            2.81              0.66
               1024            3.26              0.39
               2048            4.75              0.29
               3498            7.05              0.25
  • ena version 2.7.2 native copy mode
packet size (bytes)  Tx rate (Gbps)  total pps (Mpps)
                 64            0.85              1.26
                128            1.50              1.27
                256            2.80              1.27
                512            5.31              1.25
               1024            9.69              1.16
               2048           12.61              0.76
               3498           12.58              0.45

(We used TRex for measurement and packet generation.)
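
To make the comparison easier to read, here is a small sketch that computes the zero-copy to copy-mode packet-rate ratio per packet size. The numbers are copied from the two tables above; the dictionary and function names are made up for illustration, and the packet sizes are assumed to be in bytes.

```python
# packet size (bytes) -> (Tx rate in Gbps, packet rate in Mpps),
# copied from the two result tables in this comment.
ZERO_COPY_2_7_4 = {
    64: (0.45, 0.67), 128: (0.78, 0.66), 256: (1.45, 0.66),
    512: (2.81, 0.66), 1024: (3.26, 0.39), 2048: (4.75, 0.29),
    3498: (7.05, 0.25),
}
COPY_2_7_2 = {
    64: (0.85, 1.26), 128: (1.50, 1.27), 256: (2.80, 1.27),
    512: (5.31, 1.25), 1024: (9.69, 1.16), 2048: (12.61, 0.76),
    3498: (12.58, 0.45),
}


def pps_ratio(size):
    """Zero-copy packet rate divided by copy-mode packet rate for one size."""
    return ZERO_COPY_2_7_4[size][1] / COPY_2_7_2[size][1]


if __name__ == "__main__":
    for size in sorted(ZERO_COPY_2_7_4):
        print(f"{size:>5} bytes: zero-copy / copy = {pps_ratio(size):.2f}")
```

By this measure, zero-copy mode with the patch reaches between roughly one third and a little over one half of the copy-mode packet rate, depending on packet size.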

We suppose the patch 0001-linux-ena-Fix-some-bugs-in-AF-XDP-support.patch.txt prevents device resets, but reduces performance of the AF XDP native zero-copy mode. Will the next version 2.8 improve performance?

akhota avatar Jul 13 '22 10:07 akhota

Hi @akhota

Thank you for performing the checks on the provided patch and summarizing the results. We will analyze them and provide additional information.

davidarinzon avatar Jul 20 '22 18:07 davidarinzon

Hi, we are seeing the same issue (device reset), but this thread hasn't been updated for a while. Is there a newer AF_XDP patch that provides better performance than copy mode?

Li-Xiaoyun avatar Oct 18 '22 10:10 Li-Xiaoyun

Hi @Li-Xiaoyun,

(Also answering @akhota) We're in the process of investigating the AF_XDP performance.

If needed, the patch discussed in this comment was adjusted for 2.8.0 release and is available here.

Thanks

davidarinzon avatar Oct 19 '22 07:10 davidarinzon

Thanks for the info. @davidarinzon

Li-Xiaoyun avatar Oct 19 '22 08:10 Li-Xiaoyun

Hi @davidarinzon, thanks for the patch. For your reference, I tested it in our environment. It's a simple setup with one instance running TRex and another running the DPDK AF_XDP PMD; both instances are c5n.4xlarge. The AF_XDP PMD uses 3 queues in non-busy-polling mode. Zero-copy runs stably without the device reset issue, but unfortunately its performance is still worse than copy mode: with a packet size of 1420, copy mode reaches ~15 Gbps while zero-copy reaches ~11 Gbps (2.8.0 + patch).

Li-Xiaoyun avatar Oct 19 '22 18:10 Li-Xiaoyun

@Li-Xiaoyun @akhota just wondering if either of you also measured latency (as opposed to throughput) of the zero-copy mode patch vs copy mode?

@davidarinzon have you had any success investigating the mentioned performance issue?

nirvana-msu avatar Jan 18 '23 14:01 nirvana-msu

I didn't measure latency.

Li-Xiaoyun avatar Jan 18 '23 14:01 Li-Xiaoyun

Any update on the AF_XDP zero copy changes? Looks like they've not yet been merged

pstavirs avatar Dec 14 '23 06:12 pstavirs

Hi @pstavirs,

You are correct they have not been merged. We don't currently have a specific timeline that we can share with you. Once they are released we will update this ticket.

Thanks, Arthur

akiyano avatar Dec 14 '23 14:12 akiyano

Hi, any updates?

oicnysa avatar Mar 12 '24 16:03 oicnysa

Hi @oicnysa

Thanks for reaching out. We are working on this, but have no specific timeline to share at the moment. Please stay tuned for updates.

davidarinzon avatar Mar 14 '24 06:03 davidarinzon

@akiyano maybe you have some specific branch which we could test?

oicnysa avatar Mar 22 '24 10:03 oicnysa

@oicnysa I would be interested in that as well! It would be great to test it out with openonload.

moscovium115 avatar Mar 23 '24 01:03 moscovium115

@oicnysa and @moscovium115, It's great to know there is interest in AF_XDP and we are taking this into account. However we can't currently share anything new. As @davidarinzon already said, please stay tuned for updates.

akiyano avatar Mar 23 '24 13:03 akiyano

For those who want to experiment with AF_XDP: the original patch posted in this comment was developed on top of the 2.8.0 release. An updated patch on top of the latest driver release (2.12.0) is available here; please apply it when using AF_XDP. (Please note that the patch was updated on 04/16/24.)

davidarinzon avatar Mar 28 '24 13:03 davidarinzon

@davidarinzon Amazing thank you

moscovium115 avatar Mar 29 '24 08:03 moscovium115

Hi,

Official AF_XDP support was released with 2.13.0g. You no longer need to use the patches.

@akhota and others, please let us know if you face any issues.

davidarinzon avatar Sep 16 '24 17:09 davidarinzon

Resolving this ticket, please re-open it in case you face any new issues with AF_XDP

davidarinzon avatar Oct 04 '24 17:10 davidarinzon