OpenOnload on non solarflare adapter
I feel like an idiot for asking this here, but I could not find a conclusive answer anywhere. Is it possible to run OpenOnload on non-Solarflare / Xilinx hardware? In other words, if I set up my hosting on AWS, is it possible to configure OpenOnload to work with the ENA driver that automatically comes with that machine?
Hi. Yes, Onload has an AF_XDP backend, which makes it possible to run it on any NIC.
AF_XDP support comes in 3 flavours, in increasing performance: generic, in-driver, and zero-copy. Last time I looked (admittedly a while ago) the ena driver supported neither in-driver nor zero-copy, so you're stuck with the slowest option, provided generically by Linux. Onload should still be faster than native sockets, though.
I spent a couple of hours today trying to get the setup working. I noticed that the default ENA driver provided by Amazon was a bit out of date. After updating it, compiling Onload from this repository and running it, I had onload_cp_server running. Adding my eth0 produced the following in dmesg: [ 905.128865] [sfc efhw] xdp_set_link: eth0 does not support XDP. I'm a bit stuck currently, any idea why this is happening?
I was under the impression that after updating both the kernel (Red Hat) and the ENA driver I should have XDP support. I'm also a bit unsure how to check whether my current setup supports XDP; grepping for CONFIG_XDP_SOCKET in my boot config suggests it is enabled. I was following this guide to check / enable XDP on ena: https://trying2adult.com/what-is-xdp-and-how-do-you-use-it-in-linux-amazon-ec2-example/
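A rough way to script the boot-config check described above is sketched here. It rests on a few assumptions: the upstream Kconfig symbol is CONFIG_XDP_SOCKETS (plural, which a grep for CONFIG_XDP_SOCKET also matches), the config file is assumed to live at /boot/config-<release> as it does on RHEL, and kernel_config_enabled, boot_config_path and report are names made up for this example. It only covers the kernel side and says nothing about whether the NIC driver itself implements XDP.

```python
import platform
import re
from pathlib import Path


def kernel_config_enabled(config_text: str, option: str) -> bool:
    """Return True if `option` is built in (=y) or modular (=m)."""
    pattern = re.compile(rf"^{re.escape(option)}=[ym][ \t]*$", re.MULTILINE)
    return pattern.search(config_text) is not None


def boot_config_path() -> Path:
    """Assumed location of the running kernel's config (RHEL-style layout)."""
    return Path("/boot") / f"config-{platform.release()}"


def report(options=("CONFIG_XDP_SOCKETS",)) -> None:
    """Print whether each option is enabled for the running kernel."""
    text = boot_config_path().read_text()
    for option in options:
        state = "enabled" if kernel_config_enabled(text, option) else "not enabled"
        print(f"{option}: {state}")
```

Calling report() on the instance prints one line per option. A line such as "# CONFIG_FOO is not set" counts as not enabled.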
Any idea why it looks like my eth0 isn't supporting XDP? Any help is appreciated, bit stuck atm. Thanks in advance
I'm sorry, it looks like I gave you bad info - re-checking the code it looks like we do currently need in-driver support for XDP. As I see it you have a few options:
- Give up
- Enhance Onload to make it work with the generic XDP. It's possible that this might prove difficult if, for example, the necessary kernel APIs do not exist for Onload to get the details it needs.
- Add XDP support to the ena driver. A minimal implementation may not be a massive amount of work
- Switch to an instance type with an Intel NIC. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html
- Switch to Azure or GCP, both of which also offer NICs with better drivers
OK, perhaps I ought to stop talking, since every time I (figuratively) open my mouth I give you bad advice, but I'm going to try again:
The current ena driver (in Linus's tree) does have in-driver support for XDP, so it should work with Onload. The specific error logging you gave (eth0 does not support XDP) can only be coming out if you have an ena driver without support, so I suggest you check that.
Hi Richard, first of all thanks for the detailed response! Really appreciate you reaching out on this GH issue! Lots of information already for me to try out. Bit of background info on the previously mentioned points:
For the ENA driver, I'm currently running version 2.5.0g, which should have support for XDP. So I'm either doing something incorrect when compiling and configuring the newer ENA driver, or something went south with that specific version. I'll recheck the Amazon driver repository for any directions.
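Since the open question here is whether the freshly built ENA driver is the one actually bound to eth0, one hedged way to check is to parse the output of ethtool -i, which reports the driver name and version for an interface. parse_ethtool_info and loaded_driver are hypothetical helper names, ethtool is assumed to be installed, and the exact version string depends on the driver.

```python
import subprocess


def parse_ethtool_info(output: str) -> dict:
    """Turn the `key: value` lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values like PCI addresses survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def loaded_driver(ifname: str) -> tuple:
    """Return (driver, version) for the driver currently bound to `ifname`."""
    result = subprocess.run(
        ["ethtool", "-i", ifname], capture_output=True, text=True, check=True
    )
    info = parse_ethtool_info(result.stdout)
    return info.get("driver", ""), info.get("version", "")
```

If loaded_driver("eth0") reports an older version than the one just compiled, the old module is probably still the one in use.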
One last thing I noticed is the following output when starting up OpenOnload and loading the drivers into the kernel:
unload.sh: /sbin/rmmod onload
unload.sh: /sbin/rmmod sfc_char
unload.sh: /sbin/rmmod sfc_resource
unload.sh: /sbin/rmmod sfc_driverlink
NET_OPT is
CHAR_OPT is
modprobe: FATAL: Module mtdchar not found in directory /lib/modules/4.18.0-305.el8.x86_64
ERROR: Did not find sfc_control in /proc/devices
sfc is a RELEASE driver
RESOURCE_OPT is
CHAR_OPT is
ONLOAD_OPT is
The process appears to be running fine, just not sure whether these FATAL & ERROR logs might be a potential culprit.
PS: from a different issue on the amazon driver repository (https://github.com/amzn/amzn-drivers/issues/173) I read the following:
> The driver doesn't support AF_XDP yet and the application falls back to SKB mode (kernel's generic implementation for XDP processing) which implies copying.
So even though XDP is supported by the ENA driver, it doesn't support AF_XDP yet? That might be the reason why it's currently failing.
@hjastenger Fwiw (Hi from https://github.com/amzn/amzn-drivers/issues/173), the XDP-related APIs should work on EC2 assuming you have recent kernel and ENA versions (I have tested a 5.11 kernel recently with ENA 2.5.0). It's just that it's not particularly fast at this time due to the issue you quoted.
Hi @eugeneia, thanks for the info. I'm currently using RHEL 4.18, latest kernel from RH. I've tried upgrading the kernel using something like Elrepo to 5.12.9 but this results in not being able to build onload properly. Any idea?
> not sure whether these FATAL & ERROR logs might be a potential culprit
We really ought to find some way of hiding those messages. They confuse everybody and they're absolutely harmless.
The eth0 does not support XDP comes out specifically iff dev->netdev_ops->ndo_bpf == NULL. You can see at https://github.com/amzn/amzn-drivers/blob/1c30884cea0704df0850019b15045961eca975a6/kernel/linux/ena/ena_netdev.c#L4076 that this isn't the case, so my first guess would be that ENA_XDP_SUPPORT has somehow ended up off. Perhaps try sticking a #error at that line to be sure it's getting compiled-in.
> I've tried upgrading the kernel using something like Elrepo to 5.12.9 but this results in not being able to build onload properly.
Onload builds and works with linux-5.12 if CONFIG_VDPA is not set. I've filed an internal bug for the CONFIG_VDPA issue. It usually takes a few weeks to fix such an issue. In the best case we'll get a fix next Tuesday.
Thanks for the replies all! Super helpful. I reverted my kernel upgrade back to 4.18 and swapped out the ENA for Intel (Intel(R) 10 Gigabit Virtual Function Network Driver, ixgbevf). Adding my eth0 interface no longer yields the "eth0 does not support XDP" error. It currently tells me:
[ 450.666568] [sfc efrm] efrm_nondl_register_device: register eth0
[ 450.673665] [sfc efrm] Using VI range 0+(0-1)<<0
[ 450.673667] [sfc efrm] eth0 type=4:
[ 450.678906] [sfc efrm] efrm_driverlink_resume:
[ 450.691265] [sfc efrm] ? hardware init failed (-22, attempt 1 of 1)
[ 450.691266] [sfc efrm] ?: ERROR: hardware init failed rc=-22
[ 450.697517] [sfc efrm] eth0 index=0 ifindex=2
[ 450.702383] [onload] oo_nic_add: ifindex=2 oo_index=0
I hope "hardware init failed" is due to the fact that I'm currently not using a Solarflare NIC? Running something like onload ping 8.8.8.8 still doesn't display the 'success' banner for me. It still prints the following, which makes sense if this is happening because I'm not using a Solarflare NIC:
oo:ping[6050]: netif_tcp_helper_alloc_u: ENODEV.
This error can occur if no Solarflare network interfaces
are active/UP, or they are running packed stream
firmware, are disabled or lack Onload activation keys.
Please check your configuration. To obtain activation
keys, please contact your sales representative.
oo:ping[6050]: __citp_netif_alloc: failed to construct netif (19)
oo:ping[6050]: citp_netif_alloc_and_init: failed to create netif (19)
oo:ping[6050]: citp_udp_socket: failed (errno:19) - PASSING TO OS
Have you typed echo enp1s0f0 |sudo tee /sys/module/sfc_resource/afxdp/register for your NIC?
You are referring to the step 'adding your interface' with echo ens2f0 > /sys/module/sfc_resource/afxdp/register, right? I did that for my eth0 interface and found this entry in dmesg: [ 450.666568] [sfc efrm] efrm_nondl_register_device: register eth0
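The registration step discussed here is just a write of the interface name to a sysfs file. A minimal sketch of the same thing, with a basic sanity check on the name, follows. The sysfs path is the one quoted in this thread; register_afxdp_interface is a made-up name, and the register_path parameter exists only so the function can be tried against an ordinary file. Writing to the real path needs root and a loaded sfc_resource module.

```python
from pathlib import Path

# Path quoted in this thread for registering an interface with the AF_XDP backend.
AFXDP_REGISTER = Path("/sys/module/sfc_resource/afxdp/register")


def register_afxdp_interface(ifname: str, register_path: Path = AFXDP_REGISTER) -> None:
    """Write `ifname` to the register file, like `echo ifname > register`."""
    if not ifname or "/" in ifname or any(c.isspace() for c in ifname):
        raise ValueError(f"suspicious interface name: {ifname!r}")
    # echo appends a newline, so do the same here.
    Path(register_path).write_text(ifname + "\n")
```

Use the real interface name from ip link (eth0 in this case) rather than the ens2f0 or enp1s0f0 placeholders.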
Do you see an "Accelerating ens2f0: RX 1 TX 1" message in syslog from onload_cp_server? Have you brought the interface up?
So I guess you only add ens2f0 if you have that interface, so that's why I'm adding eth0 instead of the placeholder from the docs. syslog does contain that entry:
xxx onload_cp_server[2870]: Accelerating eth0: RX 1 TX 1
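For anyone wanting to automate this check, here is a small sketch that scans log text for the onload_cp_server line shown above. The pattern is derived only from the single log line in this thread, so treat it as an assumption about the format; accelerated_interfaces is a hypothetical name, and where the log lives (for example /var/log/messages or the journal) depends on the distribution.

```python
import re

# Pattern based on the one example line in this thread; the real format may vary.
ACCEL_RE = re.compile(
    r"onload_cp_server\[\d+\]: Accelerating (\S+): RX (\d+) TX (\d+)"
)


def accelerated_interfaces(log_text: str) -> dict:
    """Map interface name -> (RX value, TX value) as reported in the log."""
    return {
        match.group(1): (int(match.group(2)), int(match.group(3)))
        for match in ACCEL_RE.finditer(log_text)
    }
```

An interface missing from the returned dict means no such message was found in the text that was passed in.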
Cool. Is the interface up?
Yes the interface is up
[xx]# cat /sys/class/net/eth0/operstate
up
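The same operstate check can be done from a script. This is only a sketch: interface_operstate is a made-up helper, and the sysfs_root parameter is there so it can be pointed at a test directory instead of the real /sys/class/net.

```python
from pathlib import Path


def interface_operstate(ifname: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return the kernel-reported operational state of `ifname`, e.g. "up"."""
    return (Path(sysfs_root) / ifname / "operstate").read_text().strip()
```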
I'm still seeing a lot of context switches with perf stat -e 'sched:sched_switch' -a -A --timeout 10000 when running something with onload, so I have the impression onload isn't functioning properly?
The canonical way to determine whether Onload is functioning properly is with the onload_stackdump tool. onload_stackdump dump while the Onloaded app is still running should print out some info for each socket that's accelerated, each under a heading vaguely akin to TCP 0:2036 lcl=192.168.138.3:48087 rmt=192.168.138.2:12345 ESTABLISHED.
Note that ping's sockets will never be accelerated by Onload (we have yet to find a use-case where anybody needed ICMP to be very efficient). For minimal testing I tend to use netcat; for performance testing iperf or its ilk are fine.
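As a rough illustration of what to look for in the onload_stackdump dump output, the sketch below picks out per-socket headings shaped like the example given above. The heading was described as only "vaguely akin to" that example, so the regular expression is an assumption, as are the UDP alternative, the reading of the two numbers as stack and endpoint ids, and the name parse_stackdump_sockets.

```python
import re

# Modelled on the single example heading in this thread; the real output may differ.
SOCKET_RE = re.compile(
    r"^(TCP|UDP) (\d+):(\d+) lcl=(\S+) rmt=(\S+)(?: (\S+))?[ \t]*$",
    re.MULTILINE,
)


def parse_stackdump_sockets(dump_text: str) -> list:
    """Return one dict per socket heading found in the dump text."""
    sockets = []
    for m in SOCKET_RE.finditer(dump_text):
        sockets.append({
            "proto": m.group(1),
            "stack_id": int(m.group(2)),     # assumed meaning of the first number
            "endpoint_id": int(m.group(3)),  # assumed meaning of the second number
            "local": m.group(4),
            "remote": m.group(5),
            "state": m.group(6),
        })
    return sockets
```

An empty list while the application is running would suggest that none of its sockets are being accelerated.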
> Note that ping's sockets will never be accelerated by Onload (we have yet to find a use-case where anybody needed ICMP to be very efficient).
I was only running it with ping to verify onload was running by checking the banner message. I read somewhere on the onload forum that you could easily check onload was functioning by checking for the success info banner.
> The canonical way to determine whether Onload is functioning properly is with the onload_stackdump tool. onload_stackdump dump while the Onloaded app is still running should print out some info for each socket that's accelerated, each under a heading vaguely akin to TCP 0:2036 lcl=192.168.138.3:48087 rmt=192.168.138.2:12345 ESTABLISHED.
I have been trying out the code used by Cloudflare in one of their blog posts (https://blog.cloudflare.com/how-to-receive-a-million-packets), but I'll try to validate this with the onload_stackdump you describe. Thanks again for the effort all!
This thread is really helpful, so thanks!
Just a small question: where can I find onload_stackdump?
~~@ronenhamias I think (assumption) he might be referring to onload_tcpdump?~~
No, I meant onload_stackdump. It should be in build/gnu_x86_64/tools/ip
thank you @rhughes-xilinx @hjastenger
Good day, gentlemen! Regarding onload_stackdump, I'm using Amazon Linux 2 with kernel 5.4.117 and ena version 2.5.0, but I'm getting a strange error during onload install, which led me nowhere:
make[2]: Entering directory `/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools'
make -C /usr/src/kernels/5.4.117-58.216.amzn2.x86_64 NDEBUG=1 GCOV= CC=cc symverfile=/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools/Module.symvers KBUILD_EXTMOD=/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools _module_/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools
make[3]: Entering directory `/usr/src/kernels/5.4.117-58.216.amzn2.x86_64'
make[4]: *** No rule to make target `_module_/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools'. Stop.
make[3]: *** [sub-make] Error 2
make[3]: Leaving directory `/usr/src/kernels/5.4.117-58.216.amzn2.x86_64'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools'
Does anybody know why this might happen?
Hello again.
Following this thread, when I run onload_stackdump dump as suggested
I get:
ci onload_stackdump should not itself be run under onload acceleration.
Am I missing anything?
> ci onload_stackdump should not itself be run under onload acceleration.
Is the question "I definitely think I'm not running this command under Onload", "I don't understand what this is saying" or "how can I tell whether I am or not?"? My first guess would be that somewhere way up your terminal you ran onload bash and forgot.
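For the "how can I tell whether I am or not?" case, one hedged heuristic is to look at LD_PRELOAD in the current shell's environment. This assumes the onload wrapper works by preloading libonload.so, which is worth verifying on your own install; running_under_onload is a hypothetical name, and this is not necessarily the check that onload_stackdump itself performs.

```python
import os


def running_under_onload(environ=None) -> bool:
    """Guess whether processes started from this environment preload Onload."""
    env = os.environ if environ is None else environ
    # LD_PRELOAD entries may be separated by colons or whitespace.
    entries = env.get("LD_PRELOAD", "").replace(":", " ").split()
    return any(os.path.basename(entry).startswith("libonload") for entry in entries)
```

If this returns True in the terminal where onload_stackdump fails, starting a fresh shell (or unsetting LD_PRELOAD) before running the tool is the obvious thing to try.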
Actually, I was trying to verify that the installation works.
But I made a mistake and was trying to install it from https://support-nic.xilinx.com/wp/onload?sd=SF-109585-LS-35&pe=SF-122921-DH-4
That version is probably for Solarflare NICs only, which would be why it's not working.
But when I try to build it from source, the build fails with this error:
./openonload/scripts/onload_install
make[2]: Entering directory `/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools'
make -C /usr/src/kernels/5.4.117-58.216.amzn2.x86_64 NDEBUG=1 GCOV= CC=cc symverfile=/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools/Module.symvers KBUILD_EXTMOD=/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools _module_/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools
make[3]: Entering directory `/usr/src/kernels/5.4.117-58.216.amzn2.x86_64'
make[4]: *** No rule to make target `_module_/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools'. Stop.
make[3]: *** [sub-make] Error 2
make[3]: Leaving directory `/usr/src/kernels/5.4.117-58.216.amzn2.x86_64'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/home/ec2-user/openonload/build/x86_64_linux-5.4.117-58.216.amzn2.x86_64/lib/citools'
still trying to figure out why it fails to build - any ideas?
Any idea how I can download that Amazon kernel, 5.4.117-58.216.amzn2? Or at least provide the full build log, please. Sometimes there are hints before the fatal error.
NB Onload is tested with Ubuntu's 5.4.0-73-generic, so I'd expect any linux-5.4 to work, or at least build. Amazon probably patches their kernel, and I have no idea how. Google refused to give me any details or a download link for that Amazon kernel. Maybe my google-fu is not sufficient.
Doesn't fully answer your question, as I am not sure 5.4 is available now, but you can download the kernel source from within a running instance:
yumdownloader --source kernel-5.10.75-79.358.amzn2
It includes the upstream source and the patches.
This may be of use for testing against the Amazon Linux 2 image as it rolls forward.