modify iob to support header padding and alignment features
Summary
modify iob to support buffer alignment and head padding features
Impact
1. If CONFIG_IOB_ALIGNMENT is configured to 1 and CONFIG_IOB_HEADSIZE to 0, the iob_s initialization and memory usage are the same as before; no extra bytes are wasted.
2. To improve performance on some platforms, a zero-copy mechanism can be used on the data transmission path so that the IP packets produced by the TCP/IP stack are handed directly to the hardware for transmission. In that case the buffer may need to start on a cache-line boundary, and we may also need to provide memory for the hardware controller to fill with its own data (for example, for sending and receiving USB RNDIS packets, the USB hardware needs to put a 42-byte header in front of the original Ethernet frame). We therefore added an io_head field to the iob_s structure while keeping io_data as the data buffer. The io_head field can be configured to align to a specified memory boundary, and if its length is also an integer multiple of the cache line size, io_data ends up cache-line aligned as well.
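A minimal sketch of the resulting layout is shown below. This is illustrative only: the link and length fields are placeholders and not the exact NuttX iob_s definition, and the configuration values are example numbers.

#include <stdint.h>

#define CONFIG_IOB_ALIGNMENT 32      /* e.g. the CPU cache line size */
#define CONFIG_IOB_HEADSIZE  32      /* head room reserved for the HW controller */
#define CONFIG_IOB_BUFSIZE   1534

struct iob_s
{
  struct iob_s *io_flink;            /* link to the next iob in a chain (placeholder) */
  uint16_t      io_len;              /* bytes of data in this entry (placeholder) */
  uint16_t      io_offset;           /* data offset within io_data (placeholder) */

  /* Head room aligned to the configured boundary.  When CONFIG_IOB_HEADSIZE
   * is a multiple of the cache line size, io_data below is cache-line
   * aligned as well.  With HEADSIZE == 0 and ALIGNMENT == 1 this collapses
   * to the original layout and wastes no extra bytes.
   */

  uint8_t io_head[CONFIG_IOB_HEADSIZE] __attribute__((aligned(CONFIG_IOB_ALIGNMENT)));
  uint8_t io_data[CONFIG_IOB_BUFSIZE];  /* payload buffer used by the stacks */
};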
Testing
On the NuttX simulator, with an arbitrary io_head size and alignment configured, wget works fine. With the io_head size configured to 0 and the alignment to 1, no extra bytes are wasted.
IOB is a shared mechanism used by different stacks.
So on a multi-stack system where CONFIG_IOB_HEADSIZE > 0, all stacks are affected; maybe this is even a design limitation of the current IOB design, and it could use some refactoring/redesign.
But for the moment, isn't there any way to introduce user storage for an iob without affecting all the iobs? A rough idea from my side would be to use a union exclusively on the TCP side so the others are not affected, i.e. just use the pointer when working in the TCP stack:
union iob_s_tcp_payload
{
  uintptr_t user;                                      /* TCP-only user pointer */
  uint8_t   io_data[CONFIG_IOB_BUFSIZE - sizeof(uintptr_t)];
};

struct iob_s incoming_data;

/* Fetch iob and overlay the TCP-only view on the start of io_data */

union iob_s_tcp_payload *tcp_iob =
  (union iob_s_tcp_payload *)&incoming_data.io_data;

struct custom_struct *user = (struct custom_struct *)tcp_iob->user;

memcpy(dst, &tcp_iob->io_data[0], sizeof(tcp_iob->io_data));
If you guys could share the TCP changes that need this, I can give better input on this discussion, or maybe include this change with that TCP change.
@pkarashchenko @PetervdPerk @davids5 please review this new patch instead.
In order to support high throughput, we have designed a zero-copy solution for the data transmission path in our system. The MTU in our environment is typically 1500 bytes.
- On one core, the iob_s buffer is used directly by the hardware accelerator. We reserve 32 bytes of space in front of the 1500-byte payload buffer for the hardware accelerator to use. To ensure data consistency between RAM and the CPU cache, the reserved 32-byte space needs to be aligned to the CPU cache line size. From the data-flow perspective, the iob data sent by the TCP/IP stack is consumed directly by the hardware, and the iob data received from the hardware is handed to the TCP/IP stack; it is only copied when it needs to be delivered to the application.
- On another core, we may need to forward IP frames received from one network driver to the USB RNDIS network driver, which is also a zero-copy implementation. We do not copy the data during forwarding; when the Ethernet packet is sent to the USB hardware, a 42-byte space aligned to the CPU cache line size is reserved in front of the Ethernet frame (see the sketch below).
As mentioned above, one core needs cache line-aligned iob buffers of 1500+32 bytes, and the other core needs cache line-aligned iob buffers of 1500+42 bytes. Because iob has only one size configuration, other stacks can only use this size of iob buffer.
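A rough sketch of how a driver could use the reserved head room for the 42-byte RNDIS header, using the iob_s layout sketched under Impact above. Every function name here is hypothetical and this is not the real RNDIS driver; it only illustrates the data flow.

#include <stddef.h>
#include <stdint.h>

#define RNDIS_HDRLEN 42                    /* assumes CONFIG_IOB_HEADSIZE >= 42 */

/* Hypothetical helpers, declared only so the sketch is complete */

extern void fill_rndis_header(uint8_t *hdr, size_t framelen);
extern void usb_hw_dma_send(uint8_t *buf, size_t len);

/* 'iob' carries an Ethernet frame of 'framelen' bytes starting at io_data.
 * The RNDIS header is written into the tail of the reserved head room so
 * that header + frame form one contiguous, cache-line aligned buffer the
 * USB hardware can consume without any copy.
 */

void rndis_zero_copy_send(struct iob_s *iob, size_t framelen)
{
  uint8_t *hdr = &iob->io_head[CONFIG_IOB_HEADSIZE - RNDIS_HDRLEN];

  fill_rndis_header(hdr, framelen);
  usb_hw_dma_send(hdr, RNDIS_HDRLEN + framelen);
}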
IOB is a shared mechanism used by different stacks. So on a multi-stack system where CONFIG_IOB_HEADSIZE > 0, all stacks are affected; maybe this is even a design limitation of the current IOB design, and it could use some refactoring/redesign.
The redesign would need to extend iob_buffer, which was rejected in #6780. In the current implementation the default config keeps the size as before, but the netdev implementer can extend the header or the alignment, which is the best compromise we can find without a major re-architecture.
But for the moment, isn't there any way to introduce user storage for an iob without affecting all the iobs? A rough idea from my side would be to use a union exclusively on the TCP side so the others are not affected, i.e. just use the pointer when working in the TCP stack.
This approach inserts a uintptr_t field, but how is the real memory behind this pointer allocated? From another memory pool?
union iob_s_tcp_payload
{
  uintptr_t user;
  uint8_t   io_data[CONFIG_IOB_BUFSIZE - sizeof(uintptr_t)];
};

struct iob_s incoming_data;
union iob_s_tcp_payload *tcp_iob =
  (union iob_s_tcp_payload *)&incoming_data.io_data;    /* fetch iob */
struct custom_struct *user = (struct custom_struct *)tcp_iob->user;
memcpy(dst, &tcp_iob->io_data[0], sizeof(tcp_iob->io_data));
You can't do this, since the IP layer expects io_data to start at the same location for TCP/UDP/ICMP/ICMPv6.
If you guys could share the TCP changes that need this, I can give better input on this discussion, or maybe include this change with that TCP change.
The goal of this patch is different from https://github.com/apache/incubator-nuttx/pull/6780/commits/d6affff1241004452267921f5291cfcacc39653d
This patch addresses the netdev-specific requirements:
- Cache alignment of io_data to improve performance.
- Reserved space before io_data for the MAC layer to add the L2 header.
But the reserved head room could potentially also be used for TCP/UDP/IP out-of-band data.
A zero-copy Ethernet solution would indeed be a great addition; when implementing an ENET driver myself, I noticed as well that packets are stored twice and involve an unnecessary copy. Furthermore, hardware TCP/IP offloading (HW checksum, etc.) is also something NuttX doesn't support that well.
But wouldn't it be better to come up with a proposal to offload DMA directly into an IOB buffer and start a discussion (i.e. on the NuttX mailing list) about how we can make this portable in such a way that it works with all kinds of Ethernet MACs that have built-in DMA? I think this requires more changes to how IOB works, but also to the NuttX networking API to work with these changes.
A zero-copy Ethernet solution would indeed be a great addition; when implementing an ENET driver myself, I noticed as well that packets are stored twice and involve an unnecessary copy.
The header and alignment are part of the offloading solution. The current design keeps the old behavior by default, but gives the driver implementation that possibility.
Furthermore, hardware TCP/IP offloading (HW checksum, etc.) is also something NuttX doesn't support that well.
HW checksum is already supported; you can simply enable NET_ARCH_CHKSUM: https://github.com/apache/incubator-nuttx/blob/master/net/utils/Kconfig#L6-L26
But wouldn't it be better to come up with a proposal to offload DMA directly into an IOB buffer and start a discussion (i.e. on the NuttX mailing list)
It's better to show the code and discuss the change in a PR like this one; it's hard to discuss a proposal without code in an email thread.
about how we can make this portable in such a way that it works with all kinds of Ethernet MACs that have built-in DMA?
The offload change is conceptually simple (see the rough sketch after this list):
- The TCP/IP stack passes the IOB directly to the netdev instead of copying it into d_buf.
- The netdev asks the hardware to send the data held in the IOB and frees the IOB back to the pool when done.
- The netdev allocates IOBs and passes them to the hardware for receiving.
- The netdev passes the received IOB to ipv4_input/ipv6_input; the TCP/IP stack releases the buffer after the data has been copied to the user buffer.
Of course, to keep compatibility, the new behavior needs to be enabled by some flag.
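A very rough sketch of that flow, purely to illustrate the four steps above; the flag, driver, and helper names here are hypothetical and not an existing NuttX API.

struct iob_s;                          /* see the layout sketch above */

/* Hypothetical hardware/driver hooks used only to make the sketch complete */

extern void hw_dma_send(struct iob_s *iob);       /* DMA directly from io_data   */
extern void hw_set_rx_buffer(struct iob_s *iob);  /* give the HW an RX buffer    */
extern struct iob_s *dev_alloc_iob(void);         /* wrapper around the IOB pool */
extern void stack_input_iob(struct iob_s *iob);   /* ends in ipv4_input/ipv6_input */

#ifdef CONFIG_NETDEV_IOB_OFFLOAD       /* hypothetical enabling flag */

/* TX: the stack hands the IOB to the driver with no copy into d_buf.
 * The driver frees the IOB back to the pool from its TX-complete interrupt.
 */

void mydev_send(struct iob_s *iob)
{
  hw_dma_send(iob);
}

/* RX: the hardware DMAs each frame into an IOB owned by the driver; the
 * filled IOB is handed up unchanged and the stack releases it after the
 * payload has been copied to the user buffer.
 */

void mydev_rx_interrupt(struct iob_s *filled)
{
  stack_input_iob(filled);

  /* Refill the RX descriptor with a fresh IOB for the next frame */

  hw_set_rx_buffer(dev_alloc_iob());
}

#endif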
I think this requires more changes to how IOB works, but also to the NuttX networking API to work with these changes.
Could you point out the detailed changes required for IOB (other than the alignment, header, and section)?
@PetervdPerk-NXP @davids5 before we adopt the complex dynamic allocation of IOB buffers, the approach used in this PR keeps backward compatibility at zero cost and gives the hardware the possibility to implement offloading support.
I would suggest that we implement the offload feature with the current IOB design in the first phase, and then add dynamic allocation in the next phase if you think it's really needed.