Adafruit_BusIO

chunked transfers for a big improvement in performance

Open eringerli opened this issue 2 years ago • 23 comments

Reads and writes are now prepared and transmitted in chunks instead of byte by byte. A constant is used for the chunk size.

This is a tradeoff between RAM/CPU and bus speed: on the 'bigger' Arduino platforms the CPU is a lot faster than the SPI bus and there is a lot of RAM available, so spending more RAM and CPU cycles to prepare the data and then letting the DMA do its work is the way to go.

The chunked transfers also combine the reads and writes, so the dead time in between is removed, which is especially important for register reads of SPI-attached chips.
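The combining of writes and reads described above can be sketched as follows. This is a minimal host-side illustration, not the code from this PR: `writeThenReadChunked`, `TransferFn` and `kChunkSize` are hypothetical names, and `TransferFn` stands in for a full-duplex buffer transfer like the one an Arduino core provides. The write bytes and the filler bytes for the read are treated as one stream, so the end of the write and the start of the read can share a chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

// Hypothetical chunk size for this sketch (the PR uses 32 on AVR, 64 elsewhere).
constexpr std::size_t kChunkSize = 64;

// Assumed stand-in for a full-duplex buffer transfer: sends `len` bytes from
// `buf` and overwrites them with the bytes received at the same time.
using TransferFn = std::function<void(std::uint8_t *buf, std::size_t len)>;

// Sends writeBuf followed by readLen filler bytes as one stream, in chunks of
// at most kChunkSize bytes. Bytes received during the filler part are copied
// into readBuf.
void writeThenReadChunked(const TransferFn &transfer,
                          const std::uint8_t *writeBuf, std::size_t writeLen,
                          std::uint8_t *readBuf, std::size_t readLen,
                          std::uint8_t sendValue = 0xFF) {
  std::uint8_t chunk[kChunkSize];
  const std::size_t total = writeLen + readLen;

  for (std::size_t pos = 0; pos < total; pos += kChunkSize) {
    const std::size_t n = std::min(kChunkSize, total - pos);

    // How many bytes of this chunk still belong to the write phase.
    const std::size_t nWrite =
        pos < writeLen ? std::min(n, writeLen - pos) : 0;

    if (nWrite > 0) {
      std::memcpy(chunk, writeBuf + pos, nWrite);
    }
    // The rest of the chunk clocks out filler bytes while reading.
    std::memset(chunk + nWrite, sendValue, n - nWrite);

    transfer(chunk, n);

    if (n > nWrite) {
      // Offset into readBuf of the first read byte contained in this chunk.
      const std::size_t readPos = pos + nWrite - writeLen;
      std::memcpy(readBuf + readPos, chunk + nWrite, n - nWrite);
    }
  }
}
```

With 70 bytes to write and 70 to read, this sketch issues three transfers of 64, 64 and 12 bytes instead of 140 single-byte transfers.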

Chunked transfers are about 40% faster than bytewise ones, with an additional 5% for small reads/writes as used in BusIO_Register, because the dead time between writing and reading is removed. The ESP32 can transfer buffers without an inter-byte delay; the M4 and AVR can't, as their Arduino cores transmit byte by byte. However, because the inner loop sits farther down the stack when the core's buffer transfer is used, this delay is shortened.

The AVR has a chunk size of 32 bytes, all other platforms 64. The ESP32 in particular does some chunking internally and also uses 64 bytes, so this shouldn't burden the hardware with additional overhead. The AVR and M4 don't do chunking, so the size is fairly arbitrary, but as the AVR has limited resources and most reads/writes are in the range of a typical sensor register read, 32 bytes was chosen.
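A per-platform chunk size like this could be selected at compile time roughly as shown here. This is only a sketch: `kSpiChunkSize` and `chunkCount` are made-up names, and only the values 32 and 64 come from the description above.

```cpp
#include <stddef.h>

// Hypothetical constant name; the values follow the description above.
#if defined(__AVR__)
constexpr size_t kSpiChunkSize = 32;  // limited RAM on AVR
#else
constexpr size_t kSpiChunkSize = 64;  // matches the ESP32's internal chunking
#endif

// Number of chunked transfers needed for totalBytes (rounded up).
constexpr size_t chunkCount(size_t totalBytes) {
  return (totalBytes + kSpiChunkSize - 1) / kSpiChunkSize;
}
```

The buffer costs `kSpiChunkSize` bytes of stack per call, which is the RAM side of the tradeoff.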

Logic Analyser Traces

The signal "DIO 11" is the CS line actually controlled by Adafruit_SPIDevice. "DIO 12" is set right before and cleared right after calling the member function. I measured the performance on "DIO 12" so that the call overhead is included.

ESP32

Large Combined Transfer

Here is a transfer of write_and_read() with 70 bytes to write and 70 to read. Without this PR it takes 216us to transmit, with it only 135us. This is an improvement of 37%. As the ESP32 has a special case in the code, the writing half looks solid black, with a small interruption caused by the internal chunking.

Before

wr-70+70-master

After

wr-70+70-chunk

Small Combined Transfer

Here is a transfer of write_and_read() with 2 bytes to write and 12 to read. Without this PR it takes 41us to transmit, with it only 26us. This is an improvement of 35%.

The effect of a bytewise transfer in a for() loop on performance is plainly visible. This is the case on all platforms except when writing buffers on the ESP32.

Before

wr-2+12-master

After

wr-2+12-chunked

Writing a single Large Buffer

Here is a transfer of write() with 70 bytes. There is a small performance degradation, as the chunking is done a layer up instead of in the Arduino core.

Before

70-master

After

70-chunk

Writing a single Small Buffer

Here is a transfer of write() with 12 bytes. There is a small performance degradation, at least on an ESP32.

Before

w-12-master

After

w-12-chunked

Writing two Small Buffers (2+12 bytes)

Here is a transfer of write() with 2+12 bytes. The performance is roughly the same, at least on an ESP32.

Before

2+12-master

After

2+12-chunk

Writing two Small Buffers (1+1 bytes)

Here is a transfer of write() with 1+1 bytes. The performance is roughly the same, at least on an ESP32.

Before

1+1-master

After

1+1-chunked

Feather M4 Express

Info

Here is a call to write() then write_then_read(), with a 2 byte and a 9 byte buffer. This should be a typical call from Adafruit_BusIO_Register to read some sensor. The bus speed is 10MHz. The unchunked transfer takes ~25us, the chunked one ~20us. This is an improvement of 20%.

Without chunking the inter-byte time is 1.17us; with the data prepared in chunks it is 0.5us. This is a direct result of having the Arduino core do the inner loop instead of calling through pointers and APIs. The SERCOM implementation only has a method to transfer single bytes, so the Arduino core calls this method for each byte individually. I tested adding a method for buffer transfers; this lowers the inter-byte time to 350ns but doesn't remove it. Removing it entirely can be done, but requires DMA.

Before

Screenshot_20220515_214551-m4-master

After

Screenshot_20220515_214506-m4-chunk

Feather 32u4

Info

Here is a call to write() then write_then_read(), with a 2 byte and a 9 byte buffer. This should be a typical call from Adafruit_BusIO_Register to read some sensor. As the AVR is only clocked at 8MHz, I reduced the bus speed to 2MHz. The unchunked transfer takes ~168us, the chunked one ~140us. This is an improvement of 17%.

Without chunking the inter-byte time is 8.5us; with chunking it is 1.25us. This is a direct result of having the Arduino core do the inner loop instead of calling through pointers and APIs.

Before

Screenshot_20220515_215521-avr-master

After

Screenshot_20220515_215428-avr-chunk
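For reference, the improvement figures quoted in this description are the relative time saved, (before - after) / before. The small sketch below recomputes them from the measured durations given above; the names `improvementPercent`, `Measurement` and `printImprovements` are only illustrative, and since the measurements are read off the logic analyser and rounded, the computed values can differ slightly from the percentages in the text.

```cpp
#include <cstdio>

// Relative time saved in percent: (before - after) / before * 100.
constexpr double improvementPercent(double beforeUs, double afterUs) {
  return (beforeUs - afterUs) / beforeUs * 100.0;
}

struct Measurement {
  const char *label;
  double beforeUs;  // duration without this PR, in microseconds
  double afterUs;   // duration with this PR, in microseconds
};

// Durations taken from the traces described above.
constexpr Measurement kMeasurements[] = {
    {"ESP32, 70+70 bytes", 216.0, 135.0},
    {"ESP32, 2+12 bytes", 41.0, 26.0},
    {"Feather M4, 2+9 bytes", 25.0, 20.0},
    {"Feather 32u4, 2+9 bytes", 168.0, 140.0},
};

// Prints one line per measurement with the computed improvement.
void printImprovements() {
  for (const Measurement &m : kMeasurements) {
    std::printf("%-24s %6.1f us -> %6.1f us  (%.1f%% faster)\n", m.label,
                m.beforeUs, m.afterUs,
                improvementPercent(m.beforeUs, m.afterUs));
  }
}
```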

eringerli, May 07 '22 17:05