ShiftPWM
ShiftPWM copied to clipboard
Working SPI ISR optimizations with ~30% performance improvement
No functional changes, only local optimizations in the ShiftPWM.h
file.
Performance changes...
Saves 2 cycles per bit by replacing the existing loop enclosing a sequence of 8 calls to
add_one_pin_to_byte()
with the singe function send_spi_bytes()
.
Writing the whole send sequence in a single inline ASM allowed for explicit pre-decrement indexing. This optimization did not happen naturally inside the loop because of compiler limitations.
Old emitted ASM per bit:
add_one_pin_to_byte(sendbyte, counter, --ledPtr);
3e2: 20 81 ld r18, Z
3e4: 52 17 cp r21, r18
3e6: 47 95 ror r20
3e8: 31 97 sbiw r30, 0x01 ; 1
New emitted ASM per bit...
3d6: 02 90 ld r0, -Z
3d8: 20 15 cp r18, r0
3da: 37 95 ror r19
There is also a savings of 1 cycle per byte because the compiler redundantly compares the loop variable to zero after decrementing it.
Non-performance changes...
- Using an EOR against a preloaded register rather than a NEG invert the output bits instead of a compare and branch. This is performance neutral if the option is selected, and costs a single cycle per byte if it is not since the code would have been statically eliminated in the old version.
- Preloading the load balancing step factor into a register (0 or 8 depending on
ShiftPWM_balanceLoad
) and then always adding this to thecounter
on each loop pass. This is performance neutral if the option is selected, and costs a single cycle per byte if it is not since the code would have been statically eliminated in the old version.
These changes were motivated by keeping the ASM code clean and continuous. In order to allow static elimination based on a const
variable, I would have either had to...
- Break up the asm() into multiple parts. In my experience, this increases the chances that the compiler will mess up the emitted code, especially on older versions of the Arduino IDE.
- Make 4 versions of the
send_spi_bytes()
function to cover each case of the options being selected. This is optimal, but ugly.
Overall, I think the per-bit savings get the code fast enough that it is keeping up with the SPI hardware, so any additional savings would likely be wasted waiting for the SPI transmit to complete.
That said, if there was motivation to support 2X SPI mode, then there are some tricks we could use to keep up with that. Let me know if you think this is a relevant use case.
Ossilicope traces of the before and after outputs attached.
Thanks!
-josh