tiny-AES-c
tiny-AES-c copied to clipboard
InvMixColumn() optimization
This will optimize GF(2^8) math in InvMixColumns().
With arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 9-2019-q4-major) 9.2.1 20191025 (release)
arm-none-eabi-gcc -mcpu=cortex-m3 --function-sections -Os -c aes.c -o aes.o
arm-none-eabi-nm -S aes.o
--- BEFORE (MULTIPLY_AS_A_FUNCTION = 0) ---
00000000 00000028 t AddRoundKey
00000000 00000082 T AES_CBC_decrypt_buffer
00000000 00000054 T AES_CBC_encrypt_buffer
00000000 0000007a T AES_CTR_xcrypt_buffer
00000000 00000014 T AES_ctx_set_iv
00000000 0000000a T AES_ECB_decrypt
00000000 0000000a T AES_ECB_encrypt
00000000 00000004 T AES_init_ctx
00000000 00000022 T AES_init_ctx_iv
00000000 00000100 t Cipher
00000000 000001ac t InvCipher
00000000 0000007c t KeyExpansion
00000100 0000000b r Rcon
0000010b 00000100 r rsbox
00000000 00000100 r sbox
00000000 00000012 t xtime
--- BEFORE (MULTIPLY_AS_A_FUNCTION = 1) ---
00000000 00000028 t AddRoundKey
00000000 00000082 T AES_CBC_decrypt_buffer
00000000 00000054 T AES_CBC_encrypt_buffer
00000000 0000007a T AES_CTR_xcrypt_buffer
00000000 00000014 T AES_ctx_set_iv
00000000 0000000a T AES_ECB_decrypt
00000000 0000000a T AES_ECB_encrypt
00000000 00000004 T AES_init_ctx
00000000 00000022 T AES_init_ctx_iv
00000000 00000100 t Cipher
00000000 00000164 t InvCipher
00000000 0000007c t KeyExpansion
00000000 00000032 t Multiply
00000100 0000000b r Rcon
0000010b 00000100 r rsbox
00000000 00000100 r sbox
00000000 00000012 t xtime
--- AFTER ---
00000000 00000028 t AddRoundKey
00000000 00000082 T AES_CBC_decrypt_buffer
00000000 00000054 T AES_CBC_encrypt_buffer
00000000 0000007a T AES_CTR_xcrypt_buffer
00000000 00000014 T AES_ctx_set_iv
00000000 0000000a T AES_ECB_decrypt
00000000 0000000a T AES_ECB_encrypt
00000000 00000004 T AES_init_ctx
00000000 00000022 T AES_init_ctx_iv
00000000 00000100 t Cipher
00000000 00000140 t InvCipher
00000000 0000007c t KeyExpansion
00000100 0000000b r Rcon
0000010b 00000100 r rsbox
00000000 00000100 r sbox
00000000 00000012 t xtime
Hi @dmitrystu and thanks for contributing this optimization with great comments! Sorry for keeping you waiting so long.
I think the original purpose of the multiply-as-a-function was for 8-bit devices. But I can't remember anymore. @DamonHD do you have any insights into the implications for the 8-bit crowd?
That's not a problem. I'll try to explain this commit to you.
No matter what's the reason to use Multiply as a function or as a macro. This commit optimizes a number of the math operations.
Each Multiply(x,y) over GF8 generally uses 10 multiply ⊗2 and 4 ⊕. In some cases, the number of operations will be reduced by the compiler.
So, originally A' = (A ⊗ 0x0e) ⊕ (B ⊗ 0x0b) ⊕ (C ⊗ 0x0d) ⊕ (D ⊗ 0x09)
Since 0x0E = 8 ⊕ 4 ⊕ 2 -> A⊗0x0E = (A⊗8) ⊕ (A⊗4) ⊕ (A⊗2)
Therefore
A' = (A⊗8 ⊕ A⊗4 ⊕ A⊗2) ⊕ (B⊗8 ⊕ B⊗2 ⊕ B) ⊕ (C⊗8 ⊕ C⊗4 ⊕ C) ⊕ (D⊗8 ⊕ D)
B' = (A⊗8 ⊕ A) ⊕ (B⊗8 ⊕ B⊗4 ⊕ B⊗2) ⊕ (C⊗8 ⊕ C⊗2 ⊕ C) ⊕ (D⊗8 ⊕ D⊗4 ⊕ D)
C' = (A⊗8 ⊕ A⊗4 ⊕ A) ⊕ (B⊗8 ⊕ B) ⊕ (C⊗8 ⊕ C⊗4 ⊕ C⊗2) ⊕ (D⊗8 ⊕ D⊗2 ⊕ D)
D' = (A⊗8 ⊕ A⊗2 ⊕ A) ⊕ (B⊗8 ⊕ B⊗4 ⊕ B) ⊕ (C⊗8 ⊕ C) ⊕ (D⊗8 ⊕ D⊗4 ⊕ D⊗2)
You will get this if you expand a Multiply(x,y)
macro for the A', B', C', D'
We have a 16 X⊗8, 8 X⊗4, 8 X⊗2, and 40 ⊕ ops.
Let's optimize something
Since (A ⊕ B) ⊕ C = A ⊕ (B ⊕ C) = A ⊕ B ⊕ C and A⊗(B ⊕ C) = (A⊗B) ⊕ (A⊗C) and A ⊕ A = 0
Let T = (A ⊕ B ⊕ C ⊕ D)⊗8 ⊕ (A ⊕ B ⊕ C ⊕ D)
Then
A' = T ⊕ (A ⊕ C)⊗4 ⊕ (A ⊕ B)⊗2 ⊕ A
B' = T ⊕ (B ⊕ D)⊗4 ⊕ (B ⊕ C)⊗2 ⊕ B
C' = T ⊕ (A ⊕ C)⊗4 ⊕ (C ⊕ D)⊗2 ⊕ C
D' = T ⊕ (B ⊕ D)⊗4 ⊕ (D ⊕ A)⊗2 ⊕ D
Let calculate T as
T = (A ⊕ B ⊕ C ⊕ D); T = T ⊕ T⊗8
And reuse (A ⊕ C)⊗4 and (B ⊕ D)⊗4
Num of ops | Old | New | Remark |
---|---|---|---|
⊗8 | 16 | 1 | xtime(xtime(xtime(X))) |
⊗4 | 8 | 2 | xtime(xtime(X)) |
⊗2 | 8 | 4 | xtime(X) |
⊕ | 40 | 18 | XOR |
Not bad, huh?
Thanks for the explanation - I was also able to follow the comments you've added in the code :) Good job
Not bad, huh?
No it's quite brilliant - thank you for contributing :)
I would still like to hear from @DamonHD w.r.t. implications for the 8-bit guys, but I think this change will be an improvement for those chips as well.
I'd say that whatever we have in our lib was what we found worked well enough*, but note that we adopted this code (the original is also checked in). I think that this is the relevant file from ours?
https://github.com/opentrv/OTAESGCM/blob/master/content/OTAESGCM/utility/OTAESGCM_OTAESGCM.cpp
Rgds
Damon
*Full packaging and encryption (or decode) of a ~63 byte frame on an ATmega328p running ~1MHz CPU is ~0.5s IIRC. We only needed to do one every couple of minutes in the most common case, and CPU was already only a tiny fraction of our energy budget, so we didn't fiddle further. Other systems' constraints will be different.
PS. I had a thought: be careful of optimising the computation if its is data- or key- dependent else you may leak bits by different run-times or power usage etc. I undid a few such reflexive optimisations once I'd thought about this issue... (Have just finished work and should give myself a break, so will not attempt to digest your code unless you shame me into doing so!)
IMO this code shouldn't be vulnerable against time-based or power consumption-based attacks because it contains no extra branches inside, just some precomputations.
IMO this code shouldn't be vulnerable against time-based or power consumption-based attacks because it contains no extra branches inside, just some precomputations.
and
PS. I had a thought: be careful of optimising the computation if its is data- or key- dependent else you may leak bits by different run-times or power usage etc. I undid a few such reflexive optimisations once I'd thought about this issue... (Have just finished work and should give myself a break, so will not attempt to digest your code unless you shame me into doing so!)
That boat sailed long ago :P Many microprocessors can't execute multiplication-instructions in constant time. On small ARM CPUs the execution time (number of cycles) for multiplication depends on how many bits are set in the operands/arguments.
From https://developer.arm.com/documentation/ddi0180/a/instruction-cycle-summary-and-interlocks/instruction-cycle-times/multiplier-cycle-counts
The number of cycles that a multiply instruction takes to complete depends on which instruction it is, and on the value of the multiplier-operand.
See https://www.bearssl.org/ctmul.html (look a bit down the page) for a more general discussion.
BUT I've read a lot on proof-of-concept timing attacks. To the best of my knowledge however, there haven't been (m)any practical attacks in the wild using this technique.
If your hardware/system mostly handles the crypto and not much else (e.g. imagine a MCU in a safe), then it could be vulnerable to having the timing and/or power measured. If you do handle other stuff (e.g. field-bus communication etc.) it's my experience that you usually introduce enough jitter to make the jitter indistinguishable from white noise -- the timing differences drown in variability from other parts of the software.
It's a super valid point though! And on "real" hardware (e.g. x86 or ARM64) you should definitely use something else than this library to get constant time primitives.
Hmm, I just tried comparing the size of the code compiled using this optimization.
It looks like the binary size increases on AVR and stays the same on ARM/Thumb.
So on AVR it doesn't look like a good optimization w.r.t. size (I would need to benchmark to test w.r.t. speed) and for ARM it doesn't make any difference. I will have to look more into this.
For AVR:
/tmp/tiny-AES-c git:(master)
$ avr-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os ; size aes.o
text data bss dec hex filename
1157 0 0 1157 485 aes.o
/tmp/tiny-AES-c git:(master)
$ cd ../tiny-AES-c-opt/
/tmp/tiny-AES-c-opt git:(feat/optimize_inv_mix)
$ avr-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os ; size aes.o
text data bss dec hex filename
1193 0 0 1193 4a9 aes.o
For ARM/Thumb
/tmp/tiny-AES-c-opt git:(feat/optimize_inv_mix)
$ arm-none-eabi-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os -mthumb ; size aes.o
text data bss dec hex filename
883 0 0 883 373 aes.o
/tmp/tiny-AES-c-opt git:(feat/optimize_inv_mix)
$ cd ../cd tiny-AES-c
/tmp/tiny-AES-c git:(master)
$ arm-none-eabi-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os -mthumb ; size aes.o
text data bss dec hex filename
883 0 0 883 373 aes.o
I tested with these versions of gcc
$ avr-gcc --version
avr-gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ arm-none-eabi-gcc --version
arm-none-eabi-gcc (15:10.3-2021.07-1) 10.3.1 20210621 (release)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
It's so strange you got a different code size for CTR on AVR. CTR doesn't use decipher. Perhaps you forgot to rebase a feature branch onto the latest master. Here is what I got with the latest master and ECB.
$ arm-none-eabi-gcc --version
rm-none-eabi-gcc (GNU Arm Embedded Toolchain 10.3-2021.07) 10.3.1 20210621 (release)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ avr-gcc --version
avr-gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ arm-none-eabi-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os -mthumb ; size aes.o
text data bss dec hex filename
1399 0 0 1399 577 aes.o
$ git checkout feat/optimize_inv_mix
Switched to branch 'feat/optimize_inv_mix'
$ arm-none-eabi-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os -mthumb ; size aes.o
text data bss dec hex filename
1299 0 0 1299 513 aes.o
$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ avr-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os ; size aes.o
text data bss dec hex filename
1593 0 0 1593 639 aes.o
$ git checkout feat/optimize_inv_mix
Switched to branch 'feat/optimize_inv_mix'
$ avr-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os ; size aes.o
text data bss dec hex filename
1471 0 0 1471 5bf aes.o
I meant to compare github.com/kokke/tiny-AES-c (branch=main) with github.com/dmitrystu/tiny-AES-c (branch=feat/optimize_inv_mix) but I might have messed up - I'll check when I get home from work ..
Thanks for checking up 👍
FYI. This commit slightly reduces code size for AVR, because size_t
on AVR is equal to uint16_t
.
BTW. Found another possible optimization for the xtime(x)
. This also guarantees constant execution time (no multiplication, nor branches).
static uint8_t xtime(uint8_t x)
{
// extend MSB to other bits using arithmetical shift right
int8_t mask = (int8_t)x >> 7;
return ((x<<1) ^ ((uint8_t)mask & 0x1b));
}
AVR:
00000058 <xtime>:
58: 98 2f mov r25, r24
5a: 99 0f add r25, r25
5c: 99 0b sbc r25, r25
5e: 9b 71 andi r25, 0x1B ; 27
60: 88 0f add r24, r24
62: 89 27 eor r24, r25
64: 08 95 ret
cortex-m0:
00000034 <xtime>:
34: 0003 movs r3, r0
36: 221b movs r2, #27
38: b240 sxtb r0, r0
3a: 11c0 asrs r0, r0, #7
3c: 4010 ands r0, r2
3e: 005b lsls r3, r3, #1
40: 4058 eors r0, r3
42: b2c0 uxtb r0, r0
44: 4770 bx lr
It seems that the AVR compiler knows this template.