libtomcrypt
libtomcrypt copied to clipboard
Inappropriate implementation of `LTC_CLEAN_STACK` using `burn_stack()` api
Description
Some parts of LibTomCrypt
have been imported in an open source project OP-TEE
[1] for implementing crypto in software (imported code can be found here [2]). But recently we have enabled LTC_CLEAN_STACK
so that LibTomCrypt
wipes out key material and other sensitive data once no longer used.
Given stack space is quiet expensive for optee_os
and currently its limited to 2k only. So while testing on ARM64 machine with LTC_CLEAN_STACK
enabled, burn_stack()
behaved inappropriately burning stack space 2 times the actual requirement leading to corruption of stack canaries.
Looking deeper into burn_stack()
api [3], it seems to be a recursive function which assumes that only 32 bytes of stack space is utilized in every recursive call. But this assumption is incorrect as it doesn't take into account the usage of stack space during function calls. Specifically for function return pointer, stack frame pointer, any padding compiler introduces etc.
[1] https://optee.readthedocs.io/ [2] https://github.com/OP-TEE/optee_os/tree/master/core/lib/libtomcrypt/src [3] https://github.com/libtom/libtomcrypt/blob/develop/src/misc/burn_stack.c
Steps to Reproduce
Simply add a debug print to dump address of 32 bytes buffer kept on stack in burn_stack()
api as follows:
void burn_stack(unsigned long len)
{
unsigned char buf[32];
+ printf("buf addr: %p\n", buf);
zeromem(buf, sizeof(buf));
if (len > (unsigned long)sizeof(buf)) {
burn_stack(len - sizeof(buf));
}
}
Following are buf addr
dumps which are taken on ARM64 machine where burn_stack()
api is used to clean stack for _sha512_compress()
api. Size to be cleared: 724 bytes but actual size cleared from below dump: 1472 bytes.
buf addr: 0xfc0b6380
buf addr: 0xfc0b6340
buf addr: 0xfc0b6300
buf addr: 0xfc0b62c0
buf addr: 0xfc0b6280
buf addr: 0xfc0b6240
buf addr: 0xfc0b6200
buf addr: 0xfc0b61c0
buf addr: 0xfc0b6180
buf addr: 0xfc0b6140
buf addr: 0xfc0b6100
buf addr: 0xfc0b60c0
buf addr: 0xfc0b6080
buf addr: 0xfc0b6040
buf addr: 0xfc0b6000
buf addr: 0xfc0b5fc0
buf addr: 0xfc0b5f80
buf addr: 0xfc0b5f40
buf addr: 0xfc0b5f00
buf addr: 0xfc0b5ec0
buf addr: 0xfc0b5e80
buf addr: 0xfc0b5e40
buf addr: 0xfc0b5e00
Version
This could be reproduced on latest upstream develop
branch.
Suggestion
We would suggest to clean stack in corresponding function only where it is being used. As it seems one can't assume anything about what stack space has been consumed by the compiler for a given function call.
Looking at reason behind introduction of burn_stack()
api [1], it seems like in case of a lot of stack variables, clearing every variable inside function makes it look bit messier and may be bit slow due to multiple zeromem
calls.
So I have tried to come up with below approach to clear stack in corresponding function only and avoid multiple zeromem
calls:
Approach 1
void *min_addr(int arg_count, ...)
{
int i;
void *min, *a;
va_list ap;
va_start(ap, arg_count);
min = va_arg(ap, void *);
for(i = 2; i <= arg_count; i++) {
if((a = va_arg(ap, void *)) < min)
min = a;
}
va_end(ap);
return min;
}
static int sha512_compress(hash_state * md, unsigned char *buf)
{
ulong64 S[8], W[80], t0, t1;
int i;
<snip>
#ifdef LTC_CLEAN_STACK
zeromem(min_addr(5, S, W, &t0, &t1, &i),
sizeof(ulong64) * 90 + sizeof(int));
#endif
return CRYPT_OK;
}
Above approach works if compiler allocates stack variables at contiguous stack memory locations which might not be true in case padding involved which could lead to some stack memory uncleaned (equal to size of padding).
Approach 2
Multiple zeromem
calls:
static int sha512_compress(hash_state * md, unsigned char *buf)
{
ulong64 S[8], W[80], t0, t1;
int i;
<snip>
#ifdef LTC_CLEAN_STACK
zeromem(S, sizeof(S));
zeromem(W, sizeof(W));
zeromem(&t0, sizeof(t0));
zeromem(&t1, sizeof(t1));
zeromem(&i, sizeof(i));
#endif
return CRYPT_OK;
}
I would be happy to hear any feedback/suggestions regarding which of above approach seems most suitable.
[1] https://github.com/libtom/libtomcrypt/blob/3fb0eea01b181a2616407a36eadb67af3b964ffb/changes#L1283
You could also stick everything in a struct...
static int sha512_compress(hash_state * md, unsigned char *buf)
{
struct {
ulong64 S[8], W[80], t0, t1;
int i;
} sensitive;
#define S sensitive.S
#define W sensitive.W
...
#ifdef LTC_CLEAN_STACK
zeromem(&sensitive, sizeof(sensitive));
#endif
return CRYPT_OK;
}
I prefer "Approach 2" as it keeps it simple without making any assumptions about the stack layout.
Regarding performance I think either approach will beat the old burnstack approach.
@ksherlock Performance wise approach suggested by you looks better than "Approach 2". The only thing that doesn't looks good is a define
for every variable which makes function look a bit messier even if compared with "Approach 2".
Any more preferences among above 3 approaches (Approach 3: suggested by @ksherlock)? Or any further suggestions to address this issue?
i like the zeromem struct approach best. i wonder why there's not been any comment by @sjaeckel or @karel-m since this seems to be a serious issue.
i wonder why there's not been any comment by @sjaeckel or @karel-m since this seems to be a serious issue.
because I'm not certain yet on how to solve it best...
I like approach 3, but prefer 2 for its simplicity and no macro magic
I already thought about an approach 4 where we leave the burn_stack()
approach but increase the buffer size, so the overshooting isn't that severe.
I already thought about an approach 4 where we leave the burn_stack() approach but increase the buffer size, so the overshooting isn't that severe.
What should be the appropriate buffer size? Since we can't have too big buffer for APIs that only need to clean few bytes and too small buffer that caused this overshoot. Overall this approach still seems to be fragile.
I prefer #2 for its simplicity.
@karel-m 2 fine for you as well?
I do not use LTC_CLEAN_STACK
in my builds. Approach 2 is fine.
Thanks everyone for your feedback. Will use "Approach 2" to create a fix patch.
Does anyone think it's worth to keep the old approach alive? (probably as LTC_BURN_STACK
)
Does anyone think it's worth to keep the old approach alive? (probably as LTC_BURN_STACK)
if the initial analysis posted by OP is correct, the old approach uses undefined behaviour. i don't think that should be kept.
Will use "Approach 2" to create a fix patch.
explicit zeromem calls for every single (sensitive) local variable seems to be an approach that leads to bigger functions (both in code as in binary), as well as error-prone because the programmer needs to manually take care of every single variable. if we instead stuff everything into a struct, there's only a single call to zeromem, and one can simply put new vars into the struct and forget about adding another call with the right arguments.
It seems likely that we can't get agreement on one of above discussed approaches. So how about following simple change to use variable length array instead of recursive call in burn_stack
api?
diff --git a/core/lib/libtomcrypt/src/misc/burn_stack.c b/core/lib/libtomcrypt/src/misc/burn_stack.c
index c5ae8d3..1722be0 100644
--- a/core/lib/libtomcrypt/src/misc/burn_stack.c
+++ b/core/lib/libtomcrypt/src/misc/burn_stack.c
@@ -49,10 +49,8 @@
*/
void burn_stack(unsigned long len)
{
- unsigned char buf[32];
+ unsigned char buf[len];
zeromem(buf, sizeof(buf));
- if (len > (unsigned long)sizeof(buf))
- burn_stack(len - sizeof(buf));
}
It seems to resolve stack overflow issue and does stack clean operation too. This change is something that was suggested by @daniel-thompson.
Except that it still relies on undefined behavior.
using vla's is the worst possible choice. there have been countless numbers of CVE's due to it. also it doesn't make the code simpler at all, now every call site has to calculate correctly the length passed, which asks for trouble (whenever you add a variable, you have to adjust the burn_stack calls).
@rofl0r: I agree that in general VLAs should be avoided... but I don't agree that they are the worst possible choice since I think that spot is already occupied by the existing burn_stack() code ;-) !
I've got no skin in the game here but FWIW I like approach 3 best, either with macros or just making it such a common idiomatic construct across the library that nobody minds giving it a single letter variable name.
@rofl0r Agree but with a minimal change like VLA, we could improve the burn_stack()
API atleast. It was just an effort to make current code workable rather than making developer's life easier.
Anyway, let me try to work on a common variable declaration construct that may make approach 3
look more better.
Sorry for the delay, but I was busy in the last weeks.
if the initial analysis posted by OP is correct, the old approach uses undefined behaviour. i don't think that should be kept.
If that's the case then I also think that the entire burn_stack()
approach as it is implemented now should be removed, but can you please explain why this is UB?
The VLA approach would indeed be the least intrusive change, but then it uses VLA's which officially exist only since c99 ... so a clear no.
now every call site has to calculate correctly the length passed, which asks for trouble (whenever you add a variable, you have to adjust the burn_stack calls).
that's already the case, so this wouldn't change ;-)
but that's also one of the main reasons why I like approach 3, automatic determination of the size of what to clear.
Anyway, let me try to work on a common variable declaration construct that may make
approach 3
look more better.
I'm looking forward to an improved version of approach 3!