libtomcrypt icon indicating copy to clipboard operation
libtomcrypt copied to clipboard

Inappropriate implementation of `LTC_CLEAN_STACK` using `burn_stack()` api

Open b49020 opened this issue 5 years ago • 20 comments

Description

Some parts of LibTomCrypt have been imported in an open source project OP-TEE [1] for implementing crypto in software (imported code can be found here [2]). But recently we have enabled LTC_CLEAN_STACK so that LibTomCrypt wipes out key material and other sensitive data once no longer used.

Given stack space is quiet expensive for optee_os and currently its limited to 2k only. So while testing on ARM64 machine with LTC_CLEAN_STACK enabled, burn_stack() behaved inappropriately burning stack space 2 times the actual requirement leading to corruption of stack canaries.

Looking deeper into burn_stack() api [3], it seems to be a recursive function which assumes that only 32 bytes of stack space is utilized in every recursive call. But this assumption is incorrect as it doesn't take into account the usage of stack space during function calls. Specifically for function return pointer, stack frame pointer, any padding compiler introduces etc.

[1] https://optee.readthedocs.io/ [2] https://github.com/OP-TEE/optee_os/tree/master/core/lib/libtomcrypt/src [3] https://github.com/libtom/libtomcrypt/blob/develop/src/misc/burn_stack.c

Steps to Reproduce

Simply add a debug print to dump address of 32 bytes buffer kept on stack in burn_stack() api as follows:

 void burn_stack(unsigned long len)
 {
    unsigned char buf[32];
+   printf("buf addr: %p\n", buf);
    zeromem(buf, sizeof(buf));
    if (len > (unsigned long)sizeof(buf)) {
       burn_stack(len - sizeof(buf));
    }
 }

Following are buf addr dumps which are taken on ARM64 machine where burn_stack() api is used to clean stack for _sha512_compress() api. Size to be cleared: 724 bytes but actual size cleared from below dump: 1472 bytes.

buf addr: 0xfc0b6380
buf addr: 0xfc0b6340
buf addr: 0xfc0b6300
buf addr: 0xfc0b62c0
buf addr: 0xfc0b6280
buf addr: 0xfc0b6240
buf addr: 0xfc0b6200
buf addr: 0xfc0b61c0
buf addr: 0xfc0b6180
buf addr: 0xfc0b6140
buf addr: 0xfc0b6100
buf addr: 0xfc0b60c0
buf addr: 0xfc0b6080
buf addr: 0xfc0b6040
buf addr: 0xfc0b6000
buf addr: 0xfc0b5fc0
buf addr: 0xfc0b5f80
buf addr: 0xfc0b5f40
buf addr: 0xfc0b5f00
buf addr: 0xfc0b5ec0
buf addr: 0xfc0b5e80
buf addr: 0xfc0b5e40
buf addr: 0xfc0b5e00

Version

This could be reproduced on latest upstream develop branch.

Suggestion

We would suggest to clean stack in corresponding function only where it is being used. As it seems one can't assume anything about what stack space has been consumed by the compiler for a given function call.

b49020 avatar Jun 06 '19 08:06 b49020

Looking at reason behind introduction of burn_stack() api [1], it seems like in case of a lot of stack variables, clearing every variable inside function makes it look bit messier and may be bit slow due to multiple zeromem calls.

So I have tried to come up with below approach to clear stack in corresponding function only and avoid multiple zeromem calls:

Approach 1

void *min_addr(int arg_count, ...)
{
   int i;
   void *min, *a;
   va_list ap;

   va_start(ap, arg_count);
   min = va_arg(ap, void *);

   for(i = 2; i <= arg_count; i++) {
     if((a = va_arg(ap, void *)) < min)
       min = a;
   }

   va_end(ap);
   return min;
}

static int  sha512_compress(hash_state * md, unsigned char *buf)
{
     ulong64 S[8], W[80], t0, t1;
     int i;
<snip>
#ifdef LTC_CLEAN_STACK
    zeromem(min_addr(5, S, W, &t0, &t1, &i),
            sizeof(ulong64) * 90 + sizeof(int));
#endif
    return CRYPT_OK;
}

Above approach works if compiler allocates stack variables at contiguous stack memory locations which might not be true in case padding involved which could lead to some stack memory uncleaned (equal to size of padding).

Approach 2

Multiple zeromem calls:

static int  sha512_compress(hash_state * md, unsigned char *buf)
{
     ulong64 S[8], W[80], t0, t1;
     int i;
<snip>
#ifdef LTC_CLEAN_STACK
    zeromem(S, sizeof(S));
    zeromem(W, sizeof(W));
    zeromem(&t0, sizeof(t0));
    zeromem(&t1, sizeof(t1));
    zeromem(&i, sizeof(i));
#endif
    return CRYPT_OK;
}

I would be happy to hear any feedback/suggestions regarding which of above approach seems most suitable.

[1] https://github.com/libtom/libtomcrypt/blob/3fb0eea01b181a2616407a36eadb67af3b964ffb/changes#L1283

b49020 avatar Jun 06 '19 13:06 b49020

You could also stick everything in a struct...

static int  sha512_compress(hash_state * md, unsigned char *buf)
{
    struct {
        ulong64 S[8], W[80], t0, t1;
        int i;
    } sensitive;
#define S sensitive.S
#define W sensitive.W
...
#ifdef LTC_CLEAN_STACK
    zeromem(&sensitive, sizeof(sensitive));
#endif
    return CRYPT_OK;
}

ksherlock avatar Jun 06 '19 20:06 ksherlock

I prefer "Approach 2" as it keeps it simple without making any assumptions about the stack layout.

Regarding performance I think either approach will beat the old burnstack approach.

jenswi-linaro avatar Jun 07 '19 06:06 jenswi-linaro

@ksherlock Performance wise approach suggested by you looks better than "Approach 2". The only thing that doesn't looks good is a define for every variable which makes function look a bit messier even if compared with "Approach 2".

b49020 avatar Jun 07 '19 06:06 b49020

Any more preferences among above 3 approaches (Approach 3: suggested by @ksherlock)? Or any further suggestions to address this issue?

b49020 avatar Jun 10 '19 07:06 b49020

i like the zeromem struct approach best. i wonder why there's not been any comment by @sjaeckel or @karel-m since this seems to be a serious issue.

rofl0r avatar Jun 10 '19 16:06 rofl0r

i wonder why there's not been any comment by @sjaeckel or @karel-m since this seems to be a serious issue.

because I'm not certain yet on how to solve it best...

I like approach 3, but prefer 2 for its simplicity and no macro magic

I already thought about an approach 4 where we leave the burn_stack() approach but increase the buffer size, so the overshooting isn't that severe.

sjaeckel avatar Jun 10 '19 16:06 sjaeckel

I already thought about an approach 4 where we leave the burn_stack() approach but increase the buffer size, so the overshooting isn't that severe.

What should be the appropriate buffer size? Since we can't have too big buffer for APIs that only need to clean few bytes and too small buffer that caused this overshoot. Overall this approach still seems to be fragile.

b49020 avatar Jun 11 '19 05:06 b49020

I prefer #2 for its simplicity.

buggywhip avatar Jun 11 '19 05:06 buggywhip

@karel-m 2 fine for you as well?

sjaeckel avatar Jun 11 '19 06:06 sjaeckel

I do not use LTC_CLEAN_STACK in my builds. Approach 2 is fine.

karel-m avatar Jun 12 '19 08:06 karel-m

Thanks everyone for your feedback. Will use "Approach 2" to create a fix patch.

b49020 avatar Jun 12 '19 08:06 b49020

Does anyone think it's worth to keep the old approach alive? (probably as LTC_BURN_STACK)

sjaeckel avatar Jun 12 '19 08:06 sjaeckel

Does anyone think it's worth to keep the old approach alive? (probably as LTC_BURN_STACK)

if the initial analysis posted by OP is correct, the old approach uses undefined behaviour. i don't think that should be kept.

Will use "Approach 2" to create a fix patch.

explicit zeromem calls for every single (sensitive) local variable seems to be an approach that leads to bigger functions (both in code as in binary), as well as error-prone because the programmer needs to manually take care of every single variable. if we instead stuff everything into a struct, there's only a single call to zeromem, and one can simply put new vars into the struct and forget about adding another call with the right arguments.

rofl0r avatar Jun 12 '19 16:06 rofl0r

It seems likely that we can't get agreement on one of above discussed approaches. So how about following simple change to use variable length array instead of recursive call in burn_stack api?

diff --git a/core/lib/libtomcrypt/src/misc/burn_stack.c b/core/lib/libtomcrypt/src/misc/burn_stack.c
index c5ae8d3..1722be0 100644
--- a/core/lib/libtomcrypt/src/misc/burn_stack.c
+++ b/core/lib/libtomcrypt/src/misc/burn_stack.c
@@ -49,10 +49,8 @@
 */
 void burn_stack(unsigned long len)
 {
-   unsigned char buf[32];
+   unsigned char buf[len];
    zeromem(buf, sizeof(buf));
-   if (len > (unsigned long)sizeof(buf))
-      burn_stack(len - sizeof(buf));
 }

It seems to resolve stack overflow issue and does stack clean operation too. This change is something that was suggested by @daniel-thompson.

b49020 avatar Jun 27 '19 08:06 b49020

Except that it still relies on undefined behavior.

jenswi-linaro avatar Jun 27 '19 09:06 jenswi-linaro

using vla's is the worst possible choice. there have been countless numbers of CVE's due to it. also it doesn't make the code simpler at all, now every call site has to calculate correctly the length passed, which asks for trouble (whenever you add a variable, you have to adjust the burn_stack calls).

rofl0r avatar Jun 27 '19 14:06 rofl0r

@rofl0r: I agree that in general VLAs should be avoided... but I don't agree that they are the worst possible choice since I think that spot is already occupied by the existing burn_stack() code ;-) !

I've got no skin in the game here but FWIW I like approach 3 best, either with macros or just making it such a common idiomatic construct across the library that nobody minds giving it a single letter variable name.

daniel-thompson avatar Jun 27 '19 15:06 daniel-thompson

@rofl0r Agree but with a minimal change like VLA, we could improve the burn_stack() API atleast. It was just an effort to make current code workable rather than making developer's life easier.

Anyway, let me try to work on a common variable declaration construct that may make approach 3 look more better.

b49020 avatar Jun 28 '19 06:06 b49020

Sorry for the delay, but I was busy in the last weeks.

if the initial analysis posted by OP is correct, the old approach uses undefined behaviour. i don't think that should be kept.

If that's the case then I also think that the entire burn_stack() approach as it is implemented now should be removed, but can you please explain why this is UB?

The VLA approach would indeed be the least intrusive change, but then it uses VLA's which officially exist only since c99 ... so a clear no.

now every call site has to calculate correctly the length passed, which asks for trouble (whenever you add a variable, you have to adjust the burn_stack calls).

that's already the case, so this wouldn't change ;-)

but that's also one of the main reasons why I like approach 3, automatic determination of the size of what to clear.

Anyway, let me try to work on a common variable declaration construct that may make approach 3 look more better.

I'm looking forward to an improved version of approach 3!

sjaeckel avatar Jun 30 '19 09:06 sjaeckel