hermes icon indicating copy to clipboard operation
hermes copied to clipboard

On Android pthread_attr_getstack() includes the guard pages in the returned stack size

Open thestinger opened this issue 1 year ago • 18 comments

Bug Description

64-bit ARM requires 64k as the minimum guard size since it uses this as the default stack probe size to reduce the overhead of -fstack-clash-protection compared to a 4k size. GrapheneOS uses 64k as the default guard size and adjusts a guard size below 64k to 64k similarly to glibc. Hermes handles this incorrectly and crashes with a stack overflow error. This is causing apps which update to the new version to crash on GrapheneOS, reducing app compatibility. This is a severe issue and you're requiring that operating systems have a security vulnerability in order to be compatible with Hermes.

We plan on reporting the lack of the required minimum guard size as a security vulnerability in Android, therefore it should get fixed in a monthly security update. Apps using React Native / Hermes would then break with the stock OS across devices if they update to this version. We'll be reported it soon and it usually takes them a minimum of 3 months to ship a patch, although probably longer for something like this. The issue we see is you don't control when apps update and they're currently slowly updating or starting out using a broken version which they may not upgrade from for years.

Steps To Reproduce

Run apps using Hermes on one of the supported GrapheneOS devices and notice the crash from the stack overflow detection having a false positive.

https://grapheneos.org/faq#recommended-devices https://grapheneos.org/install/web

The Expected Behavior

Should take the guard size into account properly and avoid crashing. The actual guard size that's used needs to be required rather than assuming it isn't rounded up higher than 4k.

thestinger avatar Oct 12 '24 05:10 thestinger

We're working around this by going back to a 4k stack guard temporarily. We can add detection of the Hermes library and downgrade security for apps using it in the future. It's very problematic though since the ABI says the stack probe size is 64k so even if we use 4k that doesn't mean everything will use 4k.

thestinger avatar Oct 12 '24 09:10 thestinger

I am sorry, we will be happy to fix this, but can you explain in more detail what the actual problem in Hermes is? When is it crashing and how? What is the error message? Do you have a symbolicated stack trace? Do you have a minimal repro?

tmikov avatar Oct 12 '24 17:10 tmikov

This is the stack overflow detection code that's buggy:

https://github.com/facebook/hermes/blob/0ff6c2ab52062bc6ace2a0675b5367dcca5e0a05/lib/Support/OSCompatPosix.cpp#L638 https://github.com/facebook/hermes/blob/0ff6c2ab52062bc6ace2a0675b5367dcca5e0a05/include/hermes/Support/StackOverflowGuard.h#L94

It doesn't take into account the guard size in the calculations, which isn't always 4k. It works with a 4k or 16k guard but breaks on the way to 64k. 64-bit ARM on Linux has standardized a 64k guard since it's better for security but Android hasn't moved to it from 4k. GrapheneOS uses a 64k guard, but we're working on a new release which among other changes temporarily rolls back to 4k due to this issue in Hermes. If you test on a non-end-of-life Pixel with the current release of GrapheneOS you can replicate the crash. You could also manually set a 64k guard size in your code and see that it results in a crash if you want an easy way to do it without GrapheneOS. Hermes is currently using the default guard size, which is 64k rather than 4k on GrapheneOS. However, setting it manually to 4k would also get 64k since it's a minimum value just like glibc for arm64.

thestinger avatar Oct 12 '24 17:10 thestinger

The calculation wrongly thinks the stack has overflowed and Hermes aborts, wrongly thinking there's an overflow. Do not know the exact details of how it goes wrong, just that it breaks with the default guard size set to 64k instead of 4k which is definitely a bug. It's possible it just doesn't allocate large enough stacks for a 64k guard and needs to factor it in. Not sure exactly what the code is doing, not at all familiar with it.

thestinger avatar Oct 12 '24 17:10 thestinger

I am sorry, you will have to work closely with us here, since we don't have Graphene installed. Let's try to figure out what is going on.

I am looking at the code and I don't see how the guard size is relevant at all for this check. The code is pretty simple. It obtains the stack bounds, subtracts a configurable "gap" (64KB by default) for additional safety, and checks whether the stack pointer falls within that region.

So, for the check to be failing, the stack bounds must be incorrect. How is that possible? And why does the size of the guard matter at all?

tmikov avatar Oct 12 '24 17:10 tmikov

You can explicitly set the guard size to 64k when creating the threads and it should fail in the same way.

thestinger avatar Oct 12 '24 17:10 thestinger

I just tried that on arm64 Linux: https://gist.github.com/tmikov/5cc745cf16cc35118fb459b0d4414f85. It worked as expected.

We really can't do anything more about this until we can understand what the bug is.

tmikov avatar Oct 12 '24 19:10 tmikov

It needs to be done on Android. Bionic handles things differently than glibc.

thestinger avatar Oct 12 '24 19:10 thestinger

@thestinger it appears that Bionic has a bug and it includes the guard size in the reported stack size. The problem is not in the size being 64KB specifically, but that we assume that the entire reported stack size is valid. Also, the problem appears only when there is an actual stack overflow.

We will work to fix this.

tmikov avatar Oct 13 '24 00:10 tmikov

To be clear, the problem I discovered has different symptoms than what you are describing. The problem only appears when there is an actual stack overflow. Our stack checking fails to catch it, resulting in a sigsegv. Sure that is bad, but the app was already crashing.

tmikov avatar Oct 13 '24 01:10 tmikov

That's different from what I'm describing since Bionic adds space for the stack guard so there should be just as much space available with a larger stack guard and it shouldn't cause an overflow, Also, the stack checking is what's causing the abort we see rather than a stack overflow.

thestinger avatar Oct 13 '24 03:10 thestinger

@tmikov That's not what I was reporting here. I suggest not working around how Android reports the guard page as part of pthread_attr_getstack because that can be changed. It seems like the opposite of what I'm reporting which is the stack check triggering prematurely, not too late.

thestinger avatar Oct 13 '24 03:10 thestinger

@thestinger I wrote a small test app, creating a thread with 64KB stack guard size. I tested it on MacOS, Linux and Android, all arm64. It worked normally on all platforms and did not trigger a stack overflow prematurely. I do not see how any guard size can trigger a premature stack overflow. The stack overflow check is simply a comparison of the current stack pointer with the bounds of the stack. The guard size is irrelevant in that calculation.

I especially don't see how it can cause a premature SIGSEGV.

I did however discover that, only on Android, stack overflow is not correctly detected with 64KB guard size, causing a crash instead of a controlled JS exception. This is explained fully by the difference of behavior of pthread_attr_getstack(), most likely a Bionic bug.

At the moment I am satisfied that I understand what I am seeing and I have a workaround for it. I am grateful that we caught this problem. Again, it only affects code that is experiencing an actual stack overflow, causing a SIGSEGV instead of gracefully throwing a JS exception. This is not great, but that code was already broken.

If you are seeing different kinds of crashes, there must be a different explanation. Happy to work with you on it, but we need more information on order to diagnose it. Unfortunately for now we can't install GrapheneOS on our devices, so we will have to rely on you to run some experiments.

If you can run https://gist.github.com/tmikov/fc3e91a16b9debb8877e4257a551fd08 in GrapheneOS, it will be interesting to observe the behavior in these two cases:

  • stack v guard : run native recursion with 64KB guard. checking for stack overflow
  • stack m guard: create a thread with 64KB guard and manually probe the stack

tmikov avatar Oct 13 '24 06:10 tmikov

I did however discover that, only on Android, stack overflow is not correctly detected with 64KB guard size, causing a crash instead of a controlled JS exception. This is explained fully by the difference of behavior of pthread_attr_getstack(), most likely a Bionic bug.

We know exactly why this happens but changing the behavior breaks compatibility with apps. In general, behavior like this could only be changed for a new target API level. Android itself has code in ART depending on it working the way that it does. It's because they set both the internal mmap base and stack base to the mapping base instead of offsetting the stack base. It can't be easily changed due to compatibility needing to be preserved. Could try to convince them to change it for a new target API level but they have bigger things to worry about. It shouldn't break many apps and the apps it breaks should be able to deal with it since they're messing with low level, subtle details. We cannot change this behavior though.

You could change your code to work around this Bionic behavior but then your code will break if it's changed in Bionic. I would expect that to only happen with a new target API level but they could decide to do it with a new major yearly release for all apps instead. I'd expect too much breaks to do it without the new target API level.

At the moment I am satisfied that I understand what I am seeing and I have a workaround for it. I am grateful that we caught this problem. Again, it only affects code that is experiencing an actual stack overflow, causing a SIGSEGV instead of gracefully throwing a JS exception. This is not great, but that code was already broken.

It is quite possible the app we encountered depends on catching a stack overflow. Only 1 app has been reported as broken so far and it's likely much more would have broken if it caused problems outside of extreme edge cases. React Native is extremely popular but 1 user has reported 1 app stopped working after the app was updated, and it's an obscure app.

thestinger avatar Oct 13 '24 07:10 thestinger

@tmikov It does appear to be the case that this app triggers a stack overflow and normally recovers from it, but with the guard size raised to 64k it can't recover because the stack overflow check is misaligned with what Bionic actually does and allows it to hit the guard.

thestinger avatar Oct 13 '24 12:10 thestinger

It looks a lot like the developers of this app worked around a bug by catching an exception and simply not fixing it. That's why it's the only app which broke. Some of the stuff we find with our security hardening is truly strange. Still not as weird as how tons of banking apps use the 32-bit ARM syscall ABI perhaps by accident on pure 64-bit devices as part of their anti-tampering frameworks.

thestinger avatar Oct 13 '24 12:10 thestinger

Android's implementation does not appear to violate the POSIX requirements. It requires that the address range is readable and writable for setstack, not getstack. A way to work around this without hard-wiring a specific Bionic behavior would be to configure the thread with a custom mapping with a 64k guard on both ends for arm64 and 4k for 32-bit arm so that you control the layout and can make assumptions about it. I don't think there's any hope of convincing Google to change Bionic since what they're doing is not explicitly forbidden.

thestinger avatar Oct 13 '24 12:10 thestinger

Fix in https://github.com/facebook/hermes/pull/1538

tmikov avatar Oct 14 '24 00:10 tmikov