
`pthread_join` failed for sampling thread shutdown: Invalid argument

Open kahest opened this issue 1 year ago • 10 comments

Description

Potential crash when stopping Profiling; more context in the internal customer case doc.

This is likely an issue inside of the Android Profiler that our SDK uses for Profiling, similar to https://github.com/getsentry/sentry-java/issues/2604.

The stack trace associated with this issue looks like this:

/apex/com.android.runtime/lib64/bionic/libc.so (abort+164)
/apex/com.android.art/lib64/libart.so (art::Runtime::Abort(char const*)+904)
/apex/com.android.art/lib64/libbase.so (android::base::SetAborter(std::__1::function<void (char const*)>&&)::$_0::__invoke(char const*)+80)
/apex/com.android.art/lib64/libbase.so (android::base::LogMessage::~LogMessage()+352)
/apex/com.android.art/lib64/libart.so (art::Trace::StopTracing(bool, bool)+1568)
...

kahest avatar Jul 10 '24 11:07 kahest

The report we got for this is from an RN app running on Android, which might be a red herring, but we should be aware of it. FYI @krystofwoldrich

kahest avatar Aug 06 '24 08:08 kahest

Quick update regarding the first stack trace:

What happened here is that

  • the Java side requested that tracing be stopped,
  • but when Trace::StopTracing() wants to join the sampling thread (this is where it aborts [1]), the thread was either already joined or running detached, because we got an "Invalid argument" == EINVAL [2] (a small model of that check follows this list).
  • It cannot be running detached, because the thread would only be detached if thread creation had failed [3]; in that case, we'd also get an abort when invoking Trace::Start() [4].
  • An already joined sampling thread would mean another thread invoked Trace::StopTracing() before the previous call reset the the_trace_ and sample_thread_ states [5], which act as early-exit guards when entering the function.
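
The snippet below is a minimal, self-contained model of the join-state check referenced in [2]; it is not the actual bionic source, and try_join and the enum here are stand-ins. It shows the behavior described there: once the thread is already joined or detached, a further join attempt reports EINVAL, which is the "Invalid argument" that the runtime then turns into the abort seen above.

#include <atomic>
#include <cerrno>
#include <iostream>

enum ThreadJoinState { THREAD_NOT_JOINED, THREAD_JOINED, THREAD_DETACHED };

// Minimal stand-in for the join-state check: only the first caller can move
// the state from NOT_JOINED to JOINED; every later caller gets EINVAL.
int try_join(std::atomic<ThreadJoinState>& join_state) {
  ThreadJoinState old_state = THREAD_NOT_JOINED;
  while (old_state == THREAD_NOT_JOINED &&
         !join_state.compare_exchange_weak(old_state, THREAD_JOINED)) {
  }
  if (old_state == THREAD_DETACHED || old_state == THREAD_JOINED) {
    return EINVAL;  // "Invalid argument": already joined or running detached
  }
  return 0;         // first join wins (waiting for the thread exit is omitted)
}

int main() {
  std::atomic<ThreadJoinState> join_state{THREAD_NOT_JOINED};
  std::cout << "first join:  " << try_join(join_state) << "\n";  // 0
  std::cout << "second join: " << try_join(join_state) << "\n";  // EINVAL
}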

This differs from #2604 because pthread_join can find the thread in Bionic's global thread list. The join still failed because the thread was already joined, most likely because a second thread entered Trace::StopTracing() after the first had passed the guards but before the guarded state was reset at the end of the function, so nothing prevented the second join attempt.

What could happen concretely (just reading from the code; a simplified, self-contained sketch of this interleaving follows the list below):

  • T1 enters Trace::StopTracing() and manages to join the sampling thread without contention. The sampling thread is now in the join state THREAD_JOINED.
  • T1 no longer holds the trace_lock_.
  • T2 enters Trace::StopTracing() and acquires the trace_lock_, blocking T1 from flushing the trace buffer and resetting the_trace_ and sample_thread_.
  • T2 passes the guards, leaves the trace_lock_ scope, and calls pthread_join, causing the abort.
  • T1 acquires the trace_lock_ again, but can no longer prevent T2 from calling pthread_join.
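
Below is a simplified, self-contained C++ sketch of exactly this interleaving. It uses std::thread and std::mutex instead of the runtime's pthread calls, all names (trace_lock, the_trace, sampling_thread, the helpers) are placeholders that only mimic the ART members, and the two stopper threads are sequenced with promises so the interleaving reproduces deterministically; in the real runtime it is a genuine race, and the failed join makes the runtime abort rather than throw.

#include <future>
#include <iostream>
#include <mutex>
#include <system_error>
#include <thread>

std::mutex trace_lock;        // stands in for trace_lock_
bool the_trace = true;        // stands in for "the_trace_ != nullptr" (the guard)
std::thread sampling_thread;  // stands in for the sampling thread handle

bool PassedGuard() {
  std::lock_guard<std::mutex> lk(trace_lock);
  return the_trace;           // note: the guard is NOT reset here
}

void ResetGuard() {
  std::lock_guard<std::mutex> lk(trace_lock);
  the_trace = false;          // reset happens only at the very end of the "stop"
}

void JoinSampler(const char* who) {
  try {
    sampling_thread.join();
    std::cout << who << ": joined the sampling thread\n";
  } catch (const std::system_error& e) {
    // With pthreads, this is where the second pthread_join returns EINVAL and
    // the runtime treats the non-zero return as fatal and aborts.
    std::cout << who << ": join failed: " << e.what() << "\n";
  }
}

int main() {
  sampling_thread = std::thread([] {});  // the "sampling thread", exits immediately
  std::promise<void> t1_joined, t2_done;

  std::thread t1([&] {                   // T1 enters the "stop" path
    if (!PassedGuard()) return;
    JoinSampler("T1");                   // succeeds, the thread is now joined
    t1_joined.set_value();
    t2_done.get_future().wait();         // T2 sneaks in before T1 resets the guard
    ResetGuard();                        // too late to stop T2
  });

  std::thread t2([&] {                   // T2 enters the "stop" path concurrently
    if (!PassedGuard()) return;          // passes: T1 has not reset the guard yet
    t1_joined.get_future().wait();
    JoinSampler("T2");                   // fails: the thread is already joined
    t2_done.set_value();
  });

  t1.join();
  t2.join();
}

Running it prints that T1's join succeeds while T2's fails with an invalid-argument error, mirroring the cause message of the abort.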

Of course, this can only happen if two threads can call Trace::StopTracing() concurrently and that isn't prevented at a higher level. Given the locks already present inside the function, I imagine there is no such higher-level prevention (another hint is the existence of the num_trace_stops_initiated_ counter).

There can be good reasons why the function doesn't hold the trace_lock_ over its entire scope, but I didn't dive into that further. Maybe this can be fixed by checking the_trace_->stop_tracing_ == true in the first lock scope and returning early, because it means a "stop" is already in progress; a sketch of that idea follows below.
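
As a sketch of that idea, expressed against a simplified model rather than the actual ART code (only the_trace_->stop_tracing_ comes from the paragraph above; the rest of the names are placeholders): checking and claiming a "stop in progress" flag inside the first lock scope lets a second caller bail out before it ever reaches the join.

#include <iostream>
#include <mutex>

std::mutex trace_lock;          // stands in for trace_lock_
bool the_trace = true;          // a trace is currently active
bool stop_in_progress = false;  // would correspond to the_trace_->stop_tracing_

// Decide inside the first lock scope whether this caller owns the shutdown.
bool ShouldProceedWithStop() {
  std::lock_guard<std::mutex> lk(trace_lock);
  if (!the_trace || stop_in_progress) {
    return false;               // a stop is already in progress: return early
  }
  stop_in_progress = true;      // claim the stop while still holding the lock
  return true;                  // only this caller goes on to join the sampler
}

int main() {
  std::cout << ShouldProceedWithStop() << "\n";  // 1: the first stop proceeds
  std::cout << ShouldProceedWithStop() << "\n";  // 0: a concurrent stop bails out
}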

Another interesting aspect is that sampling_pthread == 0U only means that pthread_join() isn't called; the rest of the function still executes. In that case, it might be better to tie the reset of sampling_pthread to pthread_join() (which previously was the case [6]).
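
A sketch of that suggestion (again only a stand-in, not the ART code; JoinAndClear is a hypothetical helper): clearing the stored handle exactly where the join resolves makes "handle == 0" reliably mean "nothing left to join", so a second stop attempt becomes a harmless no-op instead of a failing pthread_join.

#include <pthread.h>
#include <cstdio>

static void* Sample(void*) { return nullptr; }  // pretend sampling-thread body

// Hypothetical helper: reset the handle as part of the join itself.
int JoinAndClear(pthread_t* handle) {
  if (*handle == 0) return 0;                   // nothing left to join
  const int rc = pthread_join(*handle, nullptr);
  if (rc == 0) *handle = 0;                     // clear exactly when the join succeeded
  return rc;
}

int main() {
  pthread_t sampler = 0;                        // assumes pthread_t is an integer, as on bionic/Linux
  pthread_create(&sampler, nullptr, Sample, nullptr);
  std::printf("first:  %d\n", JoinAndClear(&sampler));  // 0, handle cleared
  std::printf("second: %d\n", JoinAndClear(&sampler));  // 0 again, no join attempted
  return 0;
}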

Please keep in mind that I am referencing only the main branch here; if we knew exactly which versions of bionic and the runtime were in use in the crash cases, there could certainly be additional ways in which this behavior could emerge, but that is not the investigation being done here 😅. I still think this analysis makes sense because we know where and why this probably happened and what the expected behavior is in the latest state of the implementation.

References to follow along:

[1] https://cs.android.com/android//android/platform/art/+/main:runtime/trace.cc;l=914-919
[2] https://cs.android.com/android/platform/superproject/main/+/main:bionic/libc/bionic/pthread_join.cpp;l=53-55
[3] https://cs.android.com/android/platform/superproject/main/+/main:bionic/libc/bionic/pthread_create.cpp;l=476-485
[4] https://cs.android.com/android//android/platform/art/+/main:runtime/trace.cc;l=858-860
[5] https://cs.android.com/android//android/platform/art/+/main:runtime/trace.cc;l=968-969
[6] https://cs.android.com/android//android/platform/art/+/71cb074032f1b42349df77fd9d143ff5971b87ae:runtime/trace.cc;dlc=b0bdf29f81eda70342e898ec151d1165e5a758f0

supervacuus avatar Aug 08 '24 10:08 supervacuus

I was able to reproduce the issue using a vanilla Android app template (without Sentry), which pretty much validates my "concrete" race described above. I used an Android 14 emulator image without Google Play so I could root it and (Frida-)trace the ART method tracer, too.

Essentially, the following code is enough to trigger the mentioned race, leading to the same stack trace and "cause" message as seen in the Google error reporting (using the default params from io.sentry.android.core.AndroidProfiler):

import android.os.Debug
import kotlin.concurrent.thread

// Start the sampling profiler with the default params from io.sentry.android.core.AndroidProfiler
Debug.startMethodTracingSampling(null, 3_000_000, 30_000)
Thread.sleep(1000)

// Ten threads racing to stop the same (global) method trace
val threads = List(10) {
    thread(start = false) {
        Debug.stopMethodTracing()
    }
}
threads.forEach { it.start() }
threads.forEach { it.join() }

(screenshot: abort stack trace with the pthread_join "Invalid argument" cause message)

I also got another stack trace that is still a pthread_join race in Trace::StopTracing() but curiously ends up aborting from __pthread_internal_find() (like #2604, but with a different cause message; I can't explain why right now):

(screenshot: second abort stack trace, ending in __pthread_internal_find())

While this repro code is not something we'll find in the wild, there is a practical chance that multiple threads stop tracing concurrently via "scope-closing" abstractions (think spans/transactions). The runtime shouldn't crash in these cases but rather log an error (like it does when stop is called on an inactive trace).

I already prepped a patch for the runtime, but I can't send it for review to AOSP Gerrit before my vacation (and given the potential review back-and-forth, it doesn't make sense either). Maybe I can test the patched runtime against the repro before sending it for review. In any case, I will do this when I am back.

supervacuus avatar Aug 09 '24 07:08 supervacuus

I already prepped a patch for the runtime, but I can't send it for review to AOSP Gerrit before my vacation

That's awesome! Even if you do that though, it means profiling is practically unusable until the new OS update ships with your patch included, which is unfortunate. Could you share the patch here, so we could consider applying it via something like https://github.com/bytedance/android-inline-hook for the older OS versions?

romtsn avatar Aug 09 '24 08:08 romtsn

That's great! However, in our Android SDK we call Debug.startMethodTracingSampling() and Debug.stopMethodTracing() inside synchronized methods of the AndroidProfiler class, and this should prevent this case when using the Sentry SDK. Would you say it's still possible for this to happen when the thread join is called in the native layer, given that no other calls to Debug.stopMethodTracing() are made by the user?

stefanosiano avatar Aug 09 '24 08:08 stefanosiano

That's great! However, in our Android SDK we call Debug.startMethodTracingSampling() and Debug.stopMethodTracing() inside synchronized methods of the AndroidProfiler class, and this should prevent this case when using the Sentry SDK.

Yes, I've seen that, but since Debug.stopMethodTracing() is a static method that acts on global trace data, nothing prevents code outside the SDK from calling it concurrently with AndroidProfiler.endAndCollect(). Of course, this is something that can be documented in red letters, and Sentry can ask affected customers to look into their code or their dependencies.

Would you say it's still possible for this to happen when the thread join is called in the native layer, given that no other calls to Debug.stopMethodTracing() are made by the user?

In theory, yes, but in contrast to #2604, I would consider it much less likely. In #2604, we don't know whether a particular thread or just any arbitrary thread is missing, nor whether those threads were created from native code that forgot to detach (all of that is still an assumption).

Here, the cause is always the already-joined sampling thread of the ART method tracer. While it is possible that some native code joins the sampling thread out of the blue (it is neither created by native client code nor does it typically interact with it), I think it is much more likely that some other Java code calls Debug.stopMethodTracing().

supervacuus avatar Aug 11 '24 14:08 supervacuus

I already prepped a patch for the runtime, but I can't send it for review to AOSP Gerrit before my vacation

That's awesome! Even if you do that though, it means profiling is practically unusable until the new OS update ships with your patch included, which is unfortunate.

It is certainly not great, but I wouldn't go so far as to call it unusable. I think, and please correct me if that assumption is wrong, that most SDK clients will not interact with the method tracing directly (or via one of their other dependencies).

As @stefanosiano mentioned, having synchronized invocations of Debug.stopMethodTracing() entirely eliminates the problem within the SDK. I think documenting this behavior in an alert box for all features that end up calling AndroidProfiler.endAndCollect(), and asking affected users about other potential uses of the ART method tracer in their code, would go a long way toward preventing this issue.

Could you share the patch here, so we could think of applying it via something like https://github.com/bytedance/android-inline-hook for the older OS versions?

Yes, of course, but please let me test the code first. Adding the fix via hooks won't be trivial since it changes the control flow based on private state, and I can't think of a blunt wrapping approach. But let's talk details when I am back.

supervacuus avatar Aug 11 '24 14:08 supervacuus

report on the ART issue-tracker: https://issuetracker.google.com/issues/362293861

supervacuus avatar Aug 27 '24 16:08 supervacuus

moving this to blocked while we await feedback from Google

kahest avatar Sep 05 '24 14:09 kahest

Update: there is some movement on the ART issue tracker: https://issuetracker.google.com/issues/362293861#comment6. Our tests indicate that the fix mentioned in the comment resolves the reproducer we created for this issue.

kahest avatar Sep 19 '24 13:09 kahest

Closing this as it has been fixed by Google. Follow https://issuetracker.google.com/issues/362293861 for updates on rollout status.

kahest avatar Dec 04 '24 14:12 kahest