tor-android icon indicating copy to clipboard operation
tor-android copied to clipboard

Destroying TorService a second time causes crash

Open grote opened this issue 2 years ago • 42 comments

While how to stop TorService is not documented, one usually stops a bound service simply by unbinding from it.

This works fine the first time. But if I need to start and bind to TorService a second time, I reproducibly get a hard crash in libc when unbinding again:

    --------- beginning of crash
A/libc: fdsan: attempted to close file descriptor 91, expected to be unowned, actually owned by unique_fd 0x7b7962c254
A/libc: Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 17309 (tor), pid 17187 (onshare.android)
2021-11-25 15:47:57.180 17543-17543/? I/crash_dump64: obtaining output fd from tombstoned, type: kDebuggerdTombstoneProto
2021-11-25 15:47:57.181 711-711/? I/tombstoned: received crash request for pid 17527
2021-11-25 15:47:57.183 17543-17543/? I/crash_dump64: performing dump of process 17436 (target tid = 17527)

grote avatar Nov 25 '21 18:11 grote

This may or may not be related, but it seems that you don't keep a reference to the FileDescriptor for the control socket and therefore you are never closing it. Not when exceptions occur and not when the service gets destroyed.

grote avatar Nov 25 '21 18:11 grote

ah, that sounds like a good catch! I could never reliably reproduce it to troubleshoot it.

eighthave avatar Nov 25 '21 19:11 eighthave

seems like keeping a reference, then closing it in onDestroy() could work

eighthave avatar Nov 25 '21 19:11 eighthave

Here is also some related info and debugging: https://gitlab.torproject.org/tpo/core/tor/-/issues/32729

eighthave avatar Nov 25 '21 20:11 eighthave

Background info about fdsan:

fdsan is a file descriptor sanitizer added to Android in API level 29. ... detects mishandling of file descriptor ownership, which tend to manifest as use-after-close and double-close.

grote avatar Nov 25 '21 21:11 grote

I don't think ParcelFileDescriptor is in use at all here. That's for serializing things for Intents.

eighthave avatar Nov 25 '21 21:11 eighthave

Seems like this might need to be handled in jtorctl, since it's TorControlConnection is what is using those streams.

eighthave avatar Nov 25 '21 22:11 eighthave

If that commit works, there should be a build available in the artfacts in this job, once it completes: https://gitlab.com/eighthave/tor-android/-/pipelines/416708252

eighthave avatar Nov 25 '21 22:11 eighthave

If that commit works, there should be a build available in the artfacts in this job, once it completes: https://gitlab.com/eighthave/tor-android/-/pipelines/416708252

If you close the control connection streams there, won't this interfere with the ability to use the control connection?

I pushed an extra test here that allows me to reproduce the problem: https://github.com/grote/tor-android/commit/04d249d9538a2dd5446749f3e3bc039c20228966#diff-169d30aebc4b72de0419cb79771f8ced07f96a9d6353f17fd2bd86f5d3fe1f8cR1 It does nothing more then binding and unbinding the service twice.

Again, I urge you to read the fdsan docs: https://android.googlesource.com/platform/bionic/+/master/docs/fdsan.md This has nothing to do with ParcelFileDescriptor, but is what is causing the crash.

grote avatar Nov 26 '21 12:11 grote

For reference: The native file descriptor code is here: https://github.com/guardianproject/tor/blob/3b6987c9df1f15621866898babbb96e7a010e1e8/src/feature/api/org_torproject_jni_TorService.c#L269

grote avatar Nov 26 '21 13:11 grote

I won't have time to dig into this for a bit. If you want to try the naive closing, it built here: https://gitlab.com/eighthave/tor-android/-/pipelines/416958111

eighthave avatar Nov 26 '21 13:11 eighthave

That test case looks very promising!

eighthave avatar Nov 26 '21 13:11 eighthave

Full backtrace follows below.

Looks like a double close:

libc.so (android_fdsan_close_with_tag+752) libc.so (close+16) libtor.so (tor_close_socket_simple+28) libtor.so (tor_close_socket+32)

A:       #00 pc 0000000000053bb0  /apex/com.android.runtime/lib64/bionic/libc.so (fdsan_error(char const*, ...)+556) (BuildId: ba489d4985c0cf173209da67405662f9)
A:       #01 pc 00000000000538bc  /apex/com.android.runtime/lib64/bionic/libc.so (android_fdsan_close_with_tag+752) (BuildId: ba489d4985c0cf173209da67405662f9)
A:       #02 pc 0000000000054028  /apex/com.android.runtime/lib64/bionic/libc.so (close+16) (BuildId: ba489d4985c0cf173209da67405662f9)
A:       #03 pc 000000000017fa60  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (tor_close_socket_simple+28)
A:       #04 pc 000000000017fae8  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (tor_close_socket+32)
A:       #05 pc 00000000001f2728  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so
A:       #06 pc 00000000001f9da4  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (connection_free_all+140)
A:       #07 pc 0000000000148dc0  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (tor_free_all+100)
A:       #08 pc 0000000000148d30  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (tor_cleanup+184)
A:       #09 pc 0000000000107968  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (tor_run_main+1192)
A:       #10 pc 0000000000105a40  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (Java_org_torproject_jni_TorService_runMain+36)
A:       #11 pc 0000000000222244  /apex/com.android.art/lib64/libart.so (art_quick_generic_jni_trampoline+148) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #12 pc 0000000000218964  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+548) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #13 pc 0000000000284208  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+188) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #14 pc 00000000003e31cc  /apex/com.android.art/lib64/libart.so (art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread*, art::ArtMethod*, art::ShadowFrame*, unsigned short, art::JValue*)+400) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #15 pc 00000000003de040  /apex/com.android.art/lib64/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+804) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #16 pc 0000000000755724  /apex/com.android.art/lib64/libart.so (MterpInvokeDirect+1036) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #17 pc 0000000000203914  /apex/com.android.art/lib64/libart.so (mterp_op_invoke_direct+20) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #18 pc 0000000000275eec  [anon:dalvik-classes11.dex extracted in memory from /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/base.apk!classes11.dex]
A:       #19 pc 00000000003d5e70  /apex/com.android.art/lib64/libart.so (art::interpreter::Execute(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame&, art::JValue, bool, bool)+304) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #20 pc 000000000073adbc  /apex/com.android.art/lib64/libart.so (artQuickToInterpreterBridge+776) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #21 pc 0000000000222378  /apex/com.android.art/lib64/libart.so (art_quick_to_interpreter_bridge+88) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #22 pc 0000000000211608  /apex/com.android.art/lib64/libart.so (nterp_helper+152) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #23 pc 0000000000275c9e  [anon:dalvik-classes11.dex extracted in memory from /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/base.apk!classes11.dex]
A:       #24 pc 0000000000218964  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+548) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #25 pc 0000000000284208  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+188) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #26 pc 000000000061f4d4  /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeVirtualOrInterfaceWithJValues<art::ArtMethod*>(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, art::ArtMethod*, jvalue const*)+460) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #27 pc 000000000066e0b8  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1184) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #28 pc 00000000000b1910  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+264) (BuildId: ba489d4985c0cf173209da67405662f9)
A:       #29 pc 00000000000513f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: ba489d4985c0cf173209da67405662f9)
E: Tombstone written to: tombstone_23

grote avatar Nov 26 '21 14:11 grote

Wonder why this isn't happening in Orbot..

n8fr8 avatar Nov 30 '21 14:11 n8fr8

@grote can you tell what is calling close each time?

@asn-d6 @ahf this seems to crash the same way as https://gitlab.torproject.org/tpo/core/tor/-/issues/32729 and @grote has put together a repro test case. The stack trace looks different though, so I'm not sure it is 100% related.

eighthave avatar Nov 30 '21 14:11 eighthave

Wonder why this isn't happening in Orbot..

I wonder as well. Maybe it isn't really stopping the service? I noticed that unbinding isn't enough to stop it when running a foreground service.

can you tell what is calling close each time?

No, I just got

libc.so (android_fdsan_close_with_tag+752)
libc.so (close+16)
libtor.so (tor_close_socket_simple+28)
libtor.so (tor_close_socket+32)

as posted in https://github.com/guardianproject/tor-android/issues/57#issuecomment-980036924 . So maybe Tor is closing it and then something in Android is trying as well to close it?

@grote has put together a repro test case.

Has anyone been able to reproduce the issue on an API 29+ device? Either with an app or that instrumentation test? Knowing if others can reproduce it, could be the first step to solving it.

grote avatar Nov 30 '21 18:11 grote

Has anyone been able to reproduce the issue on an API 29+ device? Either with an app or that instrumentation test? Knowing if others can reproduce it, could be the first step to solving it.

I can reproduce the crash with @grote's test on the Redmi Note 7 (API 29).

--------- beginning of crash
12-01 13:01:57.342 20633 20772 F libc    : Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 20772 (tor), pid 20633 (roject.jni.test)
12-01 13:01:57.354 20773 20773 F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
12-01 13:01:57.354 20773 20773 F DEBUG   : Build fingerprint: 'xiaomi/lavender/lavender:10/QKQ1.190910.002/V12.5.1.0.QFGMIXM:user/release-keys'
12-01 13:01:57.354 20773 20773 F DEBUG   : Revision: '0'
12-01 13:01:57.354 20773 20773 F DEBUG   : ABI: 'arm64'
12-01 13:01:57.355 20773 20773 F DEBUG   : Timestamp: 2021-12-01 13:01:57+0000
12-01 13:01:57.355 20773 20773 F DEBUG   : pid: 20633, tid: 20725, name: tor  >>> org.torproject.jni.test <<<
12-01 13:01:57.355 20773 20773 F DEBUG   : uid: 10149
12-01 13:01:57.355 20773 20773 F DEBUG   : signal 35 (<debuggerd signal>), code -1 (SI_QUEUE), fault addr --------
12-01 13:01:57.355 20773 20773 F DEBUG   : Abort message: 'attempted to close file descriptor 62, expected to be unowned, actually owned by unique_fd 0x7bae8b2894'
12-01 13:01:57.355 20773 20773 F DEBUG   :     x0  0000000000000000  x1  00000000000050f5  x2  0000000000000023  x3  0000007b51762bb0
12-01 13:01:57.355 20773 20773 F DEBUG   :     x4  3832623865616237  x5  3832623865616237  x6  3832623865616237  x7  3439383262386561
12-01 13:01:57.355 20773 20773 F DEBUG   :     x8  00000000000000f0  x9  fc4ae377d41e87a7  x10 0000000000000000  x11 0000007c3e6dfcfa
12-01 13:01:57.355 20773 20773 F DEBUG   :     x12 0000000000000062  x13 0000007b5176278f  x14 0000007b5176277a  x15 0000000000000000
12-01 13:01:57.355 20773 20773 F DEBUG   :     x16 0000007c3e7b88c0  x17 0000007c3e794bc0  x18 0000007b4e5f8000  x19 0000000000005099
12-01 13:01:57.356 20773 20773 F DEBUG   :     x20 00000000000050f5  x21 0000007b51764020  x22 0000007c41d0c8f4  x23 0000000000000001
12-01 13:01:57.356 20773 20773 F DEBUG   :     x24 0000007b517629a0  x25 0000007b51762920  x26 0000007b517628e0  x27 0000000000000001
12-01 13:01:57.356 20773 20773 F DEBUG   :     x28 0000007b51763040  x29 0000007b51762c90
12-01 13:01:57.356 20773 20773 F DEBUG   :     sp  0000007b51762860  lr  0000007c3e7497d8  pc  0000007c3e7497f8
12-01 13:01:57.422 20778 20778 I crash_dump64: obtaining output fd from tombstoned, type: kDebuggerdTombstone
12-01 13:01:57.424  1184  1184 I /system/bin/tombstoned: received crash request for pid 20772
12-01 13:01:57.424  1184  1184 I /system/bin/tombstoned: enqueueing crash request for pid 20772
12-01 13:01:57.598 20773 20773 F DEBUG   : 
12-01 13:01:57.598 20773 20773 F DEBUG   : backtrace:
12-01 13:01:57.598 20773 20773 F DEBUG   :       #00 pc 00000000000867f8  /apex/com.android.runtime/lib64/bionic/libc.so (fdsan_error(char const*, ...)+388) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 13:01:57.598 20773 20773 F DEBUG   :       #01 pc 0000000000086610  /apex/com.android.runtime/lib64/bionic/libc.so (android_fdsan_close_with_tag+728) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 13:01:57.598 20773 20773 F DEBUG   :       #02 pc 0000000000086d7c  /apex/com.android.runtime/lib64/bionic/libc.so (close+16) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 13:01:57.598 20773 20773 F DEBUG   :       #03 pc 000000000017fa60  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_close_socket_simple+28)
12-01 13:01:57.598 20773 20773 F DEBUG   :       #04 pc 000000000017fae8  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_close_socket+32)
12-01 13:01:57.598 20773 20773 F DEBUG   :       #05 pc 00000000001f2728  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so
12-01 13:01:57.598 20773 20773 F DEBUG   :       #06 pc 00000000001f9da4  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (connection_free_all+140)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #07 pc 0000000000148dc0  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_free_all+100)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #08 pc 0000000000148d30  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_cleanup+184)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #09 pc 0000000000107968  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_run_main+1192)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #10 pc 0000000000105a40  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (Java_org_torproject_jni_TorService_runMain+36)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #11 pc 0000000000140350  /apex/com.android.runtime/lib64/libart.so (art_quick_generic_jni_trampoline+144) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #12 pc 0000000000137334  /apex/com.android.runtime/lib64/libart.so (art_quick_invoke_stub+548) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #13 pc 0000000000145fec  /apex/com.android.runtime/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+244) (BuildId: 64f7ca74f522
ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #14 pc 00000000002e3800  /apex/com.android.runtime/lib64/libart.so (art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread*, art::ArtMethod*, art::ShadowFrame*, unsigned short, a
rt::JValue*)+384) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #15 pc 00000000002dea60  /apex/com.android.runtime/lib64/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, u
nsigned short, art::JValue*)+892) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #16 pc 00000000005a3ae4  /apex/com.android.runtime/lib64/libart.so (MterpInvokeDirect+424) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #17 pc 0000000000131914  /apex/com.android.runtime/lib64/libart.so (mterp_op_invoke_direct+20) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #18 pc 00000000000aab18  [anon:dalvik-classes.dex extracted in memory from /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/base.apk] (org.torproject.jni.TorService.access$1000)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #19 pc 00000000005a452c  /apex/com.android.runtime/lib64/libart.so (MterpInvokeStatic+1040) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #20 pc 0000000000131994  /apex/com.android.runtime/lib64/libart.so (mterp_op_invoke_static+20) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #21 pc 00000000000a9c34  [anon:dalvik-classes.dex extracted in memory from /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/base.apk] (org.torproject.jni.TorService$3.run+724)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #22 pc 00000000002b4b14  /apex/com.android.runtime/lib64/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadERKNS_20CodeItemDataAccessorERNS_11ShadowFrameENS_6JValueEbb.llvm.10247671929900336896+240) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #23 pc 0000000000592d24  /apex/com.android.runtime/lib64/libart.so (artQuickToInterpreterBridge+1032) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #24 pc 0000000000140468  /apex/com.android.runtime/lib64/libart.so (art_quick_to_interpreter_bridge+88) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #25 pc 0000000000137334  /apex/com.android.runtime/lib64/libart.so (art_quick_invoke_stub+548) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #26 pc 0000000000145fec  /apex/com.android.runtime/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+244) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.600 20773 20773 F DEBUG   :       #27 pc 00000000004b171c  /apex/com.android.runtime/lib64/libart.so (art::(anonymous namespace)::InvokeWithArgArray(art::ScopedObjectAccessAlreadyRunnable const&, art::ArtMethod*, art::(anonymous namespace)::ArgArray*, art::JValue*, char const*)+104) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.600 20773 20773 F DEBUG   :       #28 pc 00000000004b2830  /apex/com.android.runtime/lib64/libart.so (art::InvokeVirtualOrInterfaceWithJValues(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jmethodID*, jvalue const*)+416) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.600 20773 20773 F DEBUG   :       #29 pc 00000000004f31f8  /apex/com.android.runtime/lib64/libart.so (art::Thread::CreateCallback(void*)+1176) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.600 20773 20773 F DEBUG   :       #30 pc 00000000000e6b50  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 13:01:57.600 20773 20773 F DEBUG   :       #31 pc 0000000000084c6c  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: b9254a9320cb794714fda3ae99886b88)

akwizgran avatar Dec 01 '21 13:12 akwizgran

I haven't finished debugging this but I'll add some notes about what I've found so far.

So far I haven't been able to reproduce the crash when running only @grote's test case, but I can usually get a crash when running that test case along with the existing tests in TorServiceTest.java. (I had to fix a couple of tests - one was using a v2 onion URL, the other was using a plaintext HTTP URL which the API 29 device refused to access.)

Only a few of the crashes were related to fdsan errors. Most were caused by SIGABRT with one of the following two backtraces, both of which I saw multiple times:

12-01 13:01:59.383 20778 20778 F DEBUG   :       #00 pc 00000000000831f0  /apex/com.android.runtime/lib64/bionic/libc.so (abort+160) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #01 pc 0000000000170864  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_raw_abort_+12)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #02 pc 0000000000179de8  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_abort_+12)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #03 pc 000000000023c9ec  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (hs_circuitmap_init+132)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #04 pc 0000000000242fc8  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (hs_init+28)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #05 pc 0000000000106d48  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_init+128)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #06 pc 0000000000107564  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_run_main+164)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #07 pc 0000000000105a40  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (Java_org_torproject_jni_TorService_runMain+36)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #00 pc 00000000000831f0  /apex/com.android.runtime/lib64/bionic/libc.so (abort+160) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #01 pc 0000000000170864  /data/app/org.torproject.jni.test-d3CQfay99oHSZlslpR6oVg==/lib/arm64/libtor.so (tor_raw_abort_+12)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #02 pc 0000000000179de8  /data/app/org.torproject.jni.test-d3CQfay99oHSZlslpR6oVg==/lib/arm64/libtor.so (tor_abort_+12)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #03 pc 0000000000107448  /data/app/org.torproject.jni.test-d3CQfay99oHSZlslpR6oVg==/lib/arm64/libtor.so (pubsub_install+156)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #04 pc 0000000000107558  /data/app/org.torproject.jni.test-d3CQfay99oHSZlslpR6oVg==/lib/arm64/libtor.so (tor_run_main+152)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #05 pc 0000000000105a40  /data/app/org.torproject.jni.test-d3CQfay99oHSZlslpR6oVg==/lib/arm64/libtor.so (Java_org_torproject_jni_TorService_runMain+36)

I started to wonder whether it was possible for one test's call to runMain() to overlap with another test's call to the same method, which might result in two threads accessing the same static state inside the Tor library. The fdsan errors would be a specific case of this involving a file descriptor, while the other two backtraces would involve some other static state.

I added some debug logging and found that when a crash occurred, the previous test case's call to runMain() often returned after the crash.

However, the problem wasn't exactly caused by overlapping calls to runMain(), because the crashing test case hadn't called runMain() at the time of the crash (or, possibly, it had done so but then it had crashed before the log message could be written to the log).

I added a lock to prevent concurrent calls to runMain(), but this didn't help. I made the lock static so that different TorService instances would share the same lock, and this did seem to make a difference. Since making the lock static I've only managed to capture one crash, which was also the only run where the log indicated that a TorService instance was waiting for the lock - indicating that different test cases were indeed potentially calling into the Tor library concurrently from different threads.

Since the crash still happens, I think the problem is not exactly concurrent calls to runMain(), but rather concurrent calls to a set of native methods that includes runMain(). I'm going to increase the amount of code guarded by the lock to see if the crashes can be eliminated, and if so, which concurrent native calls are involved.

akwizgran avatar Dec 01 '21 15:12 akwizgran

I think the crashes are probably caused by calling mainConfigurationFree() from onDestroy() while runMain() is still running in another thread (or may not even have been called yet... the thread that calls it may be doing the other native calls to set up the command line and control socket).

Moving the native setup calls and runMain() into a try block, with mainConfigurationFree() in the finally block, and also using a static lock to prevent concurrent entry into that block, seems to prevent the various crashes. However, fdsan now reliably complains about mainConfigurationFree() closing Tor's owning_controller_socket. This socket is the "other end" of the socket that's returned to TorService, and Tor closes its end when the main configuration is released. This produces the following complaint from fdsan:

12-01 16:41:35.061 20764 20873 I TorService: torControlFd is 62 before freeing main configuration
12-01 16:41:35.061 20764 20873 I TorService: Freeing main configuration
12-01 16:41:35.063 20764 20873 E fdsan   : attempted to close file descriptor 63, expected to be unowned, actually owned by unique_fd 0x7c40818d54

(In this case the FD for TorService's end of the control connection was 62, while the FD for Tor's end was 63.)

This happens before the static lock is released, so I don't think the issue is that another TorService instance is messing with the contents of the main configuration. And yet I can only reproduce this when running the whole test suite... so some kind of interaction between the different test cases does seem to be involved.

Perhaps when TorService closes the control socket, the FD on Tor's end is automatically closed and released as well. But Tor still has a record of the FD number in its main configuration struct. Later, some other caller asks for an FD and the number previously used by Tor gets reallocated. (Perhaps this part only happens if other test cases are running, which is why I can't reproduce the problem with a single test case?). Then later, when mainConfigurationFree() is called, Tor tries to close the FD that was already closed earlier, and fdsan notices the problem?

akwizgran avatar Dec 01 '21 17:12 akwizgran

@akwizgran sounds like you're thinking that the crash isn't so much about the Control Socket but rather knowing when to call Tor's free/shutdown methods. That was my impression also, when we were going through the troubleshooting as discussed in https://gitlab.torproject.org/tpo/core/tor/-/issues/32729.

I also remember having the issue where these kinds of crashes happened only when running the whole test suite, where there were repeated start/stop cycles. The hard part is knowing exactly what is not cleaned up between runs, the docs never seem to fully outline it. Instead, there are statements like (keyword "most"):

Each test runs in its own Instrumentation instance. Therefore, if your tests share app state, most of that shared state is removed from your device's CPU or memory after each test. [1]

There is also the poorly documented runner option:

   // The following argument makes the Android Test Orchestrator run its
   // "pm clear" command after each test invocation. This command ensures
   // that the app's state is completely cleared between tests.
   testInstrumentationRunnerArguments clearPackageData: 'true'

Also, FYI, the core JNI code used here came from @sysrqb, and I wrapped that with the UNIX domain socket. @sysrqb do you have any thoughts on a proper shutdown process when running tor in a thread?

eighthave avatar Dec 02 '21 08:12 eighthave

@akwizgran sounds like you're thinking that the crash isn't so much about the Control Socket but rather knowing when to call Tor's free/shutdown methods.

Right. Specifically, I think the crash is caused by calling mainConfigurationFree() when the thread that makes the other native calls (most importantly runMain()) is still running. The documentation for tor_main_configuration_free() in tor_api.h says:

Once you have passed a tor_main_configuration_t to tor_run_main(), you must not free it until tor_run_main() has finished.

Moving mainConfigurationFree() to the same thread as the other native calls, and using a static lock to ensure that only one TorService instance per process is making native calls, prevents the crash. I'll clean these changes up and push them to a branch.

There's also some good news about the fdsan error. I can now reproduce this with only a single test case, which only binds and unbinds the service once. The error doesn't happen in every run, but in most runs fdsan warns about the file descriptor that's closed inside mainConfigurationFree(). If the test case finishes immediately then fdsan logs the initial warning but doesn't have time to log a backtrace or write a tombstone. Adding a one-second sleep to the end of the test case allows enough time for that to happen.

I think it was probably possible to reproduce the fdsan error before, but I was expecting to see a crash, so I ignored all the runs where the tests passed without crashing. That was a mistake. In the default configuration, fdsan doesn't terminate the process. It records a tombstone for the first fdsan violation, then disables further fdsan checks. So when I saw fdsan tombstones after a crash, they weren't the cause of the crash. Once I had prevented the crashes this became clear.

As far as I can tell, the fdsan error is unrelated to the crashes, and we might want to open a separate ticket for it. It's possible that it's a bug in Tor, where owning_controller_socket ought to be cleared if the control socket is closed before tor_main_configuration_free() is called, in order to avoid a double close.

I believe the reason we don't see the fdsan warning on every run is that fdsan only detects the double close if the file descriptor number is reallocated between the time when Tor's end of the control socket is closed and the time when tor_main_configuration_free() is called.

akwizgran avatar Dec 02 '21 15:12 akwizgran

Great work! :tada:

In the default configuration, fdsan doesn't terminate the process. It records a tombstone for the first fdsan violation, then disables further fdsan checks.

Could it be that my API 31 device is configured differently? When using your branch it is still crashing when running the tests. With Android Studio, there's occasional crashes, but when running them with ./gradlew connectedAndroidTest, it seems to crash every time pointing at fdsan as the culprit.

The tombstone it wrote has quite some extensive debug information, including a list of file descriptors that are currently open. https://pastebin.com/kmmT83yD

grote avatar Dec 02 '21 16:12 grote

Could it be that my API 31 device is configured differently?

Ah, good thought!

https://developer.android.com/about/versions/11/behavior-changes-all#fdsan

"The default mode for fdsan is changing in Android 11. fdsan now aborts upon detecting an error; the previous behavior was to log a warning and continue."

akwizgran avatar Dec 02 '21 16:12 akwizgran

The tombstone it wrote has quite some extensive debug information, including a list of file descriptors that are currently open.

Nice!

12-02 16:40:53.105 12452 12491 E fdsan   : attempted to close file descriptor 70, expected to be unowned, actually owned by unique_fd 0x7c197bee44
    fd 69: socket:[1960483] (unowned)
    fd 70: anon_inode:[eventfd] (unowned)

Here's my theory: FD 69 and 70 were originally allocated to the control socket, with 69 being TorService's end and 70 being Tor's end. Tor closed its end of the socket during shutdown and FD 70 was reused by anon_inode:[eventfd] (whatever that means... anyway it's not socket like FD 69). When tor_main_configuration_free() tried to close its owning_controller_socket, fdsan noticed the double close.

akwizgran avatar Dec 02 '21 17:12 akwizgran

To add weight to this theory, you could log the value of torControlFd in TorService before calling mainConfigurationFree(). The FD for TorService's end of the socket should be one less than the FD that fdsan complains about, and hopefully the tombstone should still show its type as socket, whereas the FD that fdsan complains about will most likely be reused as some other type.

akwizgran avatar Dec 02 '21 17:12 akwizgran

BTW, since the fdsan error does cause a crash on API 31, I'll update my PR to say that it fixes Tor's #32729 but not this issue.

akwizgran avatar Dec 02 '21 17:12 akwizgran

you could log the value of torControlFd in TorService before calling mainConfigurationFree(). The FD for TorService's end of the socket should be one less than the FD that fdsan complains about, and hopefully the tombstone should still show its type as socket, whereas the FD that fdsan complains about will most likely be reused as some other type.

Sounds like you are on to something:

E TorService: torControlFd: 68
E fdsan   : attempted to close file descriptor 69, expected to be unowned, actually owned by unique_fd 0x7c19653ba4
    fd 68: socket:[1962614] (unowned)
    fd 69: anon_inode:[eventfd] (unowned)

I tried this a second time and got the same, but with 69/70.

grote avatar Dec 02 '21 17:12 grote

Tor closed its end of the socket during shutdown and the file descriptor was reused. When tor_main_configuration_free() tried to close its owning_controller_socket, fdsan noticed the double close.

@asn-d6 @ahf This sounds like a bug in Tor. Can you confirm and see if this is already tracked by you upstream?

grote avatar Dec 02 '21 18:12 grote

A small experiment to do here: Would it be possible to try to run cfg->owning_controller_socket = -1; before the call to tor_main_configuration_free(cfg); in Java_org_torproject_jni_TorService_mainConfigurationFree() ?

I do indeed suspect we have a double close here where our API closes the fd after our main loop have already closed it, but I need to talk with Nick to confirm something around our main loop first.

Setting the value to -1 means that SOCKET_OK() should ignore it and the close() call should be skipped in the free function.

ahf avatar Dec 07 '21 19:12 ahf

Tracking this in https://gitlab.torproject.org/tpo/core/tor/-/issues/40525 for the Tor side.

ahf avatar Dec 07 '21 20:12 ahf