[...] *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
[...] ABI: 'arm64'
[...] Timestamp: 2023-02-03 17:45:56.762880900+0100
[...] Process uptime: 734s
[...] Cmdline: org.openobservatory.ooniprobe.experimental
[...] signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
[...]
[...] backtrace:
[...] #00 pc 0000000000051994 .../lib64/bionic/libc.so  (abort+164) (...)
[...] #01 pc 00000000012f429c .../lib/arm64/libgojni.so (tor_raw_abort_+12)
[...] #02 pc 0000000001301594 .../lib/arm64/libgojni.so (tor_abort_+12)
[...] #03 pc 000000000117d1e0 .../lib/arm64/libgojni.so (pubsub_install+156)
[...] #04 pc 000000000117d2ec .../lib/arm64/libgojni.so (tor_run_main+148)
[...] #05 pc 000000000117c004 .../lib/arm64/libgojni.so (_cgo_2d785783cadf_Cfunc_tor_run_main+32)

Seems to be a not-so-frequent event. It occurred just once while testing https://github.com/ooni/probe/issues/2405.

On 2023-02-07, the issues appeared again. The logcat contains these logs right before the crash:

2023-02-07 10:28:14.319 24818-27589 MK_EVENT  org.openobservatory.ooniprobe.dev  \
    D  {"key":"log","value":{"log_level":"INFO","message":"ptx: started socks listener at 127.0.0.1:43165"}}
2023-02-07 10:28:14.320 24818-27589 MK_EVENT org.openobservatory.ooniprobe.dev  \
    D  {"key":"log","value":{"log_level":"INFO","message":"torsf: rendezvous method: ''"}}
2023-02-07 10:28:14.921 24818-27849 libc org.openobservatory.ooniprobe.dev \
    A  Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 27849 (Thread-14), pid 24818 (y.ooniprobe.dev)

Therefore, it seems this issue is related to the torsf experiment.

Feb 03 '23 16:02 bassosimone

The relevant tor code is the following:

/** Install the publish/subscribe relationships for all the subsystems. */
void
pubsub_install(void)
{
    pubsub_builder_t *builder = pubsub_builder_new();
    int r = subsystems_add_pubsub(builder);
    tor_assert(r == 0);
    r = tor_mainloop_connect_pubsub(builder); // consumes builder
    tor_assert(r == 0);
}

It's not completely clear to me whether the first or the second tor_assert fails.

Feb 07 '23 11:02 bassosimone

I applied the following patch to investigate:

diff --git a/src/app/main/main.c b/src/app/main/main.c
index 838e129d04..3aa1b334fa 100644
--- a/src/app/main/main.c
+++ b/src/app/main/main.c
@@ -92,6 +92,8 @@
 #include <unistd.h>
 #endif
 
+#include <stdio.h>
+
 #ifdef HAVE_SYSTEMD
 #   if defined(__COVERITY__) && !defined(__INCLUDE_LEVEL__)
 /* Systemd's use of gcc's __INCLUDE_LEVEL__ extension macro appears to confuse
@@ -1284,8 +1286,16 @@ pubsub_install(void)
 {
     pubsub_builder_t *builder = pubsub_builder_new();
     int r = subsystems_add_pubsub(builder);
+    if (r != 0) {
+        fprintf(stderr, "SBSDEBUG: subsystems_add_pubsub failed: %d\n", r);
+        /* FALLTHROUGH */
+    }
     tor_assert(r == 0);
     r = tor_mainloop_connect_pubsub(builder); // consumes builder
+    if (r != 0) {
+        fprintf(stderr, "SBSDEBUG: tor_mainloop_connect_pubsub failed: %d\n", r);
+        /* FALLTHROUGH */
+    }
     tor_assert(r == 0);
 }

I saw the following logs before the usual crash:

2023-02-07 13:36:10.013 18333-18741 TestAsyncTask  org...vatory.ooniprobe.experimental \
    D  run next suite: experimental test:torsf

2023-02-07 13:36:10.483 18333-18741 MK_EVENT org...vatory.ooniprobe.experimental \
     D  {"key":"status.report_create","value":{"report_id":"20230207T123609Z_torsf_IT_30722_n1_oNthkv3lRtrxlslq"}}

2023-02-07 13:36:10.492 18333-19119 GoLog org...vatory.ooniprobe.experimental  \
    E  SBSDEBUG: tor_mainloop_connect_pubsub failed: -1

2023-02-07 13:36:12.351 18333-18741 MK_EVENT org...vatory.ooniprobe.experimental  \
    D  {"key":"log","value":{"log_level":"INFO","message":"torsf: rendezvous method: ''"}}

2023-02-07 13:36:12.351 18333-18989 libc  org...vatory.ooniprobe.experimental  \
    A  Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 18989 (Thread-8), pid 18333 (be.experimental)

2023-02-07 13:36:12.352 18333-18741 MK_EVENT org...vatory.ooniprobe.experimental  \
    D  {"key":"log","value":{"log_level":"INFO","message":"torsf: disable persistent datadir: false"}}

2023-02-07 13:36:12.352 18333-18741 MK_EVENT org...vatory.ooniprobe.experimental  \
    D  {"key":"log","value":{"log_level":"INFO","message":"tunnel: tor: exec: \u003cinternal/libtor\u003e /data/data/org.openobservatory.ooniprobe.experimental/files/tunnel/torsf/tor UseBridges 1 ClientTransportPlugin snowflake socks5 127.0.0.1:43071 Bridge snowflake 192.0.2.3:1 2B280B23E1107BB62ABFC40DDCC8824814F80A72 Log notice stderr Log notice file /data/data/org.openobservatory.ooniprobe.experimental/files/tunnel/torsf/tor/tor.log"}}

2023-02-07 13:36:12.539 18333-18741 MK_EVENT rg...vatory.ooniprobe.experimental  \
    D  {"key":"status.progress","value":{"message":"torsf: elapsedTime: 2 s; maxRuntime: 600 s","percentage":0.402056216507}}

So, we see that the issue happens when we run torsf also in this case. We also know that the issue happens inside of tor_mainloop_connect_pubsub. We need to continue debugging to understand what fails.

It's also worth mentioning that this kind of crash is currently the number one cause of crashes in the Android app based on the information that we could retrieve from the Google Play console. So, it's worth investigating it.

Feb 07 '23 14:02 bassosimone

I wrote an extended patch to log more messages:

diff --git a/src/app/main/main.c b/src/app/main/main.c
index 838e129d04..3aa1b334fa 100644
--- a/src/app/main/main.c
+++ b/src/app/main/main.c
@@ -92,6 +92,8 @@
 #include <unistd.h>
 #endif
 
+#include <stdio.h>
+
 #ifdef HAVE_SYSTEMD
 #   if defined(__COVERITY__) && !defined(__INCLUDE_LEVEL__)
 /* Systemd's use of gcc's __INCLUDE_LEVEL__ extension macro appears to confuse
@@ -1284,8 +1286,16 @@ pubsub_install(void)
 {
     pubsub_builder_t *builder = pubsub_builder_new();
     int r = subsystems_add_pubsub(builder);
+    if (r != 0) {
+        fprintf(stderr, "SBSDEBUG: subsystems_add_pubsub failed: %d\n", r);
+        /* FALLTHROUGH */
+    }
     tor_assert(r == 0);
     r = tor_mainloop_connect_pubsub(builder); // consumes builder
+    if (r != 0) {
+        fprintf(stderr, "SBSDEBUG: tor_mainloop_connect_pubsub failed: %d\n", r);
+        /* FALLTHROUGH */
+    }
     tor_assert(r == 0);
 }
 
diff --git a/src/core/mainloop/mainloop_pubsub.c b/src/core/mainloop/mainloop_pubsub.c
index 1e72ada5f0..72dd348a2e 100644
--- a/src/core/mainloop/mainloop_pubsub.c
+++ b/src/core/mainloop/mainloop_pubsub.c
@@ -26,6 +26,8 @@
 #include "lib/pubsub/pubsub.h"
 #include "lib/pubsub/pubsub_build.h"
 
+#include <stdio.h>
+
 /**
  * Dispatcher to use for delivering messages.
  **/
@@ -63,8 +65,10 @@ tor_mainloop_connect_pubsub(struct pubsub_builder_t *builder)
   tor_mainloop_disconnect_pubsub();
 
   the_dispatcher = pubsub_builder_finalize(builder, &the_pubsub_items);
-  if (! the_dispatcher)
+  if (! the_dispatcher) {
+    fprintf(stderr, "SBSDEBUG: the_distpatcher is NULL\n");
     goto err;
+  }
 
   rv = 0;
   goto done;
diff --git a/src/lib/pubsub/pubsub_build.c b/src/lib/pubsub/pubsub_build.c
index 30b9194062..6a554c06bd 100644
--- a/src/lib/pubsub/pubsub_build.c
+++ b/src/lib/pubsub/pubsub_build.c
@@ -25,7 +25,8 @@
 #include "lib/log/util_bug.h"
 #include "lib/malloc/malloc.h"
 
- #include <string.h>
+#include <string.h>
+#include <stdio.h>
 
 /** Construct and return a new empty pubsub_items_t. */
 static pubsub_items_t *
@@ -281,19 +282,24 @@ pubsub_builder_finalize(pubsub_builder_t *builder,
   dispatch_t *dispatcher = NULL;
   tor_assert_nonfatal(builder->n_connectors == 0);
 
-  if (pubsub_builder_check(builder) < 0)
+  if (pubsub_builder_check(builder) < 0) {
+    fprintf(stderr, "SBSDEBUG: pubsub_builder_check failed\n");
     goto err;
+  }
 
   if (builder->n_errors) {
     log_warn(LD_GENERAL, "At least one error occurred previously when "
              "configuring the dispatcher.");
+    fprintf(stderr, "SBSDEBUG: builder->n_errors is not zero\n");
     goto err;
   }
 
   dispatcher = dispatch_new(builder->cfg);
 
-  if (!dispatcher)
+  if (!dispatcher) {
+    fprintf(stderr, "SBSDEBUG: dispatcher_new failed\n");
     goto err;
+  }
 
   pubsub_items_install_bindings(builder->items, dispatcher);
   if (items_out) {

This is an excerpt from the logs:

2023-02-07 17:33:58.220  2085-2292  MK_EVENT org...vatory.ooniprobe.experimental  \
    D  {"key":"status.progress","value":{"message":"open report","percentage":0.4}}

2023-02-07 17:33:58.220  2085-2292  MK_EVENT org...vatory.ooniprobe.experimental \
    D  {"key":"status.report_create","value":{"report_id":"20230207T163357Z_torsf_IT_30722_n1_i0fyR61zbwOLzJMd"}}

2023-02-07 17:33:58.220  2085-2292  MK_EVENT org...vatory.ooniprobe.experimental  \
    D  {"key":"log","value":{"log_level":"INFO","message":"Starting measurement with index 0"}}

2023-02-07 17:33:58.220  2085-2292  MK_EVENT org...vatory.ooniprobe.experimental  \
    D  {"key":"status.measurement_start","value":{"idx":0,"input":""}}

2023-02-07 17:33:58.226  2085-2506  GoLog org...vatory.ooniprobe.experimental  \
    E  SBSDEBUG: pubsub_builder_check failed

2023-02-07 17:33:58.226  2085-2506  GoLog  org...vatory.ooniprobe.experimental  \
    E  SBSDEBUG: the_distpatcher is NULL

2023-02-07 17:33:58.231  2085-2506  GoLog  org...vatory.ooniprobe.experimental  \
    E  SBSDEBUG: tor_mainloop_connect_pubsub failed: -1

2023-02-07 17:33:58.607  2085-19668 libc  org...vatory.ooniprobe.experimental  \
    A  Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 19668 (AsyncTask #1), pid

So, now we know that the function that fails is pubsub_builder_check. 😅

Feb 07 '23 16:02 bassosimone

Since I introduced the following diff:

diff --git a/src/lib/pubsub/pubsub_check.c b/src/lib/pubsub/pubsub_check.c
index 99e604d715..a9ae957551 100644
--- a/src/lib/pubsub/pubsub_check.c
+++ b/src/lib/pubsub/pubsub_check.c
@@ -25,6 +25,7 @@
 #include "lib/malloc/malloc.h"
 #include "lib/string/compat_string.h"
 
+#include <stdio.h>
 #include <string.h>
 
 static void pubsub_adjmap_add(pubsub_adjmap_t *map,
@@ -343,21 +344,27 @@ lint_message(const pubsub_adjmap_t *map, message_id_t msg)
     log_warn(LD_MESG|LD_BUG,
              "Message \"%s\" has subscribers, but no publishers.",
              get_message_id_name(msg));
+    fprintf(stderr, "SBSDEBUG: n_pub == 0\n");
     ok = false;
   } else if (n_sub == 0) {
     log_warn(LD_MESG|LD_BUG,
              "Message \"%s\" has publishers, but no subscribers.",
              get_message_id_name(msg));
+    fprintf(stderr, "SBSDEBUG: n_sub == 0\n");
     ok = false;
   }
 
   /* Check the message graph topology. */
-  if (lint_message_graph(map, msg, pub, sub) < 0)
+  if (lint_message_graph(map, msg, pub, sub) < 0) {
+    fprintf(stderr, "SBSDEBUG: lint_message_graph failed\n");
     ok = false;
+  }
 
   /* Check whether the messages have the same fields set on them. */
-  if (lint_message_consistency(msg, pub, sub) < 0)
+  if (lint_message_consistency(msg, pub, sub) < 0) {
+    fprintf(stderr, "SBSDEBUG: lint_message_consistency failed\n");
     ok = false;
+  }
 
   if (!ok) {
     /* There was a problem -- let's log all the publishers and subscribers on
@@ -385,6 +392,7 @@ pubsub_adjmap_check(const pubsub_adjmap_t *map)
   bool all_ok = true;
   for (unsigned i = 0; i < map->n_msgs; ++i) {
     if (lint_message(map, i) < 0) {
+      fprintf(stderr, "lint_message failed for %u\n", i);
       all_ok = false;
     }
   }
@@ -401,11 +409,15 @@ pubsub_builder_check(pubsub_builder_t *builder)
   pubsub_adjmap_t *map = pubsub_build_adjacency_map(builder->items);
   int rv = -1;
 
-  if (!map)
+  if (!map) {
+    fprintf(stderr, "SBSDEBUG: pubsub_build_adjacency_map failed\n");
     goto err; // should be impossible
+  }
 
-  if (pubsub_adjmap_check(map) < 0)
+  if (pubsub_adjmap_check(map) < 0) {
+    fprintf(stderr, "SBSDEBUG: pubsub_adjmap_check failed\n");
     goto err;
+  }
 
   rv = 0;
  err:

I am not able to reproduce the problem anymore. I run 39 tests with this diff plus other fprintf diffs and there was no crash. I run 22 tests with just this diff applied and there was no crash. What am I missing? Heisenbug?

Feb 09 '23 10:02 bassosimone

After reverting all the patches, I finally managed to see another crash.

Here is the logcat:

02-09 15:07:42.777 27545  4515 F libc    : Fatal signal 6 (SIGABRT), \
    code -1 (SI_QUEUE) in tid 4515 (Thread-8), pid 27545 (be.experimental)

02-09 15:07:42.777 27545 27649 D MK_EVENT: {"key":"log","value": \
    {"log_level":"INFO","message":"torsf: rendezvous method: ''"}}

02-09 15:07:42.777 27545 27649 D MK_EVENT: {"key":"log","value": \
    {"log_level":"INFO","message":"torsf: disable persistent datadir: false"}}

02-09 15:07:43.786 17602 17602 F DEBUG   :       #01 pc 00000000012f214c \
     .../lib/arm64/libgojni.so (tor_raw_abort_+12)

02-09 15:07:43.786 17602 17602 F DEBUG   :       #02 pc 00000000012ff444  \
    .../lib/arm64/libgojni.so (tor_abort_+12)

02-09 15:07:43.786 17602 17602 F DEBUG   :       #03 pc 000000000117b090  \
    .../lib/arm64/libgojni.so (pubsub_install+156)

This was the 13th run. However, it was the third run in a row where I was running all tests as opposed to just running tor.

Feb 09 '23 14:02 bassosimone

I spent some time wondering whether this crash could be a memory corruption caused by code we introduced in https://github.com/ooni/probe-cli/pull/1052 and other pull requests. To understand whether that was the case, I searched for crashes in the Google Play Console. I found the same crash for ooniprobe-android 3.7.3 (95). This release of the Android app is definitely using a version engine that predates https://github.com/ooni/probe-cli/pull/1052.

Here's one of the stacktrace(s) provided by the Google Play console (others look similar):

backtrace:
  #00  pc 0x0000000000089b0c  .../lib64/bionic/libc.so (abort+164)
  #01  pc 0x00000000013778a4  .../split_config.arm64_v8a.apk (tor_raw_abort_+12)
  #02  pc 0x0000000001382150  .../split_config.arm64_v8a.apk (tor_abort_+12)
  #03  pc 0x00000000012470a0  .../split_config.arm64_v8a.apk (pubsub_install+120)
  #04  pc 0x0000000001247170  .../split_config.arm64_v8a.apk (tor_run_main+136)

This event occurred 526 times in the last 28 days. There are more crashes that look like this one, but it is not clear to me how come that they are not grouped together. Perhaps the strings of the stacktrace slightly differ depending on the actual mechanism with which the app has been installed.

Feb 09 '23 14:02 bassosimone

With https://github.com/ooni/probe-cli/pull/1077/commits/a372027d3a2060fce9d77cbee5c91dd4b2a0f705, I tried to make a reproducible test case for Linux but, so far, no luck.

Mar 21 '23 15:03 bassosimone

Hooray! I could finally reproduce the issue on Linux after 26 repetitions. This is the relevant crash log:

2023/03/21 15:33:18  info ptx: started socks listener at 127.0.0.1:36089
2023/03/21 15:33:18  info torsf: rendezvous method: ''
2023/03/21 15:33:18  info torsf: disable persistent datadir: false
2023/03/21 15:33:18  info tunnel: tor: exec: <internal/libtor> x/tunnel/torsf/tor UseBridges 1 \
    ClientTransportPlugin snowflake socks5 127.0.0.1:36089 Bridge snowflake 192.0.2.3:1 \
   2B280B23E1107BB62ABFC40DDCC8824814F80A72 Log notice stderr Log \
   notice file x/tunnel/torsf/tor/tor.log

SBSDEBUG: n_sub == 0
lint_message failed for 98
SBSDEBUG: n_pub == 0
lint_message failed for 116
SBSDEBUG: pubsub_adjmap_check failed
SIGABRT: abort
PC=0x7fe9efb66a7c m=5 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 6396 [syscall]:
runtime.cgocall(0xdf6640, 0xc00015de10)
	.../src/runtime/cgocall.go:158 +0x5c fp=0xc00015dde8 sp=0xc00015ddb0 pc=0x49df1c
github.com/ooni/probe-cli/v3/internal/libtor._Cfunc_tor_run_main(0x7fe9b01a2290)
	_cgo_gotypes.go:204 +0x4c fp=0xc00015de10 sp=0xc00015dde8 pc=0xdec92c
github.com/ooni/probe-cli/v3/internal/libtor.(*torProcess).runtor.func7(0x7fe9b01a2290)
	.../ooni/probe-cli/internal/libtor/enabled.go:252 +0x3f fp=0xc00015de40 sp=0xc00015de10 pc=0xded99f
github.com/ooni/probe-cli/v3/internal/libtor.(*torProcess).runtor(0xc000327620, {0x152c848, 0xc000668c80}, {0x1535018?, 0xc0005ced80}, {0xc000038400, 0x14, 0xc00019cf00?})
	.../ooni/probe-cli/internal/libtor/enabled.go:252 +0x685 fp=0xc00015df90 sp=0xc00015de40 pc=0xded665
github.com/ooni/probe-cli/v3/internal/libtor.(*torCreator).New.func1()
	.../ooni/probe-cli/internal/libtor/enabled.go:93 +0x45 fp=0xc00015dfe0 sp=0xc00015df90 pc=0xdecd05
runtime.goexit()
	.../src/runtime/asm_amd64.s:1594 +0x1 fp=0xc00015dfe8 sp=0xc00015dfe0 pc=0x501141
created by github.com/ooni/probe-cli/v3/internal/libtor.(*torCreator).New
	.../ooni/probe-cli/internal/libtor/enabled.go:93 +0x23d

The above log provides us with the following information:

pubsub_adjmap_check is the function called by pubsub_builder_check that fails;
lint_message is the function called by pubsub_adjmap_check that fails;
lint_message fails for message 98, which has no subscribers;
lint_message fails for message 116, which has no publishers.

I still do not fully understand what these messages are and why would this happen.

There is also something weird with the number of blocked goroutines; see https://github.com/ooni/probe/issues/2443.

Mar 21 '23 15:03 bassosimone

It turns out the previous patch produced not-so-actionable data. Here's a better patch:

diff --git a/src/lib/pubsub/pubsub_check.c b/src/lib/pubsub/pubsub_check.c
index 99e604d715..a5cc4b7658 100644
--- a/src/lib/pubsub/pubsub_check.c
+++ b/src/lib/pubsub/pubsub_check.c
@@ -25,6 +25,7 @@
 #include "lib/malloc/malloc.h"
 #include "lib/string/compat_string.h"
 
+#include <stdio.h>
 #include <string.h>
 
 static void pubsub_adjmap_add(pubsub_adjmap_t *map,
@@ -343,21 +344,27 @@ lint_message(const pubsub_adjmap_t *map, message_id_t msg)
     log_warn(LD_MESG|LD_BUG,
              "Message \"%s\" has subscribers, but no publishers.",
              get_message_id_name(msg));
+    fprintf(stderr, "SBSDEBUG: n_pub == 0 for %s\n", get_message_id_name(msg));
     ok = false;
   } else if (n_sub == 0) {
     log_warn(LD_MESG|LD_BUG,
              "Message \"%s\" has publishers, but no subscribers.",
              get_message_id_name(msg));
+    fprintf(stderr, "SBSDEBUG: n_sub == 0 for %s\n", get_message_id_name(msg));
     ok = false;
   }
 
   /* Check the message graph topology. */
-  if (lint_message_graph(map, msg, pub, sub) < 0)
+  if (lint_message_graph(map, msg, pub, sub) < 0) {
+    fprintf(stderr, "SBSDEBUG: lint_message_graph failed for %s\n", get_message_id_name(msg));
     ok = false;
+  }
 
   /* Check whether the messages have the same fields set on them. */
-  if (lint_message_consistency(msg, pub, sub) < 0)
+  if (lint_message_consistency(msg, pub, sub) < 0) {
+    fprintf(stderr, "SBSDEBUG: lint_message_consistency failed for %s\n", get_message_id_name(msg));
     ok = false;
+  }
 
   if (!ok) {
     /* There was a problem -- let's log all the publishers and subscribers on
@@ -385,6 +392,7 @@ pubsub_adjmap_check(const pubsub_adjmap_t *map)
   bool all_ok = true;
   for (unsigned i = 0; i < map->n_msgs; ++i) {
     if (lint_message(map, i) < 0) {
+      fprintf(stderr, "SBSDEBUG: lint_message failed for %u %s\n", i, get_message_id_name((message_id_t)i));
       all_ok = false;
     }
   }
@@ -401,11 +409,15 @@ pubsub_builder_check(pubsub_builder_t *builder)
   pubsub_adjmap_t *map = pubsub_build_adjacency_map(builder->items);
   int rv = -1;
 
-  if (!map)
+  if (!map) {
+    fprintf(stderr, "SBSDEBUG: pubsub_build_adjacency_map failed\n");
     goto err; // should be impossible
+  }
 
-  if (pubsub_adjmap_check(map) < 0)
+  if (pubsub_adjmap_check(map) < 0) {
+    fprintf(stderr, "SBSDEBUG: pubsub_adjmap_check failed\n");
     goto err;
+  }
 
   rv = 0;
  err:

Now, let's go fishing for SIGABRT again! 🎣

Mar 21 '23 16:03 bassosimone

This time it crashed in a completely different way after 24 runs:

2023/03/21 16:49:03  info ptx: started socks listener at 127.0.0.1:32985
2023/03/21 16:49:03  info torsf: rendezvous method: ''
2023/03/21 16:49:03  info torsf: disable persistent datadir: false
2023/03/21 16:49:03  info tunnel: tor: exec: <internal/libtor> x/tunnel/torsf/tor UseBridges 1 \
    ClientTransportPlugin snowflake socks5 127.0.0.1:32985 Bridge snowflake 192.0.2.3:1 \
    2B280B23E1107BB62ABFC40DDCC8824814F80A72 Log notice stderr Log \
    notice file x/tunnel/torsf/tor/tor.log
BUG: subsystem btrack (at 55) could not connect to publish/subscribe system.
============================================================ T= 1679417343
INTERNAL ERROR: Raw assertion failed in Tor 0.4.7.13 at src/app/main/subsysmgr.c:183: 0
A subsystem couldn't be connected.
./testtorsf(dump_stack_symbols_to_error_fds+0x58)[0xe6df08]
./testtorsf(tor_raw_assertion_failed_msg_+0x97)[0xe6e8d7]
./testtorsf(subsystems_add_pubsub_upto+0x128)[0xe47df8]
./testtorsf(pubsub_install+0x29)[0xdf9c99]
./testtorsf(tor_run_main+0x8a)[0xdf9e2a]
./testtorsf(_cgo_2d785783cadf_Cfunc_tor_run_main+0x1b)[0xdf665b]
./testtorsf[0x500e04]

SIGABRT: abort
PC=0x7fea243eba7c m=8 sigcode=18446744073709551610
signal arrived during cgo execution

This is not the error I expected, but it still seems interesting because it's again related to pubsub.

I am going to fish for a couple more errors before trying to use tsan. 🎣

Mar 21 '23 17:03 bassosimone

Another crash (we are starting to shed some light on what happens, I think) after 17 runs:

2023/03/21 17:38:19  info ptx: started socks listener at 127.0.0.1:42891
2023/03/21 17:38:19  info torsf: rendezvous method: ''
2023/03/21 17:38:19  info torsf: disable persistent datadir: false
2023/03/21 17:38:19  info tunnel: tor: exec: <internal/libtor> x/tunnel/torsf/tor [snip]
SBSDEBUG: n_sub == 0 for orconn_state
SBSDEBUG: lint_message failed for 20 orconn_state
SBSDEBUG: n_pub == 0 for orconn_state
SBSDEBUG: lint_message failed for 116 orconn_state
SBSDEBUG: pubsub_adjmap_check failed
SIGABRT: abort
PC=0x7f6faf3a7a7c m=10 sigcode=18446744073709551610
signal arrived during cgo execution

Another crash after 16 runs:

2023/03/21 17:59:13  info tunnel: tor: exec: <internal/libtor> x/tunnel/torsf/tor [...]
BUG: subsystem btrack (at 55) could not connect to publish/subscribe system.
============================================================ T= 1679421553
INTERNAL ERROR: Raw assertion failed in Tor 0.4.7.13 at src/app/main/subsysmgr.c:183: 0
A subsystem couldn't be connected.
./testtorsf(dump_stack_symbols_to_error_fds+0x58)[0xe6df08]
./testtorsf(tor_raw_assertion_failed_msg_+0x97)[0xe6e8d7]
./testtorsf(subsystems_add_pubsub_upto+0x128)[0xe47df8]
./testtorsf(pubsub_install+0x29)[0xdf9c99]
./testtorsf(tor_run_main+0x8a)[0xdf9e2a]
./testtorsf(_cgo_2d785783cadf_Cfunc_tor_run_main+0x1b)[0xdf665b]
./testtorsf[0x500e04]

SIGABRT: abort
PC=0x7fa00f89aa7c m=14 sigcode=18446744073709551610
signal arrived during cgo execution

Another crash after 6 runs:

2023/03/21 18:10:02  info tunnel: tor: exec: <internal/libtor> x/tunnel/torsf/tor [...]
SBSDEBUG: n_sub == 0 for orconn_state
SBSDEBUG: lint_message failed for 15 orconn_state
SBSDEBUG: n_pub == 0 for orconn_state
SBSDEBUG: lint_message failed for 25 orconn_state
SBSDEBUG: pubsub_adjmap_check failed
SIGABRT: abort
PC=0x7f88f5fc1a7c m=13 sigcode=18446744073709551610
signal arrived during cgo execution

Another crash after 6 runs:

2023/03/22 06:41:56  info tunnel: tor: exec: <internal/libtor> x/tunnel/torsf/tor [...]
SBSDEBUG: n_sub == 0 for orconn_status
SBSDEBUG: lint_message failed for 16 orconn_status
SBSDEBUG: n_sub == 0 for orconn_state
SBSDEBUG: lint_message failed for 20 orconn_state
SBSDEBUG: n_pub == 0 for orconn_state
SBSDEBUG: lint_message failed for 33 orconn_state
SBSDEBUG: n_pub == 0 for orconn_status
SBSDEBUG: lint_message failed for 34 orconn_status
SBSDEBUG: pubsub_adjmap_check failed
SIGABRT: abort
PC=0x7fa2d8241a7c m=13 sigcode=18446744073709551610
signal arrived during cgo execution

Another crash after 6 runs:

023/03/22 06:53:37  info tunnel: tor: exec: <internal/libtor> x/tunnel/torsf/tor [...]
SBSDEBUG: n_sub == 0 for orconn_state
SBSDEBUG: lint_message failed for 20 orconn_state
SBSDEBUG: n_pub == 0 for orconn_state
SBSDEBUG: lint_message failed for 33 orconn_state
SBSDEBUG: pubsub_adjmap_check failed
SIGABRT: abort
PC=0x7f00acb2ca7c m=14 sigcode=18446744073709551610
signal arrived during cgo execution

Another crash after 6 runs:

2023/03/22 07:10:44  info tunnel: tor: exec: <internal/libtor> x/tunnel/torsf/tor [...]
SBSDEBUG: n_sub == 0 for orconn_status
SBSDEBUG: lint_message failed for 16 orconn_status
SBSDEBUG: n_pub == 0 for orconn_status
SBSDEBUG: lint_message failed for 34 orconn_status
SBSDEBUG: pubsub_adjmap_check failed
SIGABRT: abort
PC=0x7f42638d6a7c m=9 sigcode=18446744073709551610
signal arrived during cgo execution

Mar 21 '23 17:03 bassosimone

I tried to use vanillator instead of torsf. I also forgot I had previously exported:

export CGO_LDFLAGS="-ltsan"

This export turned out to be quite serendipitous (well, maybe?). Here's the crash log:

2023/03/22 08:36:57  info tunnel: tor: exec: <internal/libtor> x/tmp/vanillator/tor Log notice stderr Log notice file x/tmp/vanillator/tor/tor.log
==================
WARNING: ThreadSanitizer: data race (pid=154120)
  Read of size 8 at 0x7b2000000000 by thread T6:
    #0 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 (libtsan.so.0+0x6243e)
    #1 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:819 (libtsan.so.0+0x6243e)
    #2 tor_run_main <null> (testtorsf+0xdf8538)

  Previous write of size 8 at 0x7b2000000000 by main thread:
    #0 calloc ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:672 (libtsan.so.0+0x31edc)
    #1 cstringArrayNew /home/sbs/src/github.com/ooni/probe-cli/internal/libtor/enabled.go:30 (testtorsf+0xdf4c3f)
    #2 _cgo_2d785783cadf_Cfunc_cstringArrayNew /tmp/go-build/cgo-gcc-prolog:65 (testtorsf+0xdf4c3f)

  Location is heap block of size 128 at 0x7b2000000000 allocated by main thread:
    #0 calloc ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:672 (libtsan.so.0+0x31edc)
    #1 cstringArrayNew /home/sbs/src/github.com/ooni/probe-cli/internal/libtor/enabled.go:30 (testtorsf+0xdf4c3f)
    #2 _cgo_2d785783cadf_Cfunc_cstringArrayNew /tmp/go-build/cgo-gcc-prolog:65 (testtorsf+0xdf4c3f)

  Thread T6 (tid=154128, running) created by thread T4 at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:969 (libtsan.so.0+0x605b8)
    #1 _cgo_try_pthread_create /_/runtime/cgo/gcc_libinit.c:100 (testtorsf+0xdf43f0)

SUMMARY: ThreadSanitizer: data race (/home/sbs/src/github.com/ooni/probe-cli/testtorsf+0xdf8538) in tor_run_main
==================
==================
WARNING: ThreadSanitizer: data race (pid=154120)
  Read of size 8 at 0x7b0400001b20 by thread T6:
    #0 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 (libtsan.so.0+0x6243e)
    #1 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:819 (libtsan.so.0+0x6243e)
    #2 tor_run_main <null> (testtorsf+0xdf8611)

  Previous write of size 8 at 0x7b0400001b20 by main thread:
    #0 realloc ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:685 (libtsan.so.0+0x34179)
    #1 cfg_add_owned_arg.isra.0 <null> (testtorsf+0xe46ff1)

  Location is heap block of size 16 at 0x7b0400001b20 allocated by main thread:
    #0 realloc ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:685 (libtsan.so.0+0x34179)
    #1 cfg_add_owned_arg.isra.0 <null> (testtorsf+0xe46ff1)

  Thread T6 (tid=154128, running) created by thread T4 at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:969 (libtsan.so.0+0x605b8)
    #1 _cgo_try_pthread_create /_/runtime/cgo/gcc_libinit.c:100 (testtorsf+0xdf43f0)

SUMMARY: ThreadSanitizer: data race (/home/sbs/src/github.com/ooni/probe-cli/testtorsf+0xdf8611) in tor_run_main
==================
SBSDEBUG: n_sub == 0 for orconn_status
SBSDEBUG: lint_message failed for 16 orconn_status
SBSDEBUG: n_sub == 0 for orconn_state
SBSDEBUG: lint_message failed for 20 orconn_state
SBSDEBUG: n_pub == 0 for orconn_state
SBSDEBUG: lint_message failed for 33 orconn_state
SBSDEBUG: n_pub == 0 for orconn_status
SBSDEBUG: lint_message failed for 34 orconn_status
SBSDEBUG: pubsub_adjmap_check failed
SIGABRT: abort
PC=0x7f3781c6fa7c m=7 sigcode=18446744073709551610
signal arrived during cgo execution

Now, it would be my pleasure to figure out what I am looking at 😬.

Mar 22 '23 08:03 bassosimone

I added https://github.com/ooni/probe-cli/pull/1077/commits/51292a49e41ae9283b8212726c466d90ff1d9d38 and https://github.com/ooni/probe-cli/pull/1077/commits/9e5e96710f9b0f57ee5c8346fece4745375c0560, which solve part of the data race that existed, however, I still see the following in the logs:

[snip]
2023/03/22 10:11:38  info ************* now let's wait a bit ********************************
2023/03/22 10:12:23  info tunnel: tor: exec: <internal/libtor> x/tmp/vanillator/tor Log notice stderr Log notice file x/tmp/vanillator/tor/tor.log
==================
WARNING: ThreadSanitizer: data race (pid=175907)
  Write of size 8 at 0x000001db3b80 by thread T3:
    #0 strncpy ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:740 (libtsan.so.0+0x4a951)
    #1 configure_backtrace_handler <null> (testtorsf+0xe6c8f7)

  Previous write of size 8 at 0x000001db3b80 by thread T6:
    [failed to restore the stack]

  Location is global 'bt_version' of size 128 at 0x000001db3b80 (testtorsf+0x000001db3b80)

  Thread T3 (tid=175912, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:969 (libtsan.so.0+0x605b8)
    #1 _cgo_try_pthread_create /_/runtime/cgo/gcc_libinit.c:100 (testtorsf+0xdf4530)

  Thread T6 (tid=175915, running) created by thread T3 at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:969 (libtsan.so.0+0x605b8)
    #1 _cgo_try_pthread_create /_/runtime/cgo/gcc_libinit.c:100 (testtorsf+0xdf4530)

SUMMARY: ThreadSanitizer: data race (/home/sbs/src/github.com/ooni/probe-cli/testtorsf+0xe6c8f7) in configure_backtrace_handler
==================

[snip]

2023/03/22 10:21:49  info tunnel: tor: exec: <internal/libtor> x/tmp/vanillator/tor Log \
    notice stderr Log notice file x/tmp/vanillator/tor/tor.log
SBSDEBUG: n_sub == 0 for ocirc_state
SBSDEBUG: lint_message failed for 20 ocirc_state
SBSDEBUG: n_pub == 0 for ocirc_state
SBSDEBUG: lint_message failed for 60 ocirc_state
SBSDEBUG: pubsub_adjmap_check failed
SIGABRT: abort
PC=0x7ff4a26efa7c m=3 sigcode=18446744073709551610
signal arrived during cgo execution

I don't fully understand the first race. I am not sure what happens later. I think the main issue at the moment is that I am using the race detector only for C code compiled by Go, due to the export LDFLAGS. It may be wise to consider figuring out (1) a way to reproduce without involving Go code and (2) a way to compile and link tor using tsan. I previously tried to compile tor using tsan with https://github.com/ooni/probe-cli/pull/1077/commits/143373422bfdea467b3768f0f016494aeadc335e, but I got link errors, so I reverted https://github.com/ooni/probe-cli/pull/1077/commits/fc6d0b6e04175722ce77e40cda613c6b009ec02f.

Mar 22 '23 11:03 bassosimone

With https://github.com/ooni/probe-cli/pull/1077/commits/ef7ec1494b3d5f84233006034a1888d63e726393, I wrote a C test case that bypasses Go code and runs tor more or less like we run it from Go. The main difference with that code is that we do not disable networking and then re-enable it as part of the bootstrap of tor. I chose to go down this road because I could not manage to compile tor with tsan and link it with the Go test program using -race. In any case, it seems I am moderately lucky today, since I can still see a crash:

Mar 22 13:21:26.132 [notice] Tor 0.4.7.13 running on Linux with [...]

[snip]

Mar 22 13:25:20.000 [notice] Bootstrapped 100% (done): Done
Mar 22 13:25:56.000 [notice] Owning controller connection has closed -- exiting now.
Mar 22 13:25:56.000 [notice] Catching signal TERM, exiting cleanly.
********** doing another round
SBSDEBUG: n_pub == 0 for ocirc_state
SBSDEBUG: lint_message failed for 7 ocirc_state
SBSDEBUG: n_sub == 0 for orconn_state
SBSDEBUG: lint_message failed for 10 orconn_state
SBSDEBUG: n_sub == 0 for ocirc_state
SBSDEBUG: lint_message failed for 29 ocirc_state
SBSDEBUG: n_pub == 0 for orconn_state
SBSDEBUG: lint_message failed for 32 orconn_state
SBSDEBUG: pubsub_adjmap_check failed

What is very interesting (and unexpected by me) here is that we don't see any data races. I am reasonably sure I have compiled the code using tsan, because:

% ldd a.out
        linux-vdso.so.1 (0x00007fffd0dd3000)
        libtsan.so.0 => /lib/x86_64-linux-gnu/libtsan.so.0 (0x00007f4b5fcd3000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4b5fbec000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4b5f9c4000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f4b5f9a4000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f4b60f53000)

So, it seems it's possible to reduce the problem to just using C code with tor 0.4.7.13. Now, it would probably be useful to further reduce the minimal test case to avoid recompiling all the dependencies (but libtor.a obviously).

It may also be a good idea to check whether the problem has been fixed in the main development branch.

Mar 22 '23 13:03 bassosimone

So, I managed to write a minimal example to reproduce the behavior.

Copying from my notes:

git clone https://gitlab.torproject.org/tpo/core/tor
cd tor
git checkout tor-0.4.7.13
git apply 004.diff where 004.diff is

diff --git a/src/lib/pubsub/pubsub_check.c b/src/lib/pubsub/pubsub_check.c
index 99e604d715..a5cc4b7658 100644
--- a/src/lib/pubsub/pubsub_check.c
+++ b/src/lib/pubsub/pubsub_check.c
@@ -25,6 +25,7 @@
 #include "lib/malloc/malloc.h"
 #include "lib/string/compat_string.h"
 
+#include <stdio.h>
 #include <string.h>
 
 static void pubsub_adjmap_add(pubsub_adjmap_t *map,
@@ -343,21 +344,27 @@ lint_message(const pubsub_adjmap_t *map, message_id_t msg)
     log_warn(LD_MESG|LD_BUG,
              "Message \"%s\" has subscribers, but no publishers.",
              get_message_id_name(msg));
+    fprintf(stderr, "SBSDEBUG: n_pub == 0 for %s\n", get_message_id_name(msg));
     ok = false;
   } else if (n_sub == 0) {
     log_warn(LD_MESG|LD_BUG,
              "Message \"%s\" has publishers, but no subscribers.",
              get_message_id_name(msg));
+    fprintf(stderr, "SBSDEBUG: n_sub == 0 for %s\n", get_message_id_name(msg));
     ok = false;
   }
 
   /* Check the message graph topology. */
-  if (lint_message_graph(map, msg, pub, sub) < 0)
+  if (lint_message_graph(map, msg, pub, sub) < 0) {
+    fprintf(stderr, "SBSDEBUG: lint_message_graph failed for %s\n", get_message_id_name(msg));
     ok = false;
+  }
 
   /* Check whether the messages have the same fields set on them. */
-  if (lint_message_consistency(msg, pub, sub) < 0)
+  if (lint_message_consistency(msg, pub, sub) < 0) {
+    fprintf(stderr, "SBSDEBUG: lint_message_consistency failed for %s\n", get_message_id_name(msg));
     ok = false;
+  }
 
   if (!ok) {
     /* There was a problem -- let's log all the publishers and subscribers on
@@ -385,6 +392,7 @@ pubsub_adjmap_check(const pubsub_adjmap_t *map)
   bool all_ok = true;
   for (unsigned i = 0; i < map->n_msgs; ++i) {
     if (lint_message(map, i) < 0) {
+      fprintf(stderr, "SBSDEBUG: lint_message failed for %u %s\n", i, get_message_id_name((message_id_t)i));
       all_ok = false;
     }
   }
@@ -401,11 +409,15 @@ pubsub_builder_check(pubsub_builder_t *builder)
   pubsub_adjmap_t *map = pubsub_build_adjacency_map(builder->items);
   int rv = -1;
 
-  if (!map)
+  if (!map) {
+    fprintf(stderr, "SBSDEBUG: pubsub_build_adjacency_map failed\n");
     goto err; // should be impossible
+  }
 
-  if (pubsub_adjmap_check(map) < 0)
+  if (pubsub_adjmap_check(map) < 0) {
+    fprintf(stderr, "SBSDEBUG: pubsub_adjmap_check failed\n");
     goto err;
+  }
 
   rv = 0;
  err:

./autogen.sh
./configure --disable-asciidoc
make
mkdir tmp
vi tmp/main.c where main.c contains

#include "../src/feature/api/tor_api.h"

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

static void *threadMain(void *ptr) {
	int *fdp = (int*)ptr;
	(void)sleep(45 /* second */);
	(void)close(*fdp);
	free(fdp);
	return NULL;
}

int main() {
	for (;;) {
		tor_main_configuration_t *config = tor_main_configuration_new();
		if (config == NULL) {
			exit(1);
		}
		char *argv[] = {
			"tor",
			"Log",
			"notice stderr",
			"DataDirectory",
			"./x",
			NULL,
		};
		int argc = 5;
		if (tor_main_configuration_set_command_line(config, argc, argv) != 0) {
			exit(2);
		}
		int filedesc = tor_main_configuration_setup_control_socket(config);
		if (filedesc < 0) {
			exit(3);
		}
		int *fdp = malloc(sizeof(*fdp));
		if (fdp == NULL) {
			exit(4);
		}
		*fdp = filedesc;
		pthread_t thread;
		if (pthread_create(&thread, NULL, threadMain, /* move */ fdp) != 0) {
			exit(5);
		}
		tor_run_main(config);
		if (pthread_join(thread, NULL) != 0) {
			exit(6);
		}
		fprintf(stderr, "********** doing another round\n");
	}
}

gcc -Wall tmp/main.c -L. -ltor -levent -lcrypto -lssl -lz -lm
./a.out 2>&1|tee LOG.txt

When I ran this on a Ubuntu 22.04.2 LTS distro, I got this output:

[...]

Mar 22 14:07:21.000 [notice] Owning controller connection has closed -- exiting now.
Mar 22 14:07:21.000 [notice] Catching signal TERM, exiting cleanly.
********** doing another round
SBSDEBUG: n_sub == 0 for orconn_state
SBSDEBUG: lint_message failed for 5 orconn_state
SBSDEBUG: n_pub == 0 for orconn_state
SBSDEBUG: lint_message failed for 34 orconn_state
SBSDEBUG: pubsub_adjmap_check failed
[1]    300227 IOT instruction (core dumped)  ./a.out 2>&1 | 
       300228 done                           tee LOG.txt

Please, see also the complete LOG.txt file.

However, I must say the error we see here is different from the one I would expect to see. Sure, we see the usual messages indicating something is wrong with the pubsub subsystem, but here we're getting an "IOT instruction", which is different from what we used to see from Go. According to the GNU libc manual, "on most platforms SIGIOT is equivalent to SIGABRT". Also, include/linux/signal.h mentions SIGIOT to be an alias for SIGABRT. Also, GNU libc seems to define SIGIOT as a backward compatibility name for SIGABRT. It remains the mystery of why now the error is "IOT istruction" instead of "SIGABRT: abort" but maybe this is just because Go handles the signal differently? (Does this make sense? Is it possible to register a handler for SIGABRT? So, it seems it's possible to do that but the system would abort anyway after running the signal handler.)

Mar 22 '23 14:03 bassosimone

I have just seen another crash:

Mar 22 15:57:36.000 [notice] Bootstrapped 90% (ap_handshake_done): Handshake finished with a relay to build circuits
Mar 22 15:57:36.000 [notice] Bootstrapped 95% (circuit_create): Establishing a Tor circuit
Mar 22 15:57:36.000 [notice] Bootstrapped 100% (done): Done
Mar 22 15:58:19.000 [notice] Owning controller connection has closed -- exiting now.
Mar 22 15:58:19.000 [notice] Catching signal TERM, exiting cleanly.
********** doing another round
SBSDEBUG: n_pub == 0 for orconn_state
SBSDEBUG: lint_message failed for 23 orconn_state
SBSDEBUG: n_sub == 0 for orconn_state
SBSDEBUG: lint_message failed for 28 orconn_state
SBSDEBUG: pubsub_adjmap_check failed
[1]    313605 IOT instruction (core dumped)  ./a.out 2>&1 | 
       313606 done                           tee LOG.txt

% ldd a.out
        linux-vdso.so.1 (0x00007ffc18fb8000)
        libevent-2.1.so.7 => /lib/x86_64-linux-gnu/libevent-2.1.so.7 (0x00007f4355bec000)
        libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007f43557a9000)
        libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f4355705000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f43556e9000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4355602000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f43553d8000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f4355f49000)

This was after six repetitions again (WTF?!). Also, it seems weird that it previously run for ~1h without crashes after I patched the code to install a signal handler for SIGABRT. I am not sure this matters, though.

Here is the full LOG.txt.

Mar 22 '23 16:03 bassosimone

The next step is probably to patch tor to avoid aborting and see whether that makes any difference.

It's also worth recompiling just tor and the simple main with tsan and see whether that makes any difference.

Mar 22 '23 16:03 bassosimone

What happens if I just prevent tor from aborting?

TL;DR nothing (save for possible memory corruption that I cannot demonstrate at this time).

Please see the LOG-without-aborting.txt, which includes:

Mar 22 16:17:06.000 [notice] Owning controller connection has closed -- exiting now.
Mar 22 16:17:06.000 [notice] Catching signal TERM, exiting cleanly.
********** doing another round
SBSDEBUG: n_pub == 0 for orconn_state
SBSDEBUG: lint_message failed for 9 orconn_state
SBSDEBUG: n_sub == 0 for orconn_state
SBSDEBUG: lint_message failed for 57 orconn_state
SBSDEBUG: pubsub_adjmap_check failed
SBSDEBUG: pubsub_install failed; all bets are off
********** doing another round
Mar 22 16:17:51.000 [notice] Tor 0.4.7.13 (git-7c1601fb6edd780f) running on Linux [...]

The full patch I am using is:

diff --git a/src/app/main/main.c b/src/app/main/main.c
index 838e129d04..6a7c99f9f7 100644
--- a/src/app/main/main.c
+++ b/src/app/main/main.c
@@ -92,6 +92,8 @@
 #include <unistd.h>
 #endif
 
+#include <stdio.h>
+
 #ifdef HAVE_SYSTEMD
 #   if defined(__COVERITY__) && !defined(__INCLUDE_LEVEL__)
 /* Systemd's use of gcc's __INCLUDE_LEVEL__ extension macro appears to confuse
@@ -1279,14 +1281,16 @@ run_tor_main_loop(void)
 }
 
 /** Install the publish/subscribe relationships for all the subsystems. */
-void
+int
 pubsub_install(void)
 {
     pubsub_builder_t *builder = pubsub_builder_new();
     int r = subsystems_add_pubsub(builder);
-    tor_assert(r == 0);
+    if (r != 0) {
+      return r;
+    }
     r = tor_mainloop_connect_pubsub(builder); // consumes builder
-    tor_assert(r == 0);
+    return r;
 }
 
 /** Connect the mainloop to its publish/subscribe message delivery events if
@@ -1331,7 +1335,10 @@ tor_run_main(const tor_main_configuration_t *tor_cfg)
   if (POSSIBLE(done))
     goto done;
 
-  pubsub_install();
+  if (pubsub_install() != 0) {
+    fprintf(stderr, "SBSDEBUG: pubsub_install failed; all bets are off\n");
+    goto done;
+  }
 
   {
     int init_rv = tor_init(argc, argv);
diff --git a/src/app/main/main.h b/src/app/main/main.h
index a8fa0959ab..08ba27253d 100644
--- a/src/app/main/main.h
+++ b/src/app/main/main.h
@@ -25,7 +25,7 @@ int tor_init(int argc, char **argv);
 
 int run_tor_main_loop(void);
 
-void pubsub_install(void);
+int pubsub_install(void);
 void pubsub_connect(void);
 
 #endif /* !defined(TOR_MAIN_H) */
diff --git a/src/lib/pubsub/pubsub_check.c b/src/lib/pubsub/pubsub_check.c
index 99e604d715..a5cc4b7658 100644
--- a/src/lib/pubsub/pubsub_check.c
+++ b/src/lib/pubsub/pubsub_check.c
@@ -25,6 +25,7 @@
 #include "lib/malloc/malloc.h"
 #include "lib/string/compat_string.h"
 
+#include <stdio.h>
 #include <string.h>
 
 static void pubsub_adjmap_add(pubsub_adjmap_t *map,
@@ -343,21 +344,27 @@ lint_message(const pubsub_adjmap_t *map, message_id_t msg)
     log_warn(LD_MESG|LD_BUG,
              "Message \"%s\" has subscribers, but no publishers.",
              get_message_id_name(msg));
+    fprintf(stderr, "SBSDEBUG: n_pub == 0 for %s\n", get_message_id_name(msg));
     ok = false;
   } else if (n_sub == 0) {
     log_warn(LD_MESG|LD_BUG,
              "Message \"%s\" has publishers, but no subscribers.",
              get_message_id_name(msg));
+    fprintf(stderr, "SBSDEBUG: n_sub == 0 for %s\n", get_message_id_name(msg));
     ok = false;
   }
 
   /* Check the message graph topology. */
-  if (lint_message_graph(map, msg, pub, sub) < 0)
+  if (lint_message_graph(map, msg, pub, sub) < 0) {
+    fprintf(stderr, "SBSDEBUG: lint_message_graph failed for %s\n", get_message_id_name(msg));
     ok = false;
+  }
 
   /* Check whether the messages have the same fields set on them. */
-  if (lint_message_consistency(msg, pub, sub) < 0)
+  if (lint_message_consistency(msg, pub, sub) < 0) {
+    fprintf(stderr, "SBSDEBUG: lint_message_consistency failed for %s\n", get_message_id_name(msg));
     ok = false;
+  }
 
   if (!ok) {
     /* There was a problem -- let's log all the publishers and subscribers on
@@ -385,6 +392,7 @@ pubsub_adjmap_check(const pubsub_adjmap_t *map)
   bool all_ok = true;
   for (unsigned i = 0; i < map->n_msgs; ++i) {
     if (lint_message(map, i) < 0) {
+      fprintf(stderr, "SBSDEBUG: lint_message failed for %u %s\n", i, get_message_id_name((message_id_t)i));
       all_ok = false;
     }
   }
@@ -401,11 +409,15 @@ pubsub_builder_check(pubsub_builder_t *builder)
   pubsub_adjmap_t *map = pubsub_build_adjacency_map(builder->items);
   int rv = -1;
 
-  if (!map)
+  if (!map) {
+    fprintf(stderr, "SBSDEBUG: pubsub_build_adjacency_map failed\n");
     goto err; // should be impossible
+  }
 
-  if (pubsub_adjmap_check(map) < 0)
+  if (pubsub_adjmap_check(map) < 0) {
+    fprintf(stderr, "SBSDEBUG: pubsub_adjmap_check failed\n");
     goto err;
+  }
 
   rv = 0;
  err:

Mar 22 '23 16:03 bassosimone

Regarding running the experiment using tsan, I did this last night and let the code run. The code that run had the more comprehensive patch which prevented tor from aborting and kept running. Here's what I learned:

I observed that the CPU usage was very high near the beginning of the bootstrap;
I observed a jump in the bootstrap time which may be correlated to the high CPU usage (not sure):

Mar 22 16:46:32.000 [notice] Bootstrapped 0% (starting): Starting
Mar 22 16:46:45.000 [notice] Starting with guard context "default"

the code bootstrapped 490 times and the "all bets are off" message indicating tor could not correctly build the pubsub adjacency map only occurred twice
I did not see any warning from the thread sanitizer in the logs
I don't know whether the high CPU usage caused by the sanitizer prevented races from happening (TBH, I do not even know whether it's possible or whether the sanitizer would spot races anyway 🙈)

Also:

sbs@ubuntu-4gb-fsn1-1 ~/t/tor> ldd a.out
	linux-vdso.so.1 (0x00007ffc34a8d000)
	libtsan.so.0 => /lib/x86_64-linux-gnu/libtsan.so.0 (0x00007fca1ed0f000)
	libevent-2.1.so.7 => /lib/x86_64-linux-gnu/libevent-2.1.so.7 (0x00007fca1ecbb000)
	libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007fca1e878000)
	libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007fca1e7d4000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fca1e7b8000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fca1e6cf000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fca1e4a7000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fca1e487000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fca1fe42000)

Here's the full log of this run: LOG-tsan-2.txt

Mar 23 '23 09:03 bassosimone

Okay, enough with the investigation. I am not sure how to proceed and I think understanding the problem probably requires a better understanding of C code debugging techniques and of the tor codebase.

I just opened an issue on the tor bug tracker: https://gitlab.torproject.org/tpo/core/tor/-/issues/40774.

Mar 23 '23 09:03 bassosimone

Sentry issue: PROBE-ANDROID-12E

Aug 17 '23 09:08 sentry-io[bot]

We're continuing to discuss with Tor developers at https://gitlab.torproject.org/tpo/core/tor/-/issues/40774#note_2959394.

Oct 31 '23 09:10 bassosimone

probe
probe copied to clipboard

engine: tor pubsub_install tor_raw_abort

What happens if I just prevent tor from aborting?

probe probe copied to clipboard

engine: tor pubsub_install tor_raw_abort

What happens if I just prevent tor from aborting?

probe
probe copied to clipboard