libportal icon indicating copy to clipboard operation
libportal copied to clipboard

intermittent segfault during automated tests

Open smcv opened this issue 1 year ago • 2 comments

While trying to reproduce #166 interactively, I encountered a segfault in the tests.

To reproduce:

  • mkdir _build
  • podman run --rm -it -w $(pwd) -v $(pwd):$(pwd):ro -v $(pwd)/_build:$(pwd)/_build:rw debian:sid-slim
  • in the container:
    • sed -i -e 's/Types:.*/Types: deb deb-src/' /etc/apt/sources.list.d/debian.sources
    • apt update
    • apt upgrade
    • apt build-dep libportal
    • meson setup _build
    • meson compile -C _build
    • meson test -C _build --timeout-multiplier=3 --repeat=20 pytest

Backtrace:

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0)
    at ./nptl/pthread_kill.c:44
#1  0x00007ff323b3347f in __pthread_kill_internal (signo=11, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007ff323ae4532 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  0x00007ff323ae45d0 in <signal handler called> () at /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007ff3218aaa55 in g_task_get_cancellable (task=0x5f6b636f6d737562) at ../../../gio/gtask.c:1275
#5  0x00007ff3216d0610 in call_returned (object=0x2b32cdd0 [GDBusConnection], result=0x2b2b5fe0, data=0x2b341340)
    at ../libportal/inputcapture.c:284
#6  0x00007ff3218aa393 in g_task_return_now (task=task@entry=0x2b2b5fe0 [GTask]) at ../../../gio/gtask.c:1361
#7  0x00007ff3218ab033 in g_task_return (type=<optimized out>, task=0x2b2b5fe0 [GTask]) at ../../../gio/gtask.c:1430
#8  g_task_return (task=0x2b2b5fe0 [GTask], type=<optimized out>) at ../../../gio/gtask.c:1387
#9  0x00007ff321908ce0 in g_dbus_connection_call_done
    (source=0x2b32cdd0 [GDBusConnection], result=<optimized out>, user_data=0x2b2b5fe0)
    at ../../../gio/gdbusconnection.c:6344
#10 0x00007ff3218aa393 in g_task_return_now (task=task@entry=0x2b2f0c50 [GTask]) at ../../../gio/gtask.c:1361
#11 0x00007ff3218aa3cd in complete_in_idle_cb (task=0x2b2f0c50) at ../../../gio/gtask.c:1375
#12 0x00007ff3222447df in g_main_dispatch (context=context@entry=0x2b3628f0) at ../../../glib/gmain.c:3357
#13 0x00007ff322246a17 in g_main_context_dispatch_unlocked (context=0x2b3628f0) at ../../../glib/gmain.c:4208
#14 g_main_context_iterate_unlocked
    (context=0x2b3628f0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at ../../../glib/gmain.c:4273
#15 0x00007ff32224746f in g_main_loop_run (loop=0x2b29ebf0) at ../../../glib/gmain.c:4475
#16 0x00007ff3235953fe in ffi_call_unix64 () at ../src/x86/unix64.S:104
#17 0x00007ff32359470d in ffi_call_int
    (cif=cif@entry=0x2b295828, fn=<optimized out>, rvalue=<optimized out>, avalue=<optimized out>, closure=closure@entry=0x0) at ../src/x86/ffi64.c:673
#18 0x00007ff323594ee3 in ffi_call
    (cif=cif@entry=0x2b295828, fn=<optimized out>, rvalue=rvalue@entry=0x7fffa3321c78, avalue=<optimized out>)
    at ../src/x86/ffi64.c:710
#19 0x00007ff322c1ecf3 in pygi_invoke_c_callable
    (function_cache=<optimized out>, state=<optimized out>, py_args=<optimized out>, py_kwargs=<optimized out>)
    at ../gi/pygi-invoke.c:684

and remaining stack frames are CPython.

It looks like a use-after-free of the Call: all of its pointer members point to inaccessible memory.

smcv avatar Sep 11 '24 09:09 smcv

I think what's happening here is that if the Response signals successful completion before the g_dbus_connection_call that created it has returned, then the Call is freed while it is still the user_data for g_dbus_connection_call. In call_returned(), if g_dbus_connection_call_finish succeeds, then we don't actually dereference the dangling pointer and it's OK (but probably still undefined behaviour); but if g_dbus_connection_call_finish somehow fails (perhaps because it was cancelled or because we disconnect from the bus or something?), we do dereference the pointer to freed memory.

Instead of using the Call as the user data, I think it might be better to use the GTask (which is refcounted) as the user data, and attach the rest of the Call members to it via g_task_set_task_data().

smcv avatar Sep 11 '24 10:09 smcv

I tried inserting some debug logging in functions that interact with the Call and using pytest -s -x (don't capture stdout/stderr, do fail as soon as one test-case fails) to be able to see the log output, but unfortunately that perturbs the timing enough that I can no longer reproduce the segfault, so I'm guessing at the root cause based on the output I saw.

smcv avatar Sep 11 '24 10:09 smcv

I've been given access to a cloud VM instance that was being used for Debian QA, where this crash happens much more often than on Debian's official autobuilders or on my own system. I don't know what combination of factors has caused the stars to have aligned differently (CPU speed to vCPU count ratio? CPU speed to I/O speed ratio? ...) but for whatever reason, it seems that this machine reproduces the crash after a few iterations of repeated testing.

In particular, on this machine I can reproduce the failure under pytest -v -v --log-level=DEBUG -k TestInputCapture -s -x.

I added #184, #185, #186 and some extra debug:

diff
diff --git a/libportal/inputcapture.c b/libportal/inputcapture.c
index 0921709..cd48a1f 100644
--- a/libportal/inputcapture.c
+++ b/libportal/inputcapture.c
@@ -233,6 +233,7 @@ typedef struct {
   /* SetPointerBarrier only */
   GList *barriers;
 
+  int in_flight;
 } Call;
 
 static void create_session (Call *call);
@@ -241,7 +242,9 @@ static void get_zones (Call *call);
 static void
 call_free (Call *call)
 {
+  g_debug ("%s: %p", G_STRFUNC, call);
   g_return_if_fail (G_IS_TASK (call->task));
+  g_return_if_fail (call->in_flight == 0);
 
   /* CreateSesssion */
   if (call->parent)
@@ -277,13 +280,24 @@ call_returned (GObject *object,
   GError *error = NULL;
   g_autoptr(GVariant) ret = NULL;
 
+  g_debug ("%s: %p", G_STRFUNC, call);
+  call->in_flight--;
+  g_debug ("%s: %p now in-flight %d", G_STRFUNC, call, call->in_flight);
+
   ret = g_dbus_connection_call_finish (G_DBUS_CONNECTION (object), result, &error);
   if (error)
     {
+      g_debug ("%s: %p -> error %s #%d %s", G_STRFUNC, call,
+               g_quark_to_string (error->domain), error->code, error->message);
       g_clear_signal_handler (&call->cancelled_id, g_task_get_cancellable (call->task));
+      g_debug ("%s: %p -> returning error", G_STRFUNC, call);
       g_task_return_error (call->task, error);
       call_free (call);
     }
+  else
+    {
+      g_debug ("%s: %p -> success, continue", G_STRFUNC, call);
+    }
 }
 
 static gboolean
@@ -330,8 +344,10 @@ prep_call (Call *call, GDBusSignalCallback callback, GVariantBuilder *options, v
 {
   g_autofree char *token = NULL;
 
+  g_debug ("%s: %p", G_STRFUNC, call);
   token = g_strdup_printf ("portal%d", g_random_int_range (0, G_MAXINT));
   call->request_path = g_strconcat (REQUEST_PATH_PREFIX, call->portal->sender, "/", token, NULL);
+  g_debug ("%s: %p subscribing, already=%u", G_STRFUNC, call, call->signal_id);
   call->signal_id = g_dbus_connection_signal_subscribe (call->portal->bus,
                                                         PORTAL_BUS_NAME,
                                                         REQUEST_INTERFACE,
@@ -491,6 +507,7 @@ get_zones_done (GDBusConnection *bus,
   g_autoptr(GVariant) ret = NULL;
 
   g_variant_get (parameters, "(u@a{sv})", &response, &ret);
+  g_debug ("%s: %p -> response code %d", G_STRFUNC, call, response);
 
   if (response != 0)
     g_clear_signal_handler (&call->cancelled_id, g_task_get_cancellable (call->task));
@@ -501,6 +518,7 @@ get_zones_done (GDBusConnection *bus,
       guint32 zone_set;
       XdpInputCaptureSession *session = call->session;
 
+      g_debug ("%s: %p unsubscribing", G_STRFUNC, call);
       g_dbus_connection_signal_unsubscribe (call->portal->bus, call->signal_id);
       call->signal_id = 0;
 
@@ -560,6 +578,7 @@ get_zones_done (GDBusConnection *bus,
           g_variant_lookup (ret, "zones", "@a(uuii)", &zones))
         {
           set_zones (session, zones, zone_set);
+          g_debug ("%s: %p returning Session %p", G_STRFUNC, call, session);
           g_task_return_pointer (call->task, session, g_object_unref);
         }
       else
@@ -575,7 +594,10 @@ get_zones_done (GDBusConnection *bus,
     g_task_return_new_error (call->task, G_IO_ERROR, G_IO_ERROR_FAILED, "InputCapture GetZones() failed");
 
   if (response != 0)
-    call_free (call);
+    {
+      g_debug ("%s: %p -> returned error", G_STRFUNC, call);
+      call_free (call);
+    }
 }
 
 static void
@@ -589,6 +611,9 @@ get_zones (Call *call)
   session_id = call->session ? call->session->parent_session->id : call->session_path;
 
   prep_call (call, get_zones_done, &options, NULL);
+  g_debug ("%s: %p calling GetZones", G_STRFUNC, call);
+  call->in_flight++;
+  g_debug ("%s: %p now in-flight %d", G_STRFUNC, call, call->in_flight);
   g_dbus_connection_call (call->portal->bus,
                           PORTAL_BUS_NAME,
                           PORTAL_OBJECT_PATH,
@@ -617,12 +642,14 @@ session_created (GDBusConnection *bus,
   g_autoptr(GVariant) ret = NULL;
 
   g_variant_get (parameters, "(u@a{sv})", &response, &ret);
+  g_debug ("%s: %p -> response code %d", G_STRFUNC, call, response);
 
   if (response != 0)
     g_clear_signal_handler (&call->cancelled_id, g_task_get_cancellable (call->task));
 
   if (response == 0)
     {
+      g_debug ("%s: %p unsubscribing", G_STRFUNC, call);
       g_dbus_connection_signal_unsubscribe (call->portal->bus, call->signal_id);
       call->signal_id = 0;
 
@@ -640,7 +667,10 @@ session_created (GDBusConnection *bus,
     g_task_return_new_error (call->task, G_IO_ERROR, G_IO_ERROR_FAILED, "CreateSession failed");
 
   if (response != 0)
-    call_free (call);
+    {
+      g_debug ("%s: %p -> returned error", G_STRFUNC, call);
+      call_free (call);
+    }
 }
 
 static void
@@ -649,6 +679,7 @@ call_cancelled_cb (GCancellable *cancellable,
 {
   Call *call = data;
 
+  g_debug ("%s: %p", G_STRFUNC, call);
   g_dbus_connection_call (call->portal->bus,
                           PORTAL_BUS_NAME,
                           call->request_path,
@@ -667,6 +698,8 @@ parent_exported (XdpParent *parent,
                  gpointer data)
 {
   Call *call = data;
+
+  g_debug ("%s: %p", G_STRFUNC, call);
   call->parent_handle = g_strdup (handle);
   create_session (call);
 }
@@ -678,8 +711,11 @@ create_session (Call *call)
   g_autofree char *session_token = NULL;
   GCancellable *cancellable;
 
+  g_debug ("%s: %p", G_STRFUNC, call);
+
   if (call->parent_handle == NULL)
     {
+      g_debug ("%s: %p needs to export parent first", G_STRFUNC, call);
       call->parent->parent_export (call->parent, parent_exported, call);
       return;
     }
@@ -694,6 +730,9 @@ create_session (Call *call)
   g_variant_builder_add (&options, "{sv}", "session_handle_token", g_variant_new_string (session_token));
   g_variant_builder_add (&options, "{sv}", "capabilities", g_variant_new_uint32 (call->capabilities));
 
+  g_debug ("%s: %p calling CreateSession", G_STRFUNC, call);
+  call->in_flight++;
+  g_debug ("%s: %p now in-flight %d", G_STRFUNC, call, call->in_flight);
   g_dbus_connection_call (call->portal->bus,
                           PORTAL_BUS_NAME,
                           PORTAL_OBJECT_PATH,
@@ -737,6 +776,7 @@ xdp_portal_create_input_capture_session (XdpPortal *portal,
   call = g_new0 (Call, 1);
   call->portal = g_object_ref (portal);
   call->task = g_task_new (portal, cancellable, callback, data);
+  g_debug ("%s: new call %p", G_STRFUNC, call);
 
   if (parent)
     call->parent = xdp_parent_copy (parent);
@@ -771,6 +811,7 @@ xdp_portal_create_input_capture_session_finish (XdpPortal *portal,
   g_return_val_if_fail (g_task_is_valid (result, portal), NULL);
 
   session = g_task_propagate_pointer (G_TASK (result), error);
+  g_debug ("%s -> %p", G_STRFUNC, session);
 
   if (session)
     return session;
@@ -952,6 +993,7 @@ set_pointer_barriers (Call *call)
   GVariantBuilder barriers;
   g_autoptr(GVariantType) vtype = NULL;
 
+  g_debug ("%s: %p", G_STRFUNC, call);
   prep_call (call, set_pointer_barriers_done, &options, NULL);
 
   vtype = g_variant_type_new ("aa{sv}");
@@ -959,6 +1001,9 @@ set_pointer_barriers (Call *call)
   g_variant_builder_init (&barriers, vtype);
   g_list_foreach (call->barriers, convert_barrier, &barriers);
 
+  g_debug ("%s: %p calling SetPointerBarriers", G_STRFUNC, call);
+  call->in_flight++;
+  g_debug ("%s: %p now in-flight %d", G_STRFUNC, call, call->in_flight);
   g_dbus_connection_call (call->portal->bus,
                           PORTAL_BUS_NAME,
                           PORTAL_OBJECT_PATH,
@@ -1023,6 +1068,7 @@ xdp_input_capture_session_set_pointer_barriers (XdpInputCaptureSession *session,
   call->portal = g_object_ref (portal);
   call->session = g_object_ref (session);
   call->task = g_task_new (session, cancellable, callback, data);
+  g_debug ("%s: new call %p", G_STRFUNC, call);
   call->barriers = barriers;
 
   set_pointer_barriers (call);
@@ -1048,6 +1094,7 @@ xdp_input_capture_session_set_pointer_barriers_finish (XdpInputCaptureSession  *
   g_return_val_if_fail (_xdp_input_capture_session_is_valid (session), NULL);
   g_return_val_if_fail (g_task_is_valid (result, session), NULL);
 
+  g_debug ("%s", G_STRFUNC);
   return g_task_propagate_pointer (G_TASK (result), error);
 }
 
@@ -1065,6 +1112,7 @@ xdp_input_capture_session_enable (XdpInputCaptureSession *session)
   GVariantBuilder options;
 
   g_return_if_fail (_xdp_input_capture_session_is_valid (session));
+  g_debug ("%s", G_STRFUNC);
 
   portal = session->parent_session->portal;
 
@@ -1099,6 +1147,7 @@ xdp_input_capture_session_disable (XdpInputCaptureSession *session)
   GVariantBuilder options;
 
   g_return_if_fail (_xdp_input_capture_session_is_valid (session));
+  g_debug ("%s", G_STRFUNC);
 
   g_variant_builder_init (&options, G_VARIANT_TYPE_VARDICT);
 
@@ -1130,6 +1179,7 @@ release_session (XdpInputCaptureSession *session,
   GVariantBuilder options;
 
   g_return_if_fail (_xdp_input_capture_session_is_valid (session));
+  g_debug ("%s", G_STRFUNC);
 
   g_variant_builder_init (&options, G_VARIANT_TYPE_VARDICT);
   g_variant_builder_add (&options, "{sv}", "activation_id", g_variant_new_uint32 (activation_id));
@@ -1170,6 +1220,7 @@ xdp_input_capture_session_release (XdpInputCaptureSession *session,
                                    guint                   activation_id)
 {
   g_return_if_fail (_xdp_input_capture_session_is_valid (session));
+  g_debug ("%s", G_STRFUNC);
 
   release_session (session, activation_id, FALSE, 0, 0);
 }
@@ -1190,6 +1241,7 @@ xdp_input_capture_session_release_at (XdpInputCaptureSession *session,
                                       gdouble                 cursor_y_position)
 {
   g_return_if_fail (_xdp_input_capture_session_is_valid (session));
+  g_debug ("%s", G_STRFUNC);
 
   release_session (session, activation_id, TRUE, cursor_x_position, cursor_y_position);
 }

and the resulting log indicates that my theory in https://github.com/flatpak/libportal/issues/169#issuecomment-2343256907 is likely to have been correct:

pyportaltest/test_inputcapture.py::TestInputCapture::test_session_create_no_serial_on_getzones DEBUG | templates.inputcapture: Loading parameters: dbus.Dictionary({dbus.String('zone-set'): dbus.Int32(-1, variant_level=1)}, signature=dbus.Signature('sv'))
(process:69833): libportal-DEBUG: 00:19:04.025: xdp_portal_create_input_capture_session: new call 0x2b026a30
(process:69833): libportal-DEBUG: 00:19:04.025: create_session: 0x2b026a30
(process:69833): libportal-DEBUG: 00:19:04.025: prep_call: 0x2b026a30
(process:69833): libportal-DEBUG: 00:19:04.025: prep_call: 0x2b026a30 subscribing, already=0
(process:69833): libportal-DEBUG: 00:19:04.025: create_session: 0x2b026a30 calling CreateSession
(process:69833): libportal-DEBUG: 00:19:04.025: create_session: 0x2b026a30 now in-flight 1
DEBUG | templates: Request created at /org/freedesktop/portal/desktop/request/1_64/portal180914104
DEBUG | templates: Session created at /org/freedesktop/portal/desktop/session/1_64/portal224463642
DEBUG | templates.inputcapture: CreateSession with response Response(response=0, results={'capabilities': dbus.UInt32(15, variant_level=1), 'session_handle': dbus.ObjectPath('/org/freedesktop/portal/desktop/session/1_64/portal224463642')})
DEBUG | templates: Request.Response on /org/freedesktop/portal/desktop/request/1_64/portal180914104: Response(response=0, results={'capabilities': dbus.UInt32(15, variant_level=1), 'session_handle': dbus.ObjectPath('/org/freedesktop/portal/desktop/session/1_64/portal224463642')})
(process:69833): libportal-DEBUG: 00:19:04.027: session_created: 0x2b026a30 -> response code 0
(process:69833): libportal-DEBUG: 00:19:04.027: session_created: 0x2b026a30 unsubscribing
(process:69833): libportal-DEBUG: 00:19:04.027: prep_call: 0x2b026a30
(process:69833): libportal-DEBUG: 00:19:04.027: prep_call: 0x2b026a30 subscribing, already=0
(process:69833): libportal-DEBUG: 00:19:04.027: get_zones: 0x2b026a30 calling GetZones
(process:69833): libportal-DEBUG: 00:19:04.027: get_zones: 0x2b026a30 now in-flight 2
(process:69833): libportal-DEBUG: 00:19:04.027: call_returned: 0x2b026a30
(process:69833): libportal-DEBUG: 00:19:04.027: call_returned: 0x2b026a30 now in-flight 1
(process:69833): libportal-DEBUG: 00:19:04.027: call_returned: 0x2b026a30 -> success, continue
DEBUG | templates: Request created at /org/freedesktop/portal/desktop/request/1_64/portal1697371653
DEBUG | templates.inputcapture: GetZones with response Response(response=0, results={'zones': dbus.Array([dbus.Struct((1920, 1080, 0, 0), signature=dbus.Signature('uuii'))], signature=dbus.Signature('(uuii)'), variant_level=1)})
DEBUG | templates: Request.Response on /org/freedesktop/portal/desktop/request/1_64/portal1697371653: Response(response=0, results={'zones': dbus.Array([dbus.Struct((1920, 1080, 0, 0), signature=dbus.Signature('uuii'))], signature=dbus.Signature('(uuii)'), variant_level=1)})
(process:69833): libportal-DEBUG: 00:19:04.028: get_zones_done: 0x2b026a30 -> response code 0
(process:69833): libportal-DEBUG: 00:19:04.028: get_zones_done: 0x2b026a30 unsubscribing

(process:69833): libportal-WARNING **: 00:19:04.028: Faulty portal implementation, missing GetZone's zone_set or zones
(process:69833): libportal-DEBUG: 00:19:04.028: xdp_portal_create_input_capture_session_finish -> (nil)
(process:69833): libportal-DEBUG: 00:19:04.028: get_zones_done: 0x2b026a30 -> returned error
(process:69833): libportal-DEBUG: 00:19:04.028: call_free: 0x2b026a30

(process:69833): libportal-CRITICAL **: 00:19:04.028: call_free: assertion 'call->in_flight == 0' failed

If I remove the assertion g_return_if_fail (call->in_flight == 0), then I can see that call_free() is called (with the CreateSession() call still waiting for its reply) and the Call is freed; but then we get a result for CreateSession() (because the mock portal service disconnected from the bus without replying to CreateSession(), if I'm reading correctly) which has the Call as its user data, which accesses freed memory, which goes about as well as you might expect.

I think this is going to need some sort of reference-counting (so that the Call stays alive for as long as there is at least one GDBus method call in-flight with it as the user-data), and perhaps separating the concept of "the Call is still ongoing" from "the Call has not been freed" (so that when the Call is no longer useful, it unsubscribes from signals and generally cancels as much as possible, but it remains a valid pointer in some sort of "tombstone" state for as long as it is the user-data on at least one GDBus method call).

smcv avatar Feb 10 '25 00:02 smcv