
Citus may have led to PANIC in Abort[Sub]Transaction - Investigate the Abort[Sub]Transaction functions for possible bugs

Open onderkalaci opened this issue 3 years ago • 4 comments

The following core dump was generated on a Postgres server where Citus is installed. My initial reaction was that the backtrace doesn't involve any Citus functions, hence it is probably not relevant to Citus.

However, after talking with Andres, he thinks that this core dump was likely caused by Citus (or some other extension). The server is PG 11.10.

The crash occurs when more than 5 errors are thrown in AbortTransaction callbacks. It is OK to throw an error during AbortTransaction, but we should only do so if the underlying problem has been solved or can be solved. If the underlying problem is not solved and we throw more than 5 errors via recursive calls during AbortTransaction, Postgres cannot handle any more of them and PANICs.
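
For illustration, here is a toy C model of the failure mode described above. It is not Postgres's actual elog.c logic: `MAX_NESTED_ERRORS`, `AbortOutcome`, and `simulate_abort` are hypothetical names, and the limit of 5 is taken from the description above rather than from the Postgres source.

```c
#include <assert.h>

/* Assumed limit, taken from the "more than 5 errors" figure in this report. */
#define MAX_NESTED_ERRORS 5

typedef enum { ABORT_RECOVERED, ABORT_PANIC } AbortOutcome;

/*
 * Hypothetical model of abort processing. Each failing attempt raises one
 * more error while the previous ones are still unresolved, and error
 * recovery then re-enters abort processing. Once the number of unresolved
 * errors exceeds the limit, the backend gives up and PANICs.
 */
AbortOutcome simulate_abort(int failing_attempts)
{
    int errors_in_flight = 0;   /* errors raised during abort, not yet cleared */

    assert(failing_attempts >= 0);

    for (int attempt = 0; attempt < failing_attempts; attempt++) {
        errors_in_flight++;     /* the abort callback throws again */
        if (errors_in_flight > MAX_NESTED_ERRORS)
            return ABORT_PANIC; /* too many nested errors to track */
        /* otherwise recovery re-enters abort and the loop continues */
    }

    /* The callback eventually succeeded, so the errors can be cleared. */
    return ABORT_RECOVERED;
}
```

In this model `simulate_abort(5)` still recovers, while `simulate_abort(6)` ends in a PANIC, and nothing about the original cause is carried along to that point, which matches what we see in the backtrace below.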

(gdb) bt
#0 0x00007f85f7c33438 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f85f7c3503a in __GI_abort () at abort.c:89
#2 0x000000000084f169 in errfinish (dummy=<optimized out>) at elog.c:555
#3 0x00000000008534a7 in elog_start (filename=filename@entry=0x89de97 "xact.c", lineno=lineno@entry=4785, funcname=funcname@entry=0x8a1c30 <__func__.28854> "AbortSubTransaction")
at elog.c:1319
#4 0x00000000004faa05 in AbortSubTransaction () at xact.c:4784
#5 0x00000000004faf75 in AbortCurrentTransaction () at xact.c:3163
#6 0x0000000000745785 in PostgresMain (argc=1, argv=argv@entry=XXX, dbname=xXXXXX", username=0x23ffbc8 "XXX") at postgres.c:3968
#7 0x0000000000481677 in BackendRun (port=0x23b7230) at postmaster.c:4402
#8 BackendStartup (port=0x23b7230) at postmaster.c:4066
#9 ServerLoop () at postmaster.c:1728
#10 0x00000000006d768b in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xdbc220) at postmaster.c:1401
#11 0x00000000004825ab in main (argc=1, argv=0xdbc220) at main.c:228

As a side note, if Citus hits an error in AbortTransaction callbacks, it is often better to throw PANIC at that point unless the problem can be resolved, because with this backtrace we have lost all the information about what caused the core dump.

onderkalaci avatar May 26 '21 09:05 onderkalaci

Another relevant core-dump:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000003d3a620 in ?? ()
(gdb) bt
#0  0x0000000003d3a620 in ?? ()
#1  0x00007f285d634ee8 in plpgsql_subxact_cb (event=<optimized out>,
    mySubid=2, parentSubid=<optimized out>, arg=<optimized out>)
    at pl_exec.c:8263
#2  0x00000000004faad2 in CallSubXactCallbacks (parentSubid=1, mySubid=2,
    event=SUBXACT_EVENT_ABORT_SUB) at xact.c:3456
#3  AbortSubTransaction () at xact.c:4827
#4  0x00000000004faf75 in AbortCurrentTransaction () at xact.c:3163
#5  0x0000000000745785 in PostgresMain (argc=1, argv=argv@entry=0x31d45d0,
    dbname=0x31d4500 "XXXX", username=0x31d44e8 "XXXXX") at postgres.c:3968
#6  0x0000000000481677 in BackendRun (port=0x3195cf0) at postmaster.c:4402
#7  BackendStartup (port=0x3195cf0) at postmaster.c:4066
#8  ServerLoop () at postmaster.c:1728
#9  0x00000000006d768b in PostmasterMain (argc=argc@entry=1,
    argv=argv@entry=0x229e330) at postmaster.c:1401
#10 0x00000000004825ab in main (argc=1, argv=0x229e330) at main.c:228

It is very likely that plpgsql procedures/functions and/or savepoints are involved in this crash.

onderkalaci avatar May 26 '21 09:05 onderkalaci

We saw this in one more cluster as well

onderkalaci avatar Sep 27 '21 13:09 onderkalaci

Seen in one more cluster. This is roughly what happens, but I couldn't repro the segfault:

CREATE OR REPLACE PROCEDURE my_proc()
  LANGUAGE plpgsql
  SECURITY DEFINER
AS $procedure$
BEGIN
  -- the real code calls another procedure here that throws an error, so mimic that behavior
  RAISE EXCEPTION 'error';
EXCEPTION
  WHEN others THEN
    PERFORM pg_sleep(0.1);
    RAISE; -- probably crashes here
END
$procedure$;

call my_proc();

onurctirtir avatar Jul 27 '22 13:07 onurctirtir

Open question: Do we need something like CommitContext in subXact callbacks as well?

onderkalaci avatar Jul 27 '22 15:07 onderkalaci