Citus may have led to PANIC in Abort[Sub]Transaction - Investigate Abort[Sub]Transaction functions for possible bugs
The following coredump was generated on a Postgres server where Citus is installed. My initial reaction was that this core-dump doesn't involve any Citus functions, so it is probably not relevant to Citus.
However, after talking with Andres, he thinks that this core-dump was likely caused by Citus (or some other extension). The server is PG 11.10.
The crash happens when more than 5 exceptions are thrown from Abort[Sub]Transaction callbacks. It is OK to throw an error from an AbortTransaction callback, but we should only throw it if the underlying problem has been solved / can be solved. If the underlying problem is not solved and we throw more than 5 errors via recursive calls during AbortTransaction, Postgres cannot handle it anymore and PANICs.
(gdb) bt
#0 0x00007f85f7c33438 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f85f7c3503a in __GI_abort () at abort.c:89
#2 0x000000000084f169 in errfinish (dummy=<optimized out>) at elog.c:555
#3 0x00000000008534a7 in elog_start (filename=filename@entry=0x89de97 "xact.c", lineno=lineno@entry=4785, funcname=funcname@entry=0x8a1c30 <__func__.28854> "AbortSubTransaction")
at elog.c:1319
#4 0x00000000004faa05 in AbortSubTransaction () at xact.c:4784
#5 0x00000000004faf75 in AbortCurrentTransaction () at xact.c:3163
#6 0x0000000000745785 in PostgresMain (argc=1, argv=argv@entry=XXX, dbname="XXXXX", username=0x23ffbc8 "XXX") at postgres.c:3968
#7 0x0000000000481677 in BackendRun (port=0x23b7230) at postmaster.c:4402
#8 BackendStartup (port=0x23b7230) at postmaster.c:4066
#9 ServerLoop () at postmaster.c:1728
#10 0x00000000006d768b in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xdbc220) at postmaster.c:1401
#11 0x00000000004825ab in main (argc=1, argv=0xdbc220) at main.c:228
As a side note, if Citus hits an error in an AbortTransaction callback, it is often better to throw PANIC at that point unless the problem can be resolved, because with this backtrace we have lost all the information about what caused the core-dump.
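To make the failure mode concrete, here is a minimal, hypothetical sketch of an extension transaction callback; this is not Citus' actual code, and the cleanup_after_abort_failed() helper is invented for illustration. The "more than 5 errors" limit comes from elog.c, whose error-data stack is only ERRORDATA_STACK_SIZE (5) entries deep.

/* A minimal, hypothetical sketch; not Citus' actual callback code. */
#include "postgres.h"
#include "access/xact.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/* placeholder for whatever cleanup the extension does on abort */
static bool
cleanup_after_abort_failed(void)
{
	return false;
}

static void
sketch_xact_callback(XactEvent event, void *arg)
{
	if (event != XACT_EVENT_ABORT)
		return;

	/*
	 * We are called from AbortTransaction().  Throwing ERROR here re-enters
	 * abort processing; since elog.c's error stack is only
	 * ERRORDATA_STACK_SIZE (5) entries deep, repeated errors end in PANIC
	 * and the original cause of the abort is lost.
	 */
	if (cleanup_after_abort_failed())
	{
		/*
		 * Either degrade gracefully (e.g. WARNING and defer the cleanup),
		 * or, if the state is truly unrecoverable, PANIC right away so the
		 * core-dump still points at the real problem.
		 */
		elog(PANIC, "could not clean up aborted transaction state");
	}
}

void
_PG_init(void)
{
	RegisterXactCallback(sketch_xact_callback, NULL);
}

The point is that an abort callback should either recover from its own failures or escalate to PANIC immediately, rather than ereport(ERROR), which re-enters abort processing.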
Another relevant core-dump:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000003d3a620 in ?? ()
(gdb) bt
#0 0x0000000003d3a620 in ?? ()
#1 0x00007f285d634ee8 in plpgsql_subxact_cb (event=<optimized out>,
mySubid=2, parentSubid=<optimized out>, arg=<optimized out>)
at pl_exec.c:8263
#2 0x00000000004faad2 in CallSubXactCallbacks (parentSubid=1, mySubid=2,
event=SUBXACT_EVENT_ABORT_SUB) at xact.c:3456
#3 AbortSubTransaction () at xact.c:4827
#4 0x00000000004faf75 in AbortCurrentTransaction () at xact.c:3163
#5 0x0000000000745785 in PostgresMain (argc=1, argv=argv@entry=0x31d45d0,
dbname=0x31d4500 "XXXX", username=0x31d44e8 "XXXXX") at postgres.c:3968
#6 0x0000000000481677 in BackendRun (port=0x3195cf0) at postmaster.c:4402
#7 BackendStartup (port=0x3195cf0) at postmaster.c:4066
#8 ServerLoop () at postmaster.c:1728
#9 0x00000000006d768b in PostmasterMain (argc=argc@entry=1,
argv=argv@entry=0x229e330) at postmaster.c:1401
#10 0x00000000004825ab in main (argc=1, argv=0x229e330) at main.c:228
This suggests that plpgsql procedures/functions and/or savepoints are very likely involved in the crash.
We saw this in one more cluster as well. In that cluster, the following is roughly what happens, although we couldn't reproduce the segfault:
CREATE OR REPLACE PROCEDURE my_proc()
LANGUAGE plpgsql
SECURITY DEFINER
AS $procedure$
BEGIN
    -- in the real workload another procedure is called here and throws an
    -- error; RAISE mimics that behavior
    RAISE EXCEPTION 'error';
EXCEPTION
    WHEN OTHERS THEN
        PERFORM pg_sleep(0.1);
        RAISE; -- probably crashes here
END;
$procedure$;

CALL my_proc();
Open question: Do we need something like CommitContext in the subXact callbacks as well?
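If the answer turns out to be yes, the change would presumably mirror the top-level transaction callback: switch into a long-lived memory context before doing any work in the subxact callback, so the callback never allocates in (or keeps pointers into) a context that the aborting subtransaction is about to delete. The sketch below is only an illustration under that assumption; SubXactCallbackContext and the function names are invented here, not existing Citus code.

/* Hypothetical sketch of a subxact callback using a dedicated context. */
#include "postgres.h"
#include "access/xact.h"
#include "utils/memutils.h"

/* long-lived context, created lazily under TopMemoryContext */
static MemoryContext SubXactCallbackContext = NULL;

static void
sketch_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
						SubTransactionId parentSubid, void *arg)
{
	MemoryContext oldContext;

	if (SubXactCallbackContext == NULL)
		SubXactCallbackContext =
			AllocSetContextCreate(TopMemoryContext,
								  "SubXactCallbackContext",
								  ALLOCSET_DEFAULT_SIZES);

	/*
	 * Do all callback work in a context that outlives the aborting
	 * subtransaction, analogous to switching to CommitContext in the
	 * top-level transaction callback.
	 */
	oldContext = MemoryContextSwitchTo(SubXactCallbackContext);

	if (event == SUBXACT_EVENT_ABORT_SUB)
	{
		/* clean up per-subtransaction state here */
	}

	MemoryContextSwitchTo(oldContext);
}

void
sketch_register_subxact_callback(void)
{
	RegisterSubXactCallback(sketch_subxact_callback, NULL);
}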