
[SEGFAULT] Querying distributed tables with window partition causes segmentation fault

Open superbug1000 opened this issue 1 year ago • 5 comments

Running the following SQL query always causes a segmentation fault in Citus 12 with Postgres 16:

CREATE TABLE gianluca_acc_test(
    id BYTEA PRIMARY KEY
);
CREATE TABLE gianluca_camp_test(
    id BYTEA,
    account_id BYTEA NOT NULL,
    start_timestamp INT,
    PRIMARY KEY(account_id, id)
);

SELECT create_distributed_table('gianluca_acc_test', 'id');
SELECT create_distributed_table('gianluca_camp_test', 'account_id');

INSERT INTO gianluca_acc_test VALUES ('\x5b3f8be2f3fd8f25cf402b46'),('\x5aafffc481d8c13b05606f03');
INSERT INTO gianluca_camp_test VALUES
('\x3b3f8be2f3fd8f25cf402123','\x5aafffc481d8c13b05606f03',10),
('\x3b3f8be2f3fd8f25cf402456','\x5b3f8be2f3fd8f25cf402b46',20),
('\x3b3f8be2f3fd8f25cf402789','\x5aafffc481d8c13b05606f03',NULL);

SELECT
    DISTINCT "gianluca_acc_test"."id",
    MAX("gianluca_camp_test"."start_timestamp") OVER (PARTITION BY "gianluca_camp_test"."id")
FROM
    "gianluca_acc_test"
    LEFT OUTER JOIN "gianluca_camp_test" ON "gianluca_acc_test"."id" = "gianluca_camp_test"."account_id"
WHERE
    "gianluca_acc_test"."id" IN (
        '\x5b3f8be2f3fd8f25cf402b46',
        '\x5aafffc481d8c13b05606f03'
    );

The query executes successfully if:

  • The window partition is removed
  • Only one of the 2 IDs is included in the filter
  • Postgres 14 is used
  • The tables are not distributed
  • The join is removed, like so:
SELECT
    DISTINCT "gianluca_camp_test"."account_id",
    MAX("gianluca_camp_test"."start_timestamp") OVER (PARTITION BY "gianluca_camp_test"."id")
FROM
    "gianluca_camp_test"
WHERE
    "gianluca_camp_test"."account_id" IN (
        '\x5b3f8be2f3fd8f25cf402b46',
        '\x5aafffc481d8c13b05606f03'
    );

Logs:

2024-10-15 08:13:43.555 UTC [8571] LOG:  server process (PID 12396) was terminated by signal 11: Segmentation fault
2024-10-15 08:13:43.555 UTC [8571] DETAIL:  Failed process was running: SELECT
            DISTINCT "gianluca_acc_test"."id",
            MAX("gianluca_camp_test"."start_timestamp") OVER (PARTITION BY encode("gianluca_camp_test"."id", 'hex'))
        FROM
            "gianluca_acc_test"
            LEFT OUTER JOIN "gianluca_camp_test" ON "gianluca_acc_test"."id" = "gianluca_camp_test"."account_id"
        WHERE
            "gianluca_acc_test"."id" IN (
                '\x5b3f8be2f3fd8f25cf402b46',
                '\x5aafffc481d8c13b05606f03'
            );
2024-10-15 08:13:43.555 UTC [8571] LOG:  terminating any other active server processes
2024-10-15 08:13:43.556 UTC [12428] postgres@postgres FATAL:  the database system is in recovery mode
2024-10-15 08:13:43.558 UTC [8571] LOG:  all server processes terminated; reinitializing
2024-10-15 08:13:43.599 UTC [12429] LOG:  database system was interrupted; last known up at 2024-10-15 08:09:03 UTC
2024-10-15 08:13:44.085 UTC [12429] LOG:  database system was not properly shut down; automatic recovery in progress
2024-10-15 08:13:44.093 UTC [12429] LOG:  redo starts at 0/21B52F0
2024-10-15 08:13:44.097 UTC [12429] LOG:  invalid record length at 0/22BC3B0: expected at least 24, got 0
2024-10-15 08:13:44.097 UTC [12429] LOG:  redo done at 0/22BC378 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2024-10-15 08:13:44.112 UTC [12430] LOG:  checkpoint starting: end-of-recovery immediate wait
2024-10-15 08:13:44.223 UTC [12430] LOG:  checkpoint complete: wrote 182 buffers (1.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.015 s, sync=0.069 s, total=0.120 s; sync files=154, longest=0.007 s, average=0.001 s; distance=1052 kB, estimate=1052 kB; lsn=0/22BC3B0, redo lsn=0/22BC3B0

superbug1000 avatar Oct 15 '24 08:10 superbug1000

The versions used on Ubuntu 22.04 are:

ii  postgresql-16-citus-12.1-dbgsym  12.1.5.citus-1                          amd64        debug symbols for postgresql-16-citus-12.1
ii  postgresql-16-dbgsym             16.4-1.pgdg22.04+2                      amd64        debug symbols for postgresql-16

Here's the stack trace for the segfault, triggered by the test described by @superbug1000 above:

(gdb) continue
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000555afa8bae34 in get_eclass_for_sort_expr (root=0x555afc3589c0, expr=0x555afc359218, opfamilies=0x555afc359c68, opcintype=<optimized out>, collation=<optimized out>, sortref=<optimized out>, rel=0x0, create_it=true) at optimizer/path/./build/../src/backend/optimizer/path/equivclass.c:737
737	optimizer/path/./build/../src/backend/optimizer/path/equivclass.c: No such file or directory.
(gdb) bt
#0  0x0000555afa8bae34 in get_eclass_for_sort_expr (root=0x555afc3589c0, expr=0x555afc359218, opfamilies=0x555afc359c68, opcintype=<optimized out>, collation=<optimized out>,
    sortref=<optimized out>, rel=0x0, create_it=true) at optimizer/path/./build/../src/backend/optimizer/path/equivclass.c:737
#1  0x0000555afa8ca5d0 in make_pathkey_from_sortinfo (root=0x555afc3589c0, expr=0x555afc359218, opfamily=1976, opcintype=23, collation=0, reverse_sort=<optimized out>,
    nulls_first=false, sortref=2, rel=0x0, create_it=true) at optimizer/path/./build/../src/backend/optimizer/path/pathkeys.c:233
#2  0x0000555afa8cb147 in make_pathkey_from_sortop (create_it=true, sortref=2, nulls_first=<optimized out>, ordering_op=<optimized out>, expr=0x555afc359218, root=0x555afc3589c0)
    at optimizer/path/./build/../src/backend/optimizer/path/pathkeys.c:275
#3  make_pathkeys_for_sortclauses_extended (root=0x555afc3589c0, sortclauses=0x555afc358bc0, tlist=0x555afc3590f8, remove_redundant=true, sortable=0x7ffe58dbf550)
    at optimizer/path/./build/../src/backend/optimizer/path/pathkeys.c:1190
#4  0x0000555afa8e1c5a in standard_qp_callback (root=<optimized out>, extra=<optimized out>) at optimizer/plan/./build/../src/backend/optimizer/plan/planner.c:3490
#5  0x0000555afa8d9da3 in query_planner (root=0x555afc3589c0, qp_callback=0x555afa8e1b00 <standard_qp_callback>, qp_extra=0x7ffe58dbf6d0)
    at optimizer/plan/./build/../src/backend/optimizer/plan/planmain.c:207
#6  0x0000555afa8dc5df in grouping_planner (tuple_fraction=<optimized out>, root=<optimized out>) at optimizer/plan/./build/../src/backend/optimizer/plan/planner.c:1495
#7  subquery_planner (glob=<optimized out>, parse=<optimized out>, parent_root=<optimized out>, hasRecursion=<optimized out>, tuple_fraction=<optimized out>)
    at optimizer/plan/./build/../src/backend/optimizer/plan/planner.c:1064
#8  0x0000555afa8de3c1 in standard_planner (parse=parse@entry=0x555afc357978, query_string=query_string@entry=0x0, cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0)
    at optimizer/plan/./build/../src/backend/optimizer/plan/planner.c:413
#9  0x00007f41fd9e8420 in BuildSelectStatementViaStdPlanner (combineQuery=combineQuery@entry=0x555afc357978, remoteScanTargetList=remoteScanTargetList@entry=0x555afc358358,
    remoteScan=remoteScan@entry=0x555afc357b88) at ./src/backend/distributed/planner/combine_query_planner.c:304
#10 0x00007f41fd9ea890 in PlanCombineQuery (distributedPlan=<optimized out>, distributedPlan=<optimized out>, remoteScan=0x555afc357b88)
    at ./src/backend/distributed/planner/combine_query_planner.c:117
#11 FinalizeNonRouterPlan (distributedPlan=<optimized out>, distributedPlan=<optimized out>, customScan=0x555afc357b88, localPlan=0x555afc3062b8)
    at ./src/backend/distributed/planner/distributed_planner.c:1489
#12 FinalizePlan (localPlan=0x555afc3062b8, distributedPlan=<optimized out>) at ./src/backend/distributed/planner/distributed_planner.c:1470
#13 0x00007f41fd9ee8e1 in CreateDistributedPlannedStmt (planContext=planContext@entry=0x7ffe58dbfb50) at ./src/backend/distributed/planner/distributed_planner.c:800
#14 0x00007f41fd9eec4e in PlanDistributedStmt (rteIdCounter=<optimized out>, planContext=0x7ffe58dbfb50) at ./src/backend/distributed/planner/distributed_planner.c:685
#15 distributed_planner (parse=0x555afc176f60,
    query_string=0x555afc1754d0 "SELECT\n    DISTINCT \"gianluca_acc_test\".\"id\",\n    MAX(\"gianluca_camp_test\".\"start_timestamp\") OVER (PARTITION BY \"gianluca_camp_test\".\"id\")\nFROM\n    \"gianluca_acc_test\"\n    LEFT OUTER JOIN \"gianluca_c"..., cursorOptions=<optimized out>, boundParams=<optimized out>)
    at ./src/backend/distributed/planner/distributed_planner.c:277
#16 0x0000555afa9be7b9 in planner (boundParams=<optimized out>, cursorOptions=<optimized out>,
    query_string=0x555afc1754d0 "SELECT\n    DISTINCT \"gianluca_acc_test\".\"id\",\n    MAX(\"gianluca_camp_test\".\"start_timestamp\") OVER (PARTITION BY \"gianluca_camp_test\".\"id\")\nFROM\n    \"gianluca_acc_test\"\n    LEFT OUTER JOIN \"gianluca_c"..., parse=0x555afc176f60) at optimizer/plan/./build/../src/backend/optimizer/plan/planner.c:279
#17 pg_plan_query (querytree=0x555afc176f60,
    query_string=query_string@entry=0x555afc1754d0 "SELECT\n    DISTINCT \"gianluca_acc_test\".\"id\",\n    MAX(\"gianluca_camp_test\".\"start_timestamp\") OVER (PARTITION BY \"gianluca_camp_test\".\"id\")\nFROM\n    \"gianluca_acc_test\"\n    LEFT OUTER JOIN \"gianluca_c"..., cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0)
    at tcop/./build/../src/backend/tcop/postgres.c:906
#18 0x0000555afa9be8aa in pg_plan_queries (querytrees=0x555afc2e37c8,
    query_string=query_string@entry=0x555afc1754d0 "SELECT\n    DISTINCT \"gianluca_acc_test\".\"id\",\n    MAX(\"gianluca_camp_test\".\"start_timestamp\") OVER (PARTITION BY \"gianluca_camp_test\".\"id\")\nFROM\n    \"gianluca_acc_test\"\n    LEFT OUTER JOIN \"gianluca_c"..., cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0)
    at tcop/./build/../src/backend/tcop/postgres.c:998
#19 0x0000555afa9c2d65 in exec_simple_query (
    query_string=0x555afc1754d0 "SELECT\n    DISTINCT \"gianluca_acc_test\".\"id\",\n    MAX(\"gianluca_camp_test\".\"start_timestamp\") OVER (PARTITION BY \"gianluca_camp_test\".\"id\")\nFROM\n    \"gianluca_acc_test\"\n    LEFT OUTER JOIN \"gianluca_c"...) at tcop/./build/../src/backend/tcop/postgres.c:1195
#20 0x0000555afa9c5bc7 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at tcop/./build/../src/backend/tcop/postgres.c:4718
#21 0x0000555afa92e475 in BackendRun (port=0x555afc20fec0, port=0x555afc20fec0) at postmaster/./build/../src/backend/postmaster/postmaster.c:4464
#22 BackendStartup (port=0x555afc20fec0) at postmaster/./build/../src/backend/postmaster/postmaster.c:4192
#23 ServerLoop () at postmaster/./build/../src/backend/postmaster/postmaster.c:1782
#24 0x0000555afa924a20 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster/./build/../src/backend/postmaster/postmaster.c:1466
#25 0x0000555afa5e9d5d in main (argc=5, argv=0x555afc0e12d0) at main/./build/../src/backend/main/main.c:198

matikkane avatar Oct 15 '24 10:10 matikkane

FYI - tested with Citus 12.1 & PG 15, no segfault there.

matikkane avatar Oct 15 '24 11:10 matikkane

It also works if the tables in question have only 1 shard:

  • Create the distributed tables like below
DROP TABLE gianluca_acc_test;
DROP TABLE gianluca_camp_test;
CREATE TABLE gianluca_acc_test(
    id BYTEA PRIMARY KEY
);
CREATE TABLE gianluca_camp_test(
    id BYTEA,
    account_id BYTEA NOT NULL,
    start_timestamp INT,
    PRIMARY KEY(account_id, id)
);

SELECT create_distributed_table('gianluca_acc_test', 'id', shard_count => 1);
SELECT create_distributed_table('gianluca_camp_test', 'account_id', shard_count => 1);
  • insert the data and test
INSERT INTO gianluca_acc_test VALUES ('\x5b3f8be2f3fd8f25cf402b46'),('\x5aafffc481d8c13b05606f03');
INSERT INTO gianluca_camp_test VALUES
('\x3b3f8be2f3fd8f25cf402123','\x5aafffc481d8c13b05606f03',10),
('\x3b3f8be2f3fd8f25cf402456','\x5b3f8be2f3fd8f25cf402b46',20),
('\x3b3f8be2f3fd8f25cf402789','\x5aafffc481d8c13b05606f03',NULL);

SELECT
    DISTINCT "gianluca_acc_test"."id",
    MAX("gianluca_camp_test"."start_timestamp") OVER (PARTITION BY encode("gianluca_camp_test"."id", 'hex'))
FROM
    "gianluca_acc_test"
    LEFT OUTER JOIN "gianluca_camp_test" ON "gianluca_acc_test"."id" = "gianluca_camp_test"."account_id"
WHERE
    "gianluca_acc_test"."id" IN (
        '\x5b3f8be2f3fd8f25cf402b46',
        '\x5aafffc481d8c13b05606f03'
    );
    
              id             | max
----------------------------+-----
 \x5aafffc481d8c13b05606f03 |  10
 \x5aafffc481d8c13b05606f03 |
 \x5b3f8be2f3fd8f25cf402b46 |  20
(3 rows)

Here we face no segfault.

nilroy avatar Oct 15 '24 11:10 nilroy

Worked with @thanodnl and found that varnullingrels is causing the issue. Looking at the backtrace, the last Citus function is BuildSelectStatementViaStdPlanner, which calls standard_planner(combineQuery, NULL, 0, NULL).

Running the same query on PG 16.1 and PG 14.5, the combineQuery is the same for both PG versions in the logs:

POSTGRES 16.1
SELECT DISTINCT id, max(max) OVER (PARTITION BY worker_column_3) AS max FROM pg_catalog.citus_extradata_container(10, NULL::cstring(0), NULL::cstring(0), NULL::cstring(0)) remote_scan(id bytea, max integer, worker_column_3 bytea)

POSTGRES 14.5
SELECT DISTINCT id, max(max) OVER (PARTITION BY worker_column_3) AS max FROM pg_catalog.citus_extradata_container(10, NULL::cstring(0), NULL::cstring(0), NULL::cstring(0)) remote_scan(id bytea, max integer, worker_column_3 bytea)

To debug further, put a breakpoint on combine_query_planner.c:304, at this line: standardStmt = standard_planner(combineQuery, NULL, 0, NULL);

Then we can see the entire combineQuery object using pprint. Type p pprint(combineQuery) in gdb and it will be printed in the coordinator logs, like:

   {QUERY
   :commandType 1
   :querySource 0
   :canSetTag true
   :utilityStmt <>
   **<shortening it for readability>**
   :isReturn false
   :cteList <>
   :rtable (
      {RANGETBLENTRY
      :alias <>
   **<shortening it for readability>**
         :funccoltypes (o 17 23 17)
         :funccoltypmods (i -1 -1 -1)
         :funccolcollations (o 0 0 0)
         :funcparams (b)
         }
      )
      :funcordinality false
      :lateral false
      :inh false
      :inFromCl true
      :securityQuals <>
      }
   )
   :rteperminfos <>
   :jointree
   **<shortening it for readability>**
      }
   :mergeActionList <>
   :mergeUseOuterJoin false
   :targetList (
      {TARGETENTRY
      :expr
         {VAR
         :varno 1
         :varattno 1
         :vartype 17
         :vartypmod -1
         :varcollid 0
         :varnullingrels (b)
         :varlevelsup 0
         :varnosyn 1
         :varattnosyn 1
         :location -1
         }
      :resno 1
      :resname id
      :ressortgroupref 1
      :resorigtbl 17262
      :resorigcol 1
      :resjunk false
      }
      {TARGETENTRY
      :expr
         {WINDOWFUNC
         :winfnoid 2116
         :wintype 23
         :wincollid 0
         :inputcollid 0
         :args (
            {VAR
            :varno 1
            :varattno 2
            :vartype 23
            :vartypmod -1
            :varcollid 0
            :varnullingrels (b 3)  // **this is the culprit**
            :varlevelsup 0
            :varnosyn 2
            :varattnosyn 3
            :location 54
            }
         )
         :aggfilter <>
         :winref 1
         :winstar false
         :winagg true
         :location 50
         }
      :resno 2
      :resname max
      :ressortgroupref 2
      :resorigtbl 0
      :resorigcol 0
      :resjunk false
      }
      {TARGETENTRY
      :expr
         {VAR
         :varno 1
         :varattno 3
         :vartype 17
         :vartypmod -1
         :varcollid 0
         :varnullingrels (b)
         :varlevelsup 0
         :varnosyn 1
         :varattnosyn 3
         :location -1
         }
      :resno 3
      :resname <>
      :ressortgroupref 3
      :resorigtbl 0
      :resorigcol 0
      :resjunk true
      }
   )
   :override 0
   :onConflict <>
   :returningList <>
   :groupClause <>
   **<shortening it for readability>**
   :setOperations <>
   :constraintDeps <>
   :withCheckOptions <>
   :stmt_location 0
   :stmt_len 0
   }

varnullingrels was introduced in PG 16, so when running with PG 14 it doesn't show up there. On PG 16, we somehow pass it to standard_planner incorrectly and the segfault happens. Run the command below in gdb to set it to NULL:

(gdb) set ((Var *)(((WindowFunc *)(((TargetEntry *)(combineQuery->targetList->elements[1]->ptr_value))->expr))->args->elements[0]))->varnullingrels=NULL

This is not a suggested workaround for users facing the issue; it documents what we have found, for further debugging and fixing.

Now, after continuing in gdb, the query executes without a segfault and gives the correct output. We should find out how varnullingrels gets added there, where it comes from, and fix it.

Assertion failure case, in case it helps: if you try to reproduce this in a dev environment, you will get an assertion failure instead of a segfault, here: https://github.com/postgres/postgres/blob/master/src/backend/optimizer/path/equivclass.c#L738

In PG 14, however, this line was Assert(rel->reloptkind == RELOPT_BASEREL || rel->reloptkind == RELOPT_DEADREL); so it wouldn't fail. For some reason RELOPT_DEADREL was removed before PG 16, and now this assertion fails.

Disable assertions first to get the segfault.

agedemenli avatar Oct 18 '24 09:10 agedemenli

Note: the removed RELOPT_DEADREL is unlikely to be significant. When the assert hits, we are reading simple_rel_array indexed at 3 while its length is 2, which is probably the cause of the segfault in production; in dev we hit the assert instead.

The 3 comes from the varnullingrels that gets combined into relids with pull_varnos(...) earlier in the function that segfaults. Hence editing varnullingrels with gdb prevents the segfault/assert and the query runs as expected.

thanodnl avatar Oct 18 '24 12:10 thanodnl

Respectfully asking @thanodnl @agedemenli - any estimates for the timeline of producing a patch?

matikkane avatar Oct 24 '24 07:10 matikkane

@matikkane still looking into it, we have a handle on why the problem occurs and some further analysis follows, but it will be next week before we can give a reasonable estimate for a patch.

Building on the previous analysis, the problem occurs when building the distributed plan's target list: if a target can't be pushed down (it fails this check), the optimizer creates a target for execution on the coordinator, which makes a copy of the Var. In the problem query, the window function cannot be pushed down and it uses a column from the inner side of the left outer join, so it has a non-NULL varnullingrels (in Postgres 16+). The optimizer could (should?) use makeVar instead of copyObject when building column references for the non-pushed-down targets. Additionally, we may need to look at other places where Var nodes are copied and what (if anything) varnullingrels means in those places.

This is a variation of the problem query that results in a different error (in a dev environment), although it's essentially the same problem:

SELECT "gianluca_acc_test"."id",
    MAX("gianluca_camp_test"."start_timestamp") OVER (PARTITION BY "gianluca_camp_test"."id")
FROM "gianluca_acc_test"
    LEFT OUTER JOIN "gianluca_camp_test" ON "gianluca_acc_test"."id" = "gianluca_camp_test"."account_id";
ERROR:  wrong varnullingrels (b 3) (expected (b)) for Var 1/2

Note that these queries:

SELECT "gianluca_acc_test"."id", max("gianluca_camp_test"."start_timestamp")
FROM "gianluca_acc_test"
    LEFT OUTER JOIN "gianluca_camp_test" ON "gianluca_acc_test"."id" = "gianluca_camp_test"."account_id" GROUP BY 1;
    
create or replace function foo(int) returns INT language sql stable as $$ select $1 + 42; $$ ;
SELECT "gianluca_acc_test"."id", foo("gianluca_camp_test"."start_timestamp")
FROM "gianluca_acc_test"
    LEFT OUTER JOIN "gianluca_camp_test" ON "gianluca_acc_test"."id" = "gianluca_camp_test"."account_id";

are OK because, even though the second target of each query refers to the inner side of the left outer join, both can be pushed down, so the target created is a Var with an empty varnullingrels.

colm-mchugh avatar Oct 24 '24 17:10 colm-mchugh

@colm-mchugh how far are we from the fix?

nilroy avatar Nov 13 '24 08:11 nilroy

@nilroy the fix has been merged to Citus main and backported to Citus 12.1.

colm-mchugh avatar Nov 13 '24 18:11 colm-mchugh

@colm-mchugh thank you very much

nilroy avatar Nov 14 '24 13:11 nilroy

@colm-mchugh has the change been pushed to the official Citus Docker images?

nilroy avatar Nov 14 '24 13:11 nilroy