Segfault in Citus during AppendCopyRowData / intermediate results (memcpy invalid length in textsend)
Bug description
We observed a segmentation fault in Postgres+Citus when running a distributed query that uses intermediate results.
The crash happens inside AppendCopyRowData() in multi_copy.c: the send function for a text column (textsend) calls appendBinaryStringInfo() with an invalid length of 168302937 bytes (about 160 MiB), and the resulting memcpy() segfaults.
Versions
- PostgreSQL: 14.18
- Citus: 11.2.2
- Extensions also loaded: pg_stat_statements, pg_qualstats, pgaudit, pg_wait_sampling
- OS: Ubuntu "20.04.6 LTS (Focal Fossa)"
Steps to reproduce
- Execute a distributed query that produces intermediate results (involving FunctionScan and RemoteFileDestReceiver).
- The crash occurs when Citus tries to serialize rows for the distributed executor.
- It happens when a user triggers a specific workload.
Note: Can provide SQL and schema along with logs privately if needed.
Backtrace Full
#1 0x00005636d881dd74 in memcpy (__len=168302937, __src=0x5637133b64ac, __dest=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:34
#2 appendBinaryStringInfo (str=0x7ffd025e1aa0, data=0x5637133b64ac "actual rows=1 loops=1)\n -> Function Scan on read_intermediate_result intermediate_result (actual rows=6 loops=1)\n", datalen=168302937)
at ./build/../src/common/stringinfo.c:235
#3 0x00005636d87af437 in textsend (fcinfo=<optimized out>) at ./build/../src/backend/utils/adt/varlena.c:605
#4 0x00005636d87dee41 in FunctionCall1Coll (flinfo=0x563712b51028, collation=<optimized out>, arg1=<optimized out>) at ./build/../src/backend/utils/fmgr/fmgr.c:1138
#5 0x00005636d87dfb72 in SendFunctionCall (flinfo=<optimized out>, val=<optimized out>) at ./build/../src/backend/utils/fmgr/fmgr.c:1636
#6 0x00007f19e2cc3b70 in AppendCopyRowData (valueArray=valueArray@entry=0x563712b50848, isNullArray=isNullArray@entry=0x563712b50860, rowDescriptor=rowDescriptor@entry=0x563712b50428,
rowOutputState=rowOutputState@entry=0x563712b50b48, columnOutputFunctions=columnOutputFunctions@entry=0x563712b50ff8, columnCoercionPaths=columnCoercionPaths@entry=0x0)
at ./src/backend/distributed/commands/multi_copy.c:1509
#7 0x00007f19e2cf971d in RemoteFileDestReceiverReceive (slot=<optimized out>, dest=0x5637132392f8) at ./src/backend/distributed/executor/intermediate_results.c:425
#8 0x00005636d85411ca in ExecutePlan (dest=0x5637132392f8, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, queryDesc=0x56371305d648)
at ./build/../src/backend/executor/execMain.c:1586
#9 standard_ExecutorRun (queryDesc=queryDesc@entry=0x56371305d648, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=execute_once@entry=false)
at ./build/../src/backend/executor/execMain.c:360
#10 0x00007f19e2cfb1b1 in CitusExecutorRun (queryDesc=0x56371305d648, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./src/backend/distributed/executor/multi_executor.c:238
#11 0x00007f19e29779a2 in pgss_ExecutorRun (queryDesc=0x56371305d648, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1001
#12 0x00007f19e2918505 in ?? () from /usr/lib/postgresql/14/lib/pg_qualstats.so
#13 0x00007f19e290ebc1 in pgaudit_ExecutorRun_hook (queryDesc=0x56371305d648, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at ./pgaudit.c:1437
#14 0x00007f19e290323c in ?? () from /usr/lib/postgresql/14/lib/pg_wait_sampling.so
#15 0x00005636d86b9299 in PortalRunSelect (portal=0x563712a8a1d8, forward=<optimized out>, count=0, dest=<optimized out>) at ./build/../src/backend/tcop/pquery.c:919
#16 0x00005636d86ba5b8 in PortalRun (portal=portal@entry=0x563712a8a1d8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=false, run_once=run_once@entry=true, dest=dest@entry=0x5637132392f8,
altdest=altdest@entry=0x5637132392f8, qc=0x0) at ./build/../src/backend/tcop/pquery.c:763
#17 0x00007f19e2cfbaea in ExecutePlanIntoDestReceiver (queryPlan=0x563713349598, params=0x0, dest=0x5637132392f8) at ./src/backend/distributed/executor/multi_executor.c:699
#18 0x00007f19e2cfdfd3 in ExecuteSubPlans (distributedPlan=distributedPlan@entry=0x5637132267d8) at ./src/backend/distributed/executor/subplan_execution.c:81
#19 0x00007f19e2cf29b4 in AdaptiveExecutorPreExecutorRun (scanState=0x563712b3f4b8) at ./src/backend/distributed/executor/adaptive_executor.c:778
#20 0x00007f19e2cfb170 in CitusExecutorRun (queryDesc=0x56371327e4e8, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./src/backend/distributed/executor/multi_executor.c:231
#21 0x00007f19e29779a2 in pgss_ExecutorRun (queryDesc=0x56371327e4e8, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1001
#22 0x00007f19e2918505 in ?? () from /usr/lib/postgresql/14/lib/pg_qualstats.so
#23 0x00007f19e290ebc1 in pgaudit_ExecutorRun_hook (queryDesc=0x56371327e4e8, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at ./pgaudit.c:1437
#24 0x00007f19e290323c in ?? () from /usr/lib/postgresql/14/lib/pg_wait_sampling.so
#25 0x00005636d86b9299 in PortalRunSelect (portal=0x563712a8a0c8, forward=<optimized out>, count=0, dest=<optimized out>) at ./build/../src/backend/tcop/pquery.c:919
#26 0x00005636d86ba5b8 in PortalRun (portal=0x563712a8a0c8, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x563712967278, altdest=0x563712967278, qc=0x7ffd025e2c20)
at ./build/../src/backend/tcop/pquery.c:763
#27 0x00005636d86b7a21 in PostgresMain () at ./build/../src/backend/tcop/postgres.c:2180
#28 0x00005636d8635ceb in BackendRun (port=<optimized out>, port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4541
#29 BackendStartup (port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4263
#30 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1750
#31 0x00005636d8636b5a in PostmasterMain (argc=<optimized out>, argv=0x56371295f960) at ./build/../src/backend/postmaster/postmaster.c:1422
#32 0x00005636d83ae148 in main (argc=17, argv=0x56371295f960) at ./build/../src/backend/main/main.c:211
Notes
The immediate crash is inside appendBinaryStringInfo(), reached from AppendCopyRowData() (multi_copy.c:1509) via textsend. The length passed is 168302937 bytes (about 160 MiB), likely due to a corrupted or miscalculated text datum length. Possibly related to COPY serialisation of function scan results in distributed queries.
Thanks for reporting this, @upmanish. Have you encountered the issue with other Citus versions?
Note: Can provide SQL and schema along with logs privately if needed.
Yes, knowing the SQL and the schema would be helpful in reproducing this. Also the table size(s) and Citus cluster size. Thanks!
Hey @colm-mchugh Appreciate your response.
Here are the additional details:
- Citus cluster: 2.1 TB
- Sharded table size: 85 GB
- Number of shards: 24
Offending SQL:
WITH all_records AS (
    SELECT t.metric_value AS data0,
           CASE WHEN t.metric_value > $5 THEN $6 ELSE $7 END AS data1,
           t.category_id AS col0
    FROM main_table t
    WHERE t.org_id = $1
      AND (t.event_timestamp BETWEEN $2 AND $3)
),
category_aggregation AS (
    SELECT SUM(data0) AS agg0, ROUND(AVG(data1)) AS agg1, col0
    FROM all_records
    GROUP BY col0
    ORDER BY agg0 DESC
    LIMIT $4
),
overall_summary AS (
    SELECT COALESCE(SUM(data0), $8), COALESCE(ROUND(AVG(data1)), $9), $10::uuid
    FROM all_records
)
SELECT * FROM category_aggregation
UNION ALL
SELECT * FROM overall_summary;
Table definition
CREATE TABLE public.main_table (
    org_id uuid NOT NULL,
    category_id uuid NOT NULL,
    status integer,
    created_at timestamp without time zone,
    modified_at timestamp without time zone,
    event_timestamp timestamp without time zone,
    score_percentage real,
    metric_value integer,
    completed_at timestamp without time zone
);
Thanks @upmanish, this will really help in reproducing. One other detail: how many nodes are in your Citus cluster?
Could you try removing the other extensions (pg_stat_statements, pg_qualstats, pgaudit, pg_wait_sampling) from shared_preload_libraries and reproducing the crash? This would help us narrow down the investigation.