
Error: "attempted to update invisible tuple" in truncate command

interma opened this issue 10 months ago • 12 comments

Bug Report

While working on https://github.com/greenplum-db/gpdb/pull/17200, we found a test case that reported:

ERROR:  spiexceptions.ObjectNotInPrerequisiteState: attempted to update invisible tuple

We reproduced it by adding a similar test to the latest 7X code (only the test case is added, with no changes to GP code): https://github.com/greenplum-db/gpdb/compare/main...yaowangm:gpdb:v7_trunc_error

The same error in the dev pipeline: https://dev.ci.gpdb.pivotal.io/teams/main/pipelines/wayao_trunc_error/jobs/icw_gporca_resgroup_rocky8/builds/1

This error is very similar to https://github.com/greenplum-db/gpdb/commit/3fbe189e45166765ac0e0c695e7eecf7aa49d9a6, but here it occurs on a heap table.

Greenplum version or build

latest 7X, top commit: https://github.com/greenplum-db/gpdb/commit/e747b2e1c2a67fd84a8213ccd5c261869c77d772 2024-04-01

OS version and uname -a

autoconf options used ( config.status --config )

Installation information ( pg_config )

Expected behavior

No errors

Actual behavior

Error: "attempted to update invisible tuple"

Steps to reproduce the behavior

Please see the code diff.

A simple reproduction using isolation2:

create table t1(a int);
create table t2(a int);

1: begin;
1: select txid_current();
1: end;

2: begin;
2: select txid_current();

3: begin;
3: savepoint s1;
3: truncate t2;

4: begin;
4: select txid_current();
4: end;

5: begin;
5: select txid_current();

6: select * from t1 join (select oid from pg_class) x(a) on x.a = t1.a;

3: savepoint s2;
3: truncate t2;

1q:
2q:
3q:
4q:
5q:
6q:

Save the above SQL in a file a.sql, and then run:

python src/test/isolation2/sql_isolation_testcase.py < ~/a.sql

...
...


3: savepoint s2;
SAVEPOINT
3: truncate t2;
ERROR:  attempted to update invisible tuple

1q: ... <quitting>
2q: ... <quitting>
3q: ... <quitting>
4q: ... <quitting>
5q: ... <quitting>
6q: ... <quitting>

interma avatar Apr 01 '24 06:04 interma

Besides the error above, there is another error on https://dev.ci.gpdb.pivotal.io/teams/main/pipelines/wayao_trunc_error/jobs/icw_gporca_jit_rocky8/builds/1:

diff -I HINT: -I CONTEXT: -I GP_IGNORE: -U3 /tmp/build/e18b2f02/gpdb_src/src/test/regress/expected/gp_sort.out /tmp/build/e18b2f02/gpdb_src/src/test/regress/results/gp_sort.out
--- /tmp/build/e18b2f02/gpdb_src/src/test/regress/expected/gp_sort.out	2024-04-01 03:21:45.118432879 +0000
+++ /tmp/build/e18b2f02/gpdb_src/src/test/regress/results/gp_sort.out	2024-04-01 03:21:45.122433151 +0000

...
+DETAIL:  Key (oid)=(43510) already exists.
+ERROR:  spiexceptions.UniqueViolation: duplicate key value violates unique constraint "pg_class_oid_index"
+CONTEXT:  Traceback (most recent call last):
+  PL/Python function "gp_mksort_test", line 22, in <module>
+    plpy.execute("truncate gp_mksort_test_table;")
+PL/Python function "gp_mksort_test"

I suppose they are due to the same root cause.

yaowangm avatar Apr 01 '24 07:04 yaowangm

The issue is easy to reproduce via the pipeline (though not 100% of the time), but difficult in my local env.

Repro:

Under src/test/regress, create a new file my_schedule:

test: test_setup
test: gp_aggregates gp_aggregates_costs gp_metadata variadic_parameters default_parameters function_extensions spi gp_xml shared_scan update_gp triggers_gp returning_gp gp_types combocid_gp gp_sort gp_prepared_xacts gp_backend_info gp_locale foreign_key_gp

Then keep running the script:

./pg_regress --init-file=init_file --load-extension=gp_inject_fault --schedule=my_schedule

After repeating for a couple of minutes, we may see gp_sort fail.

I got the core dump by adding an extra assertion to the error code path:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f4c60d17859 in __GI_abort () at abort.c:79
#2  0x00005572b1c2cef0 in ExceptionalCondition (conditionName=conditionName@entry=0x5572b21d1c89 "!(0)",
    errorType=errorType@entry=0x5572b21cc00b "FailedAssertion", fileName=fileName@entry=0x5572b21eda92 "heapam.c",
    lineNumber=lineNumber@entry=3226) at assert.c:44
#3  0x00005572b1699c05 in heap_update_internal (relation=relation@entry=0x7f4c5307bdf0,
    otid=otid@entry=0x5572bb2669ac, newtup=newtup@entry=0x5572bb2669a8, cid=<optimized out>,
    crosscheck=crosscheck@entry=0x0, wait=wait@entry=true, tmfd=0x7ffe64a9df30, lockmode=0x7ffe64a9df2c, simple=true)
    at heapam.c:3226
#4  0x00005572b169b2c2 in simple_heap_update (relation=relation@entry=0x7f4c5307bdf0, otid=otid@entry=0x5572bb2669ac,
    tup=tup@entry=0x5572bb2669a8) at heapam.c:4108
#5  0x00005572b17759ba in CatalogTupleUpdate (heapRel=heapRel@entry=0x7f4c5307bdf0, otid=otid@entry=0x5572bb2669ac,
    tup=tup@entry=0x5572bb2669a8) at indexing.c:286
#6  0x00005572b1c15840 in RelationSetNewRelfilenode (relation=relation@entry=0x7f4c53030b18, persistence=112 'p')
    at relcache.c:3856
#7  0x00005572b187cbdf in ExecuteTruncateGuts (explicit_rels=explicit_rels@entry=0x5572bb2666a0,
    relids=<optimized out>, relids@entry=0x5572bb266700, relids_logged=relids_logged@entry=0x0,
    behavior=DROP_RESTRICT, restart_seqs=<optimized out>, stmt=stmt@entry=0x5572bb265fd0) at tablecmds.c:2333
#8  0x00005572b187d11d in ExecuteTruncate (stmt=stmt@entry=0x5572bb265fd0) at tablecmds.c:2116
#9  0x00005572b1ae5c35 in standard_ProcessUtility (pstmt=pstmt@entry=0x5572bb266218,
    queryString=queryString@entry=0x7f4c528aa560 "truncate gp_mksort_test_table;",
    context=context@entry=PROCESS_UTILITY_QUERY, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0,
    dest=dest@entry=0x5572b27871e0 <spi_printtupDR>, completionTag=0x7ffe64a9e340 "") at utility.c:637
#10 0x00005572b1ae6793 in ProcessUtility (pstmt=pstmt@entry=0x5572bb266218,
    queryString=0x7f4c528aa560 "truncate gp_mksort_test_table;", context=PROCESS_UTILITY_QUERY,
    params=params@entry=0x0, queryEnv=0x0, dest=dest@entry=0x5572b27871e0 <spi_printtupDR>,
    completionTag=0x7ffe64a9e340 "") at utility.c:403
#11 0x00005572b1916707 in _SPI_execute_plan (plan=plan@entry=0x7ffe64a9e3e0, paramLI=paramLI@entry=0x0,
    snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=false,
    fire_triggers=fire_triggers@entry=true, tcount=0) at spi.c:2554
#12 0x00005572b1916b3e in SPI_execute (src=src@entry=0x7f4c528aa560 "truncate gp_mksort_test_table;",
    read_only=<optimized out>, tcount=tcount@entry=0) at spi.c:637
#13 0x00007f4c52ffc05b in PLy_spi_execute_query (query=<optimized out>, limit=0) at plpy_spi.c:353
#14 0x00007f4c52ffc7f8 in PLy_spi_execute (self=<optimized out>, args=0x7f4c52895f10) at plpy_spi.c:180
#15 0x00007f4c52d23687 in ?? () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#16 0x00007f4c52d23b1b in _PyObject_MakeTpCall () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#17 0x00007f4c52aefdf3 in ?? () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#18 0x00007f4c52af7ef6 in _PyEval_EvalFrameDefault () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#19 0x00007f4c52afb06b in ?? () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#20 0x00007f4c52aefd6d in ?? () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#21 0x00007f4c52af146d in _PyEval_EvalFrameDefault () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#22 0x00007f4c52c45e3b in _PyEval_EvalCodeWithName () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#23 0x00007f4c52c461c2 in PyEval_EvalCodeEx () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#24 0x00007f4c52c465af in PyEval_EvalCode () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
#25 0x00007f4c52ff5be4 in PLy_procedure_call (proc=proc@entry=0x5572bb267920,
    kargs=kargs@entry=0x7f4c530007ad "args", vargs=<optimized out>) at plpy_exec.c:1060
#26 0x00007f4c52ff7591 in PLy_exec_function (fcinfo=fcinfo@entry=0x5572b9d3ca58, proc=0x5572bb267920)
    at plpy_exec.c:108
#27 0x00007f4c52ff85ee in plpython3_call_handler (fcinfo=0x5572b9d3ca58) at plpy_main.c:364
#28 0x00005572b18c5603 in ExecInterpExpr (state=0x5572b9d3c970, econtext=0x5572b9d3c660, isnull=0x7ffe64a9ef27)
    at execExprInterp.c:639
#29 0x00005572b18c219b in ExecInterpExprStillValid (state=0x5572b9d3c970, econtext=0x5572b9d3c660,
    isNull=0x7ffe64a9ef27) at execExprInterp.c:1864
#30 0x00005572b1904dce in ExecEvalExprSwitchContext (isNull=0x7ffe64a9ef27, econtext=0x5572b9d3c660,
    state=0x5572b9d3c970) at ../../../src/include/executor/executor.h:362
#31 ExecProject (projInfo=0x5572b9d3c968) at ../../../src/include/executor/executor.h:396
#32 ExecResult (pstate=0x5572b9d3c548) at nodeResult.c:157
#33 0x00005572b18d4023 in ExecProcNodeGPDB (node=node@entry=0x5572b9d3c548) at execProcnode.c:645
#34 0x00005572b18d40d8 in ExecProcNodeFirst (node=0x5572b9d3c548) at execProcnode.c:600
#35 0x00005572b18ca0df in ExecProcNode (node=0x5572b9d3c548) at ../../../src/include/executor/executor.h:271
#36 ExecutePlan (estate=estate@entry=0x5572b9d3c240, planstate=0x5572b9d3c548, use_parallel_mode=<optimized out>,
    operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=0,
    direction=ForwardScanDirection, dest=0x5572b9a72908, execute_once=true) at execMain.c:2672
#37 0x00005572b18cafbf in standard_ExecutorRun (queryDesc=queryDesc@entry=0x5572b9bc3080,
    direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=execute_once@entry=true)
    at execMain.c:978
#38 0x00005572b18cb22e in ExecutorRun (queryDesc=queryDesc@entry=0x5572b9bc3080,
    direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=<optimized out>)
    at execMain.c:793
#39 0x00005572b1ae1ce1 in PortalRunSelect (portal=portal@entry=0x5572b9b6e900, forward=forward@entry=true, count=0,
    count@entry=9223372036854775807, dest=dest@entry=0x5572b9a72908) at pquery.c:1147
#40 0x00005572b1ae3db9 in PortalRun (portal=portal@entry=0x5572b9b6e900, count=count@entry=9223372036854775807,
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x5572b9a72908,
    altdest=altdest@entry=0x5572b9a72908, completionTag=0x7ffe64a9f450 "") at pquery.c:985
#41 0x00005572b1addc8f in exec_simple_query (
    query_string=query_string@entry=0x5572b9a71550 "select gp_mksort_test();") at postgres.c:1922
#42 0x00005572b1ae0e7a in PostgresMain (argc=<optimized out>, argv=argv@entry=0x5572b9a9e980, dbname=<optimized out>,
    username=<optimized out>) at postgres.c:5393
#43 0x00005572b1a39ac7 in BackendRun (port=port@entry=0x5572b9a96680) at postmaster.c:4985
#44 0x00005572b1a3c068 in BackendStartup (port=port@entry=0x5572b9a96680) at postmaster.c:4662
#45 0x00005572b1a3dcf8 in ServerLoop () at postmaster.c:2016
#46 0x00005572b1a3f4c0 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster.c:1631
#47 0x00005572b19508da in main (argc=5, argv=0x5572b9a6b1f0) at main.c:240

It seems the catalog table was damaged under a particular concurrency scenario. More investigation is needed.

yaowangm avatar Apr 01 '24 12:04 yaowangm

6X repro:

I tried to reproduce the issue on 6X in both the local env and the pipeline, but have not succeeded so far.

I merged the gp_sort test case to 6X and kept running the following script for a couple of hours:

num=1
while [ $num -le 1000000 ]
do
    echo $num
    num=$(( $num + 1 ))
    ./pg_regress --init-file=init_file --load-extension=gp_inject_fault --schedule=my_schedule | tee ./reg_result.txt
    res=`grep "gp_sort" ./reg_result.txt | grep "FAILED"`
    if [ -n "$res" ]
    then
        echo "got it!!!"
        break
    fi
done

But it didn't reproduce.

I also ran the pipeline with the test case: https://dev.ci.gpdb.pivotal.io/teams/main/pipelines/wayao_v6_trunc_error and all is OK.

We cannot draw a conclusion yet: I removed some tests from the schedule because they don't exist on 6X, and the concurrency scenario is subtle. It may be that the error doesn't exist on 6X, or that the specific concurrency conditions have changed.

I'd suggest checking the issue on 6X after we get the RCA; by then we should know more about how to reproduce it.

yaowangm avatar Apr 03 '24 13:04 yaowangm

After many hours, I am now able to reproduce this bug in my local env with the help of GDB.

I used rr to record the bug and then found the RCA.


RCA

Session 1: a query that has subtransactions

  • The 1st subtransaction truncates the table, finishes, and commits, introducing a new tuple in pg_class with (xmin=subtrans_xid, xmax=0)
  • The 2nd subtransaction has not yet truncated the table

Session 2: a query with an entryDB slice, which will seqscan pg_class

The entryDB QE is a reader gang:

	/* reader gang copy local snapshot from writer gang */
	if (SharedLocalSnapshotSlot != NULL &&
		(distributedTransactionContext == DTX_CONTEXT_QE_READER ||
		 distributedTransactionContext == DTX_CONTEXT_QE_ENTRY_DB_SINGLETON))
	{
		readerFillLocalSnapshot(snapshot, distributedTransactionContext);
		return snapshot;
	}

readerFillLocalSnapshot --> copyLocalSnapshot(snapshot), which will set:

	snapshot->curcid = SharedLocalSnapshotSlot->snapshot.curcid;
	snapshot->subxcnt = -1;

which means the snapshot of this session does not consider subtransactions. Blaming the code, this appears to be very old code.
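
To make the consequence concrete, here is a minimal, hypothetical sketch of a snapshot-only XID check, loosely modeled on XidInMVCCSnapshot (the function name and structure are illustrative, not the real GPDB code). With no usable subtransaction entries in the copied reader snapshot, an in-progress subtransaction XID is reported as already finished:

/* Assumes PostgreSQL internal headers: postgres.h, access/transam.h, utils/snapshot.h.
 * Returns true if xid was still running when the snapshot was taken. */
static bool
xid_in_snapshot_sketch(TransactionId xid, Snapshot snapshot)
{
    uint32  i;
    int32   j;

    /* anything at or beyond xmax had not yet started: still "in snapshot" */
    if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
        return true;

    /* top-level XIDs recorded as running when the snapshot was taken */
    for (i = 0; i < snapshot->xcnt; i++)
        if (TransactionIdEquals(xid, snapshot->xip[i]))
            return true;

    /*
     * Subtransaction XIDs are kept in snapshot->subxip.  Per the RCA above,
     * a reader-gang snapshot copied with subxcnt = -1 has nothing usable
     * here, so this loop never matches and an in-progress subtransaction
     * XID falls through.
     */
    for (j = 0; j < snapshot->subxcnt; j++)
        if (TransactionIdEquals(xid, snapshot->subxip[j]))
            return true;

    return false;   /* treated as finished -- this is what triggers the bug */
}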

Now session 2 scans the pg_class tuple created by session 1, with subtrans_xid in its xmin. While testing MVCC visibility, session 2 finds that subtrans_xid is not in its MVCC snapshot and tries to set a hint bit on that tuple marking xmin invalid:

			snapshotCheckResult =
				XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot,
								  ((tuple->t_infomask2 & HEAP_XMIN_DISTRIBUTED_SNAPSHOT_IGNORE) != 0),
								  &setDistributedSnapshotIgnore);
			if (setDistributedSnapshotIgnore)
			{
				tuple->t_infomask2 |= HEAP_XMIN_DISTRIBUTED_SNAPSHOT_IGNORE;
				markDirty(buffer, relation, tuple, /* isXmin */ true);
			}
			if (snapshotCheckResult == XID_IN_SNAPSHOT)
				return false;
			else if (snapshotCheckResult == XID_SURELY_COMMITTED || TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
				SetHintBits(tuple, buffer, relation, HEAP_XMIN_COMMITTED,
							HeapTupleHeaderGetRawXmin(tuple));
			else
			{
				/* it must have aborted or crashed */
				SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
							InvalidTransactionId);
				return false;
			}

Now control comes back to session 1: in the 2nd subtransaction it tries to do a catalog heap update, but the tuple has had its hint bits set incorrectly, which leads to the error.
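
For orientation, the error message itself comes from the visibility check near the top of heap_update() in heapam.c; a trimmed sketch of that path (approximate, quoted from memory):

/* The updater re-checks visibility of the old catalog tuple.  Because the
 * reader wrongly stamped the tuple's hint bits, the tuple now looks
 * invisible to its own (sub)transaction, and the update errors out with the
 * message seen in this issue (surfaced by SPI as ObjectNotInPrerequisiteState). */
result = HeapTupleSatisfiesUpdate(&oldtup, cid, buffer);

if (result == TM_Invisible)
{
    UnlockReleaseBuffer(buffer);
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("attempted to update invisible tuple")));
}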

kainwen avatar Apr 04 '24 15:04 kainwen

I think this bug was introduced while merging the upstream PG 9.6 code: https://github.com/greenplum-db/gpdb-postgres-merge/commit/8a7d0701814a4e293efad22091d6f6fb441bbe1c

In GP6 (PG 9.4), the check always scans the shared-memory procarray, so subtransactions are considered.
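
For comparison, the pre-merge (PG 9.4-era) xmin handling in HeapTupleSatisfiesMVCC looked roughly like the following (reconstructed from memory, so treat it as approximate); the key difference is the explicit TransactionIdIsInProgress() call before concluding that the xmin aborted:

/* TransactionIdIsInProgress() consults the shared procarray, including the
 * subtransaction XID caches, so an in-progress subtransaction xmin is
 * reported as "still running" instead of being misread as aborted. */
else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
    return false;
else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
    SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                HeapTupleHeaderGetRawXmin(tuple));
else
{
    /* it must have aborted or crashed */
    SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                InvalidTransactionId);
    return false;
}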

kainwen avatar Apr 04 '24 16:04 kainwen

Just for reference and to keep context: before commit https://github.com/greenplum-db/gpdb/commit/b52ca70fdfde78d4de0cd31b3a9cf7855c7a7239, SharedLocalSnapshot used to carry subtransaction information as well. It could go stale in some scenarios, as mentioned in that commit, and hence was removed; instead, a pointer to the writer's PGPROC was maintained to perform the lookup. However, now that the code has been replaced to use the snapshot rather than the procarray, that logic is broken.

ashwinstar avatar Apr 05 '24 01:04 ashwinstar

For more info, after discussing offline with @kainwen (thank you for the RCA of this hard issue), I am sure I hit the same case on xmax. The related code in HeapTupleSatisfiesMVCC is also pasted here:

1236     if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED))
1237     {           
1238         if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
1239         {                       
1240             if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
1241                 return true;    /* deleted after scan started */
1242             else
1243                 return false;   /* deleted before scan started */
1244         }
1245 
1246         snapshotCheckResult = XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot,
1247                                        ((tuple->t_infomask2 & HEAP_XMAX_DISTRIBUTED_SNAPSHOT_IGNORE) != 0),
1248                                        &setDistributedSnapshotIgnore);
1249         if (setDistributedSnapshotIgnore)
1250         {
1251             tuple->t_infomask2 |= HEAP_XMAX_DISTRIBUTED_SNAPSHOT_IGNORE;
1252             markDirty(buffer, relation, tuple, /* isXmin */ false);
1253         }
1254         if (snapshotCheckResult == XID_IN_SNAPSHOT)
1255             return true;
1256 
1257         if (!(snapshotCheckResult == XID_SURELY_COMMITTED || TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple))))
1258         {              
1259             /* it must have aborted or crashed */
1260             SetHintBits(tuple, buffer, relation, HEAP_XMAX_INVALID,
1261                         InvalidTransactionId);
1262             return true;
1263         }       
1264     
1265         /* xmax transaction committed */
1266         SetHintBits(tuple, buffer, relation, HEAP_XMAX_COMMITTED,
1267                     HeapTupleHeaderGetRawXmax(tuple));
1268     }           

The behavior is that the uncommitted xmax (the current subxact's xid) is not in the snapshot and TransactionIdDidCommit returns false, yet the actual subxact status is in-progress (line# 1257). In this situation the xmax is treated as aborted or crashed, so the hint bit HEAP_XMAX_INVALID is set. As a result, a corrupt tuple with a visible but "invalid" xmax is left in pg_class, even though the xmax eventually commits. This corrupt record can get even worse after a CHECKPOINT: a later query on the table pointed to by this record fails with ERROR: could not open file "base/21313/17523": No such file or directory.

I tried out a straightforward workaround: simply wait while the xmax is in progress until it reaches a final status, commit or abort. Refer to the patch here for details; it appears to fix this issue, but it seemingly breaks the following upstream rule:

There should never be a case where XidInMVCCSnapshot says a transaction is
done while TransactionIdIsInProgress says it's still running.  The other
way around is quite possible though.
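
For illustration only, the workaround idea above might look something like the sketch below, inserted conceptually before the abort/crash branch at line# 1257 of the pasted code. This is not the actual patch, and a production version would presumably use XactLockTableWait() rather than a sleep loop:

/* Hypothetical sketch of "wait until the xmax reaches a final status". */
TransactionId   xmax = HeapTupleHeaderGetRawXmax(tuple);

if (snapshotCheckResult != XID_IN_SNAPSHOT &&
    !TransactionIdDidCommit(xmax) &&
    TransactionIdIsInProgress(xmax))
{
    /*
     * The snapshot claims the deleter is done, but the procarray says it is
     * still running (the broken case).  Wait for it to finish instead of
     * stamping HEAP_XMAX_INVALID on a live deleter.
     */
    while (TransactionIdIsInProgress(xmax))
        pg_usleep(1000L);   /* 1 ms */
}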

A question in my mind: the code logically allows an in-progress status to be returned but forces it to be treated as aborted (guarded only by a comment). Is that still true for GP, and how do we verify it?

haolinw avatar Apr 08 '24 08:04 haolinw

A very simple local reproduction using isolation2:

create table t1(a int);
create table t2(a int);

1: begin;
1: select txid_current();
1: end;

2: begin;
2: select txid_current();

3: begin;
3: savepoint s1;
3: truncate t2;

4: begin;
4: select txid_current();
4: end;

5: begin;
5: select txid_current();

6: select * from t1 join (select oid from pg_class) x(a) on x.a = t1.a;

3: savepoint s2;
3: truncate t2;

1q:
2q:
3q:
4q:
5q:
6q:

kainwen avatar Apr 09 '24 04:04 kainwen

A question in my mind: the code logically allows an in-progress status to be returned but forces it to be treated as aborted (guarded only by a comment). Is that still true for GP, and how do we verify it?

I dug more and have an answer to this question:

As the comments on TransactionIdIsInProgress indicate, looking up the clog is the slowest way to tell whether a transaction is in progress, so there were other, faster ways to check the status from the PGXACT or PGPROC structures; that is why the old code did consider the in-progress status. Later, another fast path, consulting XidInMVCCSnapshot, was introduced in commit https://github.com/greenplum-db/gpdb-postgres-merge/commit/8a7d0701814a4e293efad22091d6f6fb441bbe1c, replacing the call to TransactionIdIsInProgress.

That makes sense for performance, so the rule needs to be followed in GP as well. Besides considering subxact status in the snapshot check, one minor piece of additional work may be to ensure that consulting the snapshot covers all the cases TransactionIdIsInProgress covered; we may need to introduce some assertion code (e.g. via the slow path TransactionLogFetch?) to catch outlaw scenarios.
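
A possible shape for such an assertion-only cross-check, purely illustrative (it uses TransactionIdIsInProgress as the slow path and is gated to assert-enabled builds, as discussed in the next comment):

#ifdef USE_ASSERT_CHECKING
/*
 * The upstream rule quoted above: whenever the snapshot claims an XID is
 * finished, the slow path must agree that it is not still in progress.
 */
if (snapshotCheckResult != XID_IN_SNAPSHOT)
    Assert(!TransactionIdIsInProgress(HeapTupleHeaderGetRawXmax(tuple)));
#endif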

haolinw avatar Apr 09 '24 05:04 haolinw

one minor piece of additional work may be to ensure that consulting the snapshot covers all the cases TransactionIdIsInProgress covered; we may need to introduce some assertion code (e.g. via the slow path TransactionLogFetch?) to catch outlaw scenarios.

I like the idea of cross-checking via the slow path in assert-enabled builds, though we need to make sure the code introduced to do that doesn't cause too much pain during later upstream merges. Also, this is a very hot area of code, so extreme caution is needed, and everything should be introduced for assert-enabled builds only.

ashwinstar avatar Apr 10 '24 19:04 ashwinstar

Fixed by https://github.com/greenplum-db/gpdb/pull/17327. I also added the case from this issue to that PR.

kainwen avatar Apr 20 '24 11:04 kainwen

Reopening this issue since the first fix was reverted; we will have a new solution.

adam8157 avatar Apr 26 '24 02:04 adam8157