
Concurrent updates cause incorrect query results

Open yanboer opened this issue 1 year ago • 9 comments

Bug Report

Greenplum version or build

postgres=# select version();
                                                              version
------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 12.12 (Greenplum Database 7.0.0 build commit:0a7a3566873325aca1789ae6f818c80f17a9402d) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit compiled on Oct 19 2023 16:40:25 Bhuvnesh C.
(1 row)

OS version and uname -a

[postgres@mdw ~]$ uname -a
Linux mdw 5.4.258-1.el7.elrepo.x86_64 #1 SMP Tue Oct 10 22:42:49 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

autoconf options used ( config.status --config )

Installation information ( pg_config )

[postgres@mdw ~]$ pg_config
BINDIR = /usr/local/greenplum-db/bin
DOCDIR = /usr/local/greenplum-db/share/doc/postgresql
HTMLDIR = /usr/local/greenplum-db/share/doc/postgresql
INCLUDEDIR = /usr/local/greenplum-db/include
PKGINCLUDEDIR = /usr/local/greenplum-db/include/postgresql
INCLUDEDIR-SERVER = /usr/local/greenplum-db/include/postgresql/server
LIBDIR = /usr/local/greenplum-db/lib
PKGLIBDIR = /usr/local/greenplum-db/lib/postgresql
LOCALEDIR = /usr/local/greenplum-db/share/locale
MANDIR = /usr/local/greenplum-db/share/man
SHAREDIR = /usr/local/greenplum-db/share/postgresql
SYSCONFDIR = /usr/local/greenplum-db/etc/postgresql
PGXS = /usr/local/greenplum-db/lib/postgresql/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/usr/local/gpdb' '--disable-orca' '--disable-gpcloud' '--disable-gpfdist' '--with-perl' '--with-python' '--with-libxml' 'CFLAGS=-g -O3 -march=native' 'CXXFLAGS=-g -O3 -march=native'
CC = gcc -std=gnu99
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-unused-but-set-variable -g -O3 -march=native  -Werror=uninitialized -Werror=implicit-function-declaration
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/usr/local/gpdb/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lpthread -lbz2 -lxml2 -lrt -lz -lreadline -lrt -lcrypt -ldl -lm  -lcurl   -lzstd
VERSION = PostgreSQL 12.12

Expected behavior

The SELECT issued after the UPDATE returns the updated tuple.

Actual behavior

The SELECT returns the tuple as it was before the UPDATE.

Steps to reproduce the behavior

The table looks like this:

create table test(id int not null primary key, name text);

insert into test select 1, '';

The function looks like this:

CREATE OR REPLACE FUNCTION test_transaction(a INTEGER, b text)
RETURNS VOID
AS $$
declare temp_befor_name text;
declare temp_after_name text;
declare now_time timestamp;
BEGIN
  select name into temp_befor_name from test where id = a;

  update test set name = b where id = a;
  select now() into now_time;
  
  -- perform pg_sleep(0.01);

  select name into temp_after_name from test where id = a;
  
  
  raise notice 'time: %, % update from % to %', now_time, a, temp_befor_name, b;
  if temp_after_name != b then
    raise warning 'time: %, % name befor is: %, after is: %, expect is: %', now_time, a, temp_befor_name, temp_after_name, b;
  end if;

END;
$$ LANGUAGE plpgsql;

test_transaction.sql:

select test_transaction(1, md5(random()::text));

Execute pgbench concurrently (use two terminals to run it at the same time):

pgbench -U postgres postgres -n -T 30 -f test_transaction.sql

When I execute the test_transaction function concurrently, the result of the second select is incorrect. The log is as follows:

2024-02-29 05:05:37.101984 UTC,"postgres","postgres",p19839,th-830736320,"10.10.1.57","54026",2024-02-29 05:05:17 UTC,0,con21437,cmd2628,seg-1,,dx74479809,,sx1,"WARNING","01000","time: 2024-02-29 05:05:37.078065, 1 name befor is: 1edae5f23bcc99ac0dde0e03e962ffbe, after is: 1edae5f23bcc99ac0dde0e03e962ffbe, expect is: 1c07a9d8d3484d0d5ff3d0d7d514f0fb",,,,,"PL/pgSQL function test_transaction(integer,text) line 18 at RAISE","select test_transaction(1, md5(random()::text));",0,,"pl_exec.c",3857,
2024-02-29 05:05:38.183380 UTC,"postgres","postgres",p19838,th-830736320,"10.10.1.57","54024",2024-02-29 05:05:17 UTC,0,con21436,cmd2778,seg-1,,dx74479858,,sx1,"WARNING","01000","time: 2024-02-29 05:05:38.159081, 1 name befor is: 521fc690564cd4272876739422d4a894, after is: 521fc690564cd4272876739422d4a894, expect is: bceaea07dd4ca7a78c894167eebb0273",,,,,"PL/pgSQL function test_transaction(integer,text) line 18 at RAISE","select test_transaction(1, md5(random()::text));",0,,"pl_exec.c",3857,
2024-02-29 05:05:42.266133 UTC,"postgres","postgres",p19839,th-830736320,"10.10.1.57","54026",2024-02-29 05:05:17 UTC,0,con21437,cmd3318,seg-1,,dx74480039,,sx1,"WARNING","01000","time: 2024-02-29 05:05:42.242528, 1 name befor is: 9096cb64212f0f4bcb8e3fee5dd1dd5d, after is: 9096cb64212f0f4bcb8e3fee5dd1dd5d, expect is: e29fba909d5d174222d51f821e3de822",,,,,"PL/pgSQL function test_transaction(integer,text) line 18 at RAISE","select test_transaction(1, md5(random()::text));",0,,"pl_exec.c",3857,
2024-02-29 05:05:45.660711 UTC,"postgres","postgres",p19839,th-830736320,"10.10.1.57","54026",2024-02-29 05:05:17 UTC,0,con21437,cmd3774,seg-1,,dx74480191,,sx1,"WARNING","01000","time: 2024-02-29 05:05:45.636774, 1 name befor is: cc433f55b2df1a5743547dde60b53c32, after is: cc433f55b2df1a5743547dde60b53c32, expect is: 6e8858215ee32a47a419ff7d429e484d",,,,,"PL/pgSQL function test_transaction(integer,text) line 18 at RAISE","select test_transaction(1, md5(random()::text));",0,,"pl_exec.c",3857,
2024-02-29 05:05:46.321639 UTC,"postgres","postgres",p19839,th-830736320,"10.10.1.57","54026",2024-02-29 05:05:17 UTC,0,con21437,cmd3864,seg-1,,dx74480221,,sx1,"WARNING","01000","time: 2024-02-29 05:05:46.297218, 1 name befor is: 549cfe3cb5ab0bcade2696a9aa3e8964, after is: 549cfe3cb5ab0bcade2696a9aa3e8964, expect is: 369bb1f0c41f0f384dc9d25e402f141e",,,,,"PL/pgSQL function test_transaction(integer,text) line 18 at RAISE","select test_transaction(1, md5(random()::text));",0,,"pl_exec.c",3857,
2024-02-29 05:06:01.919896 UTC,"postgres","postgres",p19912,th-830736320,"10.10.1.57","54180",2024-02-29 05:06:01 UTC,0,con21458,cmd60,seg-1,,dx74480289,,sx1,"WARNING","01000","time: 2024-02-29 05:06:01.89425, 1 name befor is: 4ebc2315c608231241400dd9d8306669, after is: 4ebc2315c608231241400dd9d8306669, expect is: 638d481ee70a67f18291cb0e653f150a",,,,,"PL/pgSQL function test_transaction(integer,text) line 18 at RAISE","select test_transaction(1, md5(random()::text));",0,,"pl_exec.c",3857,
2024-02-29 05:06:02.885246 UTC,"postgres","postgres",p19912,th-830736320,"10.10.1.57","54180",2024-02-29 05:06:01 UTC,0,con21458,cmd186,seg-1,,dx74480331,,sx1,"WARNING","01000","time: 2024-02-29 05:06:02.859857, 1 name befor is: aac839b4839784d4f96d9e8b4f21137a, after is: aac839b4839784d4f96d9e8b4f21137a, expect is: 64d6c4a9c38f0707744dc46dc51e65f1",,,,,"PL/pgSQL function test_transaction(integer,text) line 18 at RAISE","select test_transaction(1, md5(random()::text));",0,,"pl_exec.c",3857,

It seems that the SELECT cannot see the result of the current transaction's UPDATE?
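
For reference, within a single session this visibility is guaranteed under read committed; a minimal sketch against the table above (the 'new' value is just illustrative):

begin;
select name from test where id = 1;    -- returns the old value
update test set name = 'new' where id = 1;
select name from test where id = 1;    -- must return 'new'
commit;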

environment: (same version() and uname -a output as above)

postgres=# show transaction_isolation ;
 transaction_isolation
-----------------------
 read committed
(1 row)

yanboer avatar Feb 29 '24 06:02 yanboer

Thanks @yanboer for reporting. We will look into it. Quick question:

  • what is the value of the GUC gp_enable_global_deadlock_detector?
  • what is the value of the GUC optimizer in your system?

kainwen avatar Feb 29 '24 08:02 kainwen

[postgres@mdw log]$ gpconfig -s gp_enable_global_deadlock_detector
Values on all segments are consistent
GUC          : gp_enable_global_deadlock_detector
Coordinator value: on
Segment value: on

[postgres@mdw log]$ gpconfig -s optimizer
Values on all segments are consistent
GUC          : optimizer
Coordinator value: off
Segment value: off

yanboer avatar Feb 29 '24 08:02 yanboer

Reproduced this bug with the gp_enable_global_deadlock_detector GUC enabled:

gpconfig -c gp_enable_global_deadlock_detector -v on

nohup pgbench -U smart postgres -n -T 30 -c 32 -j 2 -f test_transaction.sql > greenplum_pgbench.log 2>&1

grep -i "WARNING" greenplum_pgbench.log | wc -l
336

After turning it off, no inconsistency shows up, so I suspect the culprit is the implementation of the global deadlock detector.
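
For anyone re-running this: the GUC values can also be checked from psql (note that changing gp_enable_global_deadlock_detector requires a cluster restart):

postgres=# show gp_enable_global_deadlock_detector;
postgres=# show optimizer;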

SmartKeyerror avatar Mar 01 '24 02:03 SmartKeyerror

What are the negative effects of turning off gp_enable_global_deadlock_detector? Locking behavior, for example?

yanboer avatar Mar 01 '24 04:03 yanboer

What are the negative effects of turning off gp_enable_global_deadlock_detector? Locking behavior, for example?

It impacts OLTP performance a lot: without the detector, UPDATE and DELETE on a heap table must hold a high-level lock to make sure no global deadlock can happen.
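
As a hedged illustration of that lock difference (the lock modes below are what the documented behavior suggests, not output captured from this thread), pg_locks can be inspected while an update is in flight:

-- session A: start an update and keep the transaction open
begin;
update test set name = 'x' where id = 1;

-- session B: check the relation-level lock taken by the update
select mode, granted from pg_locks where relation = 'test'::regclass;
-- with gp_enable_global_deadlock_detector = off: expect ExclusiveLock,
--   so concurrent updates on the same table serialize at the table level;
-- with it on: expect RowExclusiveLock, allowing row-level concurrency
--   (which is what makes races like this issue possible).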

kainwen avatar Mar 01 '24 04:03 kainwen

Thanks for the reply.

yanboer avatar Mar 01 '24 06:03 yanboer

The root cause of this problem is described in PR https://github.com/greenplum-db/gpdb/pull/9452, but that PR only returns waitGxids after the prepare stage. However, we should return the waitGxids for every DML statement that has had a lock wait in the QD:

[diagram: returning waitGxids for every DML statement that waited on a lock]

Here is a simple diagram to explain what is going on in this issue. T1 and T2 both update the same tuple, call it Tuple-0; T1 updates first, so T2 waits for the tuple lock held by T1.

After T1 commits, the QE releases the lock on Tuple-0 and wakes up T2 to continue. T2 executes faster than T1, which is still returning its commit result to the QD, and runs another SELECT statement. When T2 takes its distributed snapshot, it regards T1 as still in progress (T1 has not yet returned its commit result to the QD), so it gets the wrong result from the QE.

The SQL statement execution order will be:

T1: update test set name = 'foo' where id = 1;
T1: execute two-phase commit, prepare success

T2: update test set name = 'bar' where id = 1;
-- waits for T1 to release the lock on Tuple-0

T1: execute commit stage; the QE releases the lock and returns the result to the QD
T2: acquire the lock, update the tuple
T2: run the select statement; T1 is not globally committed (though it is locally committed)

The resolution is quite simple, as the below diagram shows:

[diagram: the proposed resolution]

For the one-phase commit optimization, we have just one solution:

[diagram: the solution for the one-phase optimized commit]

However, even after adding this patch (fix-concurrent-update), we still cannot pass the pgbench test.

After some investigation, I found that patch #13810 is not correct: we cannot regard a distributed transaction whose gxid is greater than or equal to xmax as invisible:

if (gxid == InvalidDistributedTransactionId || gxid >= xmax)
    continue;

And the code in DistributedSnapshotWithLocalMapping_CommittedTest is not correct either:

	/*
	 * Any xid >= xmax is in-progress, distributed xmax points to the
	 * latestCompletedGxid + 1.
	 */
	if (distribXid >= ds->xmax)
	{
		elog((Debug_print_snapshot_dtm ? LOG : DEBUG5),
			 "distributedsnapshot committed but invisible: distribXid "UINT64_FORMAT" dxmax "UINT64_FORMAT" dxmin "UINT64_FORMAT" distribSnapshotId %d",
			 distribXid, ds->xmax, ds->xmin, ds->distribSnapshotId);

		return DISTRIBUTEDSNAPSHOT_COMMITTED_INPROGRESS;
	}

The fundamental reason is that updates to the distributed transaction status are inconsistent between the QD and the QEs: the QD always lags behind the QEs' status updates.
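
To watch that lag from the outside, one could compare the QD's view of in-progress distributed transactions against the QEs' local state; a sketch assuming the gp_distributed_xacts view is available on this build:

postgres=# select * from gp_distributed_xacts;
-- lists distributed transactions the QD still considers in progress;
-- during the window described above, a transaction can appear here even
-- though its local transaction on the QE has already committed.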

However, when we remove this part of the code, the problem still exists.

The third problem is:

[diagram: the third problem]

T2's global transaction runs much faster than T1's: its update completes before T1 updates Tuple-0, but T2 is slower to return the commit result to the QD. Meanwhile, T1 completes its update and executes the select statement. At this point, although T2's local transaction has committed, T2's global transaction status has not been updated yet, so T1 thinks T2 is still executing and likewise gets the wrong result.
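
Mirroring the earlier listing, a sketch of the statement order for this third problem (statement text illustrative):

T2: update test set name = 'bar' where id = 1;
T2: commit; the QE commits locally, but the result has not reached the QD yet

T1: update test set name = 'foo' where id = 1;
-- HeapTupleSatisfiesUpdate() on the QE already sees T2's locally committed version

T1: run the select statement; T2 is not globally committed (though it is locally committed), wrong result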


This is all the information and progress I have at this time; there is no conclusion yet on how to fix the third problem.

SmartKeyerror avatar Mar 18 '24 07:03 SmartKeyerror

Hello, I am not very familiar with Greenplum's code, so I can hardly give any specific suggestions.

Thank you for your contribution to this issue.

yanboer avatar Mar 18 '24 09:03 yanboer

Adding more notes after running some tests and reading the related code.

Reproduce

Simpler reproduction steps (with gdb):

-- enable GDD and set isolation level to RC first

-- create test table
demo=# create table t (id int, val text);
demo=# insert into t values(1, 'init');

-- session1 (T1)
demo=# begin;

-- session2 (T2)
gdb attached to QD:
  b at cdbtm.c:527 and continue

demo=# update t set val='two' where id=1;

then QD stop at:
  Thread 1 "postgres" hit Breakpoint 1, doNotifyingOnePhaseCommit () at cdbtm.c:527
  527		if (!succeeded)

don't continue, just let the QD stop here:
  So the QE's local txn has been committed, but the QD has not yet processed the response.
  This simulates the response-delay scenario: the global txn is still in progress.

-- back to session1
-- cannot see the committed rows of T2: that's OK, since the global txn is still in progress
demo=# select gp_segment_id,xmin,xmax,* from t;
 gp_segment_id | xmin  | xmax  | id | val
---------------+-------+-------+----+------
             1 | 21421 | 21422 |  1 | init
(1 row)

demo=# update t set val='one' where id=1;
UPDATE 1

-- reproduce the abnormal behavior: should not see the 'init' tuple
demo=# select gp_segment_id,xmin,xmax,* from t;
 gp_segment_id | xmin  | xmax  | id | val
---------------+-------+-------+----+------
             1 | 21421 | 21422 |  1 | init
             1 | 21423 | 21423 |  1 | one
(2 rows)

demo=# set gp_select_invisible to on;
SET
demo=# select gp_segment_id,xmin,xmax,* from t;
 gp_segment_id | xmin  | xmax  | id | val
---------------+-------+-------+----+------
             1 | 21421 | 21422 |  1 | init
             1 | 21422 | 21423 |  1 | two
             1 | 21423 | 21423 |  1 | one
(3 rows)

-- still see it even after T1 has finished
demo=# commit;
COMMIT
demo=# set gp_select_invisible to off;
SET
demo=# select gp_segment_id,xmin,xmax,* from t;
 gp_segment_id | xmin  | xmax  | id | val
---------------+-------+-------+----+------
             1 | 21421 | 21422 |  1 | init
             1 | 21423 | 21423 |  1 | one
(2 rows)

-- back to normal after letting T2's QD continue
  gdb --pid=154062
  Continuing.

demo=# select gp_segment_id,xmin,xmax,* from t;
 gp_segment_id | xmin  | xmax  | id | val
---------------+-------+-------+----+-----
             1 | 21423 | 21423 |  1 | one
(1 row)

Explanation

Why did this happen? In brief, because of the inconsistent status between the global txn and the local txn (as Zhenglong wrote above).

The update statement of T1:

-- update t set val='one' where id=1;
Plan:
    Update on t
       ->  Seq Scan on t  Filter: (id = 1)
  • Seq Scan node: uses HeapTupleSatisfiesMVCC() and returns the 'init' tuple to the Update node (since T2's global txn status is still in progress)
  • Update node: wants to delete the 'init' tuple and add a new one, but it uses HeapTupleSatisfiesUpdate() (no DistribSnapshot) and notices the 'init' tuple is TM_Updated, so it steps to the latest row version, the 'two' tuple, deletes it ('two'), and adds the new tuple ('one').

The following select statement of T1:

init --> deleted by T2, but T2 is in progress, can see it 
two --> deleted by T1, cannot see it
one --> added by T1, can see it

Postgres has no such problem: since it has no global txn, a tuple's related txn status is already settled (committed or aborted) by the time HeapTupleSatisfiesUpdate() finds it is TM_Updated.
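
A hedged sketch of the corresponding flow on vanilla PostgreSQL, where the recheck of the updated row (EvalPlanQual) always lands on an already-settled transaction:

-- session A:
begin;
update test set name = 'two' where id = 1;

-- session B: blocks on the row lock held by session A
update test set name = 'one' where id = 1;

-- session A:
commit;
-- session B wakes up, rechecks the latest committed row version, applies
-- its update, and a subsequent
--   select name from test where id = 1;
-- in session B returns 'one'; it never returns a stale value.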

Fix

How to fix it? Here are my initial thoughts:

First, we need to decide: should the select statement of T1 treat tuple "two" as committed?

  • Option 1: treat it as in progress (not committed yet); then T1 has to block on tuple "two" until T2's status has been updated on the QD.
  • Option 2: treat it as committed and do not block; then T1 has to revise T2's status to committed on the QE.

Another important point: since we must be aware of the inconsistent status of a txn, how do we detect it, and where should the detection logic live (e.g., in HeapTupleSatisfiesUpdate or HeapTupleSatisfiesMVCC)?

Performance should be considered last (make it right first).

Anyway, let's think the problem through and get a complete solution, rather than patches case by case.

(I already have some ideas for a fix; I will post them after doing more verification.)

Others

If we set the isolation level to REPEATABLE READ, the update statement of T1 reports an error: could not serialize access due to concurrent update.
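
A minimal two-session sketch of that REPEATABLE READ behavior (standard PostgreSQL semantics):

-- session A:
begin isolation level repeatable read;
select name from test where id = 1;    -- snapshot is taken here

-- session B:
update test set name = 'other' where id = 1;    -- autocommits

-- session A:
update test set name = 'mine' where id = 1;
-- ERROR:  could not serialize access due to concurrent update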

interma avatar Apr 26 '24 09:04 interma

Following up on my https://github.com/greenplum-db/gpdb/issues/17134#issuecomment-2079032878, here is my fix idea:

  1. Choose option 2: treat the tuple as committed and do not block; T1 revises T2's status to committed on the QE.
  2. Detect the abnormal tuple during the update (in heapam_tuple_update()) and continue the update, treating it as committed.
  3. Report the gxid to the QD at the end of the current transaction command.
  4. The QD gets these gxids from the dispatched results, then waits for the corresponding Gtxs to finish. (Steps 3 and 4 reuse the sendWaitGxidsToQD() and GxactLockTableWait() mechanism from https://github.com/greenplum-db/gpdb/commit/39fbfe96c9edbc38b0d3d7cc7b8e05447f56eb5e.)
  5. After all awaited Gtxs have finished, the QD continues to execute the next transaction command (the MVCC snapshot is correct now).

TL;DR: detect the abnormal Gtx status in the current transaction command and let the QD wait for the necessary Gtxs to finish; the MVCC snapshot is then correct for the next transaction command.

And here is my demo fix code (it is rough, surely has bugs, and needs refinement later; it is just to show the idea): https://github.com/greenplum-db/gpdb/compare/main...interma:my_tempfix_concurrent_udpate?expand=1

It can pass the test script of this issue.

(But it still fails the test case of https://github.com/greenplum-db/gpdb/issues/17203; I think that is another abnormal scenario, and I will take a look later.)

This issue is a very hard one; I have posted all of this as a basis for discussion. Feel free to dig in and discuss.

interma avatar May 06 '24 08:05 interma