Support distributed transaction isolation for hot standby
PR description
- This PR first makes two prerequisite changes:
  - Request syncrep for the forget commit in remote_apply mode
  - Write nextGxid as-is in the checkpoint

  Please review each commit for details.
- Then, the GPDB support for read-committed isolation in hot standby is straightforward: we just need to take care of the bookkeeping of distributed transactions: (1) initialize latestCompletedGxid during StartupXLOG and update it while the standby replays new transactions; (2) construct an in-progress dtx array when creating a distributed snapshot, according to the shmCommittedGxidArray[] we already keep in the standby (see the snapshot-construction sketch after this list).
- On top of that, we can support repeatable-read isolation. The only real complication is supporting the BEGIN...END block. The snapshot selection and usage for repeatable-read on a hot standby is exactly the same as on a primary. The main difference between a single-statement transaction and a BEGIN...END block is the DTX context of the QEs: in the former case the QEs are DTX_CONTEXT_QE_AUTO_COMMIT_IMPLICIT, while in the latter case they are DTX_CONTEXT_QE_TWO_PHASE_EXPLICIT_WRITER (see setupQEDtxContext()). We had Assert/ERROR code assuming that for EXPLICIT_WRITER there is always a valid distributed xid for the transaction. However, that is not the case for hot standby: a standby never allocates an xid, and there is definitely no use of an xid in its BEGIN...END block. Therefore, all we need to do is make sure this assumption is not applied to hot standby (see the explicit-writer sketch after this list). After that, supporting repeatable-read is a no-op.
- Finally, we will enable the upstream hot standby regress tests and run them in the pipeline.
- Another "add-on" feature is query conflict detection and cancellation. For the most part, GPDB does not need to do much more than what the upstream code already provides. Please see the last two commits for the related changes & tests.
Bookkeeping of distributed running transactions
A major chunk of the effort for the upstream hot standby support was bookkeeping the state of transactions that are running on the primary. In GPDB, we run distributed transactions, but for hot standby support we do not actually need to do a lot to emulate a "distributed running transaction", at least ATM. Here, we discuss the differences in the C structs/variables needed for bookkeeping non-distributed vs. distributed transactions (a simplified replay-side sketch follows the list):
- Firstly, we are not emulating KnownAssignedXids for dtx, since we already have the shmCommittedGxidArray which can serve the isolation purpose.
- Secondly, we don't need to emulate XLOG_RUNNING_XACTS, which records the running transactions at checkpoint, since we already have that for dtx as part of the extended checkpoint (see TMGXACT_CHECKPOINT), which has been used for dtx recovery. That same information can be used by the standby for dtx bookkeeping across checkpoints. Any dtx not in that list is regarded as either not started or already committed, and we can safely ignore it for the purpose of creating a distributed snapshot.
  - A few more words about the dtx that are "not started": that includes dtx that have been assigned a gxid but are "not-prepared". This is different from upstream, which views a transaction as "in-progress" as long as an xid has been assigned. We are OK with ignoring "not-prepared" dtx because those dtx do not have their gxid written in WAL, so the standby will not be able to conduct a dtx visibility check anyway. We will only rely on the local visibility check in that case.
- Thirdly, we don't need to emulate the other kinds of information we record for normal transactions (refer to RunningTransactionsData):
  - subxcnt and subxid_overflow: there's no concept of sub-dtx.
  - latestCompletedXid: initializing latestCompletedGxid to nextGxid seems to be enough for MVCC.
  - nextXid: used for updating latestObservedXid. It doesn't seem like a latestObservedGxid will be needed for distributed snapshot creation.
  - oldestRunningXid: used for pruning KnownAssignedXids. We don't seem to need to prune dtx since we always have to recover any prepared-but-not-committed dtx; we shouldn't prune/remove them since we expect them to eventually complete.
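As a rough illustration of the bookkeeping above, here is a hypothetical replay-side sketch: seeding the committed-dtx list from the extended checkpoint, adding a gxid on replay of a distributed commit, and dropping it on replay of the forget record. The function names, struct layout, fixed-size array, and the exact latestCompletedGxid convention are assumptions for illustration, not the actual redo code.

```c
#include <stdint.h>

typedef uint64_t DistributedTransactionId;

#define HYPO_MAX_DTX 64              /* toy capacity; the real array lives in shmem */

typedef struct HypoDtxStandbyState
{
    DistributedTransactionId latestCompletedGxid;
    int                      count;
    DistributedTransactionId committedGxids[HYPO_MAX_DTX]; /* stand-in for shmCommittedGxidArray */
} HypoDtxStandbyState;

/*
 * StartupXLOG: seed the committed-dtx list from the dtx list carried in the
 * extended checkpoint (TMGXACT_CHECKPOINT) and latestCompletedGxid from the
 * nextGxid that is written as-is in the checkpoint.  Whether the value is
 * nextGxid or nextGxid - 1 is a convention detail; "- 1" is assumed here.
 */
static void
hypo_redo_checkpoint(HypoDtxStandbyState *st,
                     const DistributedTransactionId *ckpt_gxids, int n,
                     DistributedTransactionId nextGxid)
{
    st->count = 0;
    for (int i = 0; i < n && i < HYPO_MAX_DTX; i++)
        st->committedGxids[st->count++] = ckpt_gxids[i];
    st->latestCompletedGxid = nextGxid - 1;
}

/*
 * Replay of a distributed commit: the dtx is committed on the coordinator
 * but, until its forget record is replayed, it stays in the array so that
 * distributed snapshots still treat it as in progress.
 */
static void
hypo_redo_distributed_commit(HypoDtxStandbyState *st, DistributedTransactionId gxid)
{
    if (st->count < HYPO_MAX_DTX)
        st->committedGxids[st->count++] = gxid;
    if (gxid > st->latestCompletedGxid)
        st->latestCompletedGxid = gxid;
}

/* Replay of a distributed forget: the dtx is done everywhere; drop it. */
static void
hypo_redo_distributed_forget(HypoDtxStandbyState *st, DistributedTransactionId gxid)
{
    for (int i = 0; i < st->count; i++)
    {
        if (st->committedGxids[i] == gxid)
        {
            st->committedGxids[i] = st->committedGxids[--st->count];
            break;
        }
    }
}
```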
Known limitations
Snapshot conflicts with AO/CO tables/indexes are not yet supported. We'll handle that in a separate PR.
Other than that, we have a couple of limitations that can only (or at least most easily) be resolved by a restore-point-based approach for dtx isolation.
- It is possible that different segments have replayed up to different WAL records at a given time, so the query results returned to the standby QD might not follow transaction atomicity. To solve this, we need to make sure a dtx snapshot is only taken after every standby segment has replayed sufficient WAL corresponding to the dtx that the standby coordinator has replayed. The way to achieve that is through a restore-point-based approach, which we are going to do later.
- The hot standby won't see the last 1PC (or the last few transactions, if they are all 1PC). This is because 1PC writes no WAL on the QD, so the standby QD won't advance its latestCompletedGxid and its distributed snapshot horizon does not include the last 1PC: it would view the last 1PC as not yet started, or at best still in progress. Only when another 2PC comes will the standby advance its latestCompletedGxid so that its distributed snapshot includes the previous 1PC.
- Distributed snapshots are not taken into account when detecting snapshot conflicts. This problem is shown in a test case in isolation2/hot_standby/query_conflict. To address it, we need every standby segment to record its local snapshot when the distributed snapshot is being created. For now, we regard it as a limitation and will only address it with the restore-point-based dtx snapshot approach which we will implement soon.
- The DLOG might be truncated on the primary, and the standby will replay the same truncation through WAL. However, the standby might still be running a distributed snapshot with a gxid in the truncated DLOG. The reason is that the primary currently truncates the DLOG while advancing oldestXmin, which only considers distributed snapshots on the primary itself. With a restore-point-based approach, the current thinking is that we won't really need distributed snapshot visibility, hence no need for the DLOG.
Thanks to @soumyadeep2007 and @jimmyyih for much of the groundwork.
Pipeline: https://dev.ci.gpdb.pivotal.io/teams/main/pipelines/hsd
P.S. plan for hot standby support in GPDB:
- Enable hot standby dispatch on a mirrored cluster [done]
- Support dtx snapshot isolation for hot standby [in-progress]
- Handle hot standby query conflict [in-progress]
- Enable restore point based snapshot isolation [not started]
- Support snapshot conflict for AO/CO table and index on hot standby [not started]
Here are some reminders before you submit the pull request
- [ ] Add tests for the change
- [ ] Document changes
- [ ] Communicate in the mailing list if needed
- [ ] Pass make installcheck
- [ ] Review a PR in return to support the community