Dispatch by shared memory
This PR dispatches serializedPlantree and serializedQueryDispatchDesc through shared memory instead of the interconnect, so that they are sent only once to the writer QE and then synced between the reader QEs and the writer QE on a segment through DSM. This has been discussed in https://github.com/orgs/cloudberrydb/discussions/243.
Implementation Outline
- This PR mainly uses polling on the reader QE to wait for the shared plan to be dispatched by the writer QE. This is the same mechanism as shared snapshot synchronization. To circumvent this we may need something special (e.g., a signal), because the existing synchronization primitives (Barrier, SharedLatch) do not seem to satisfy our requirement.
- To properly reclaim the DSM segments for a query, a reference count is calculated on the writer QE, and the last reader QE that reads the plan reclaims the DSM. This is the same approach as parallel (GpInsertParallelDSMHash).
- To isolate state between user connections, a shmem HTAB is used, again the same as parallel. (A minimal sketch of the whole mechanism follows this list.)
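To make the outline concrete, here is a minimal sketch of the mechanism under the assumptions above. All names (SharedPlanEntry, SharedPlanHash, SharedPlanLock, publish_shared_plan, wait_for_shared_plan, release_shared_plan) are hypothetical and do not refer to the actual symbols in this PR; error handling and stale-entry cleanup are omitted.

```c
/*
 * Hypothetical sketch, not the PR's actual code. A shmem HTAB keyed by
 * gp_session_id isolates sessions; the writer QE copies the serialized
 * plan into a DSM segment and publishes its handle plus a reference
 * count; reader QEs poll until the handle shows up, attach, and the
 * last reader drops the entry so the segment can be reclaimed.
 */
#include "postgres.h"
#include "miscadmin.h"
#include "storage/dsm.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

typedef struct SharedPlanEntry
{
	int			session_id;		/* hash key: gp_session_id */
	dsm_handle	handle;			/* DSM segment holding the serialized plan */
	uint32		refcount;		/* reader QEs that still have to consume it */
} SharedPlanEntry;

static HTAB	   *SharedPlanHash;	/* created with ShmemInitHash() at startup */
static LWLock  *SharedPlanLock;	/* protects SharedPlanHash */

/* Writer QE: publish the serialized plan for its session. */
static void
publish_shared_plan(int session_id, const char *plan, Size len, uint32 nreaders)
{
	dsm_segment *seg = dsm_create(len, 0);
	SharedPlanEntry *entry;
	bool		found;

	memcpy(dsm_segment_address(seg), plan, len);

	LWLockAcquire(SharedPlanLock, LW_EXCLUSIVE);
	entry = (SharedPlanEntry *) hash_search(SharedPlanHash, &session_id,
											HASH_ENTER, &found);
	entry->handle = dsm_segment_handle(seg);
	entry->refcount = nreaders;
	LWLockRelease(SharedPlanLock);
}

/* Reader QE: poll until the writer has published, then attach. */
static dsm_segment *
wait_for_shared_plan(int session_id)
{
	for (;;)
	{
		SharedPlanEntry *entry;
		bool		found;

		LWLockAcquire(SharedPlanLock, LW_SHARED);
		entry = (SharedPlanEntry *) hash_search(SharedPlanHash, &session_id,
												HASH_FIND, &found);
		if (found)
		{
			dsm_handle	handle = entry->handle;

			LWLockRelease(SharedPlanLock);
			return dsm_attach(handle);
		}
		LWLockRelease(SharedPlanLock);

		CHECK_FOR_INTERRUPTS();
		pg_usleep(1000L);		/* same spirit as shared-snapshot polling */
	}
}

/* Reader QE: after deserializing the plan, drop its reference. */
static void
release_shared_plan(int session_id, dsm_segment *seg)
{
	SharedPlanEntry *entry;
	bool		found;

	dsm_detach(seg);

	LWLockAcquire(SharedPlanLock, LW_EXCLUSIVE);
	entry = (SharedPlanEntry *) hash_search(SharedPlanHash, &session_id,
											HASH_FIND, &found);
	if (found && --entry->refcount == 0)
		hash_search(SharedPlanHash, &session_id, HASH_REMOVE, NULL);
	LWLockRelease(SharedPlanLock);
}
```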
Prerequisites
This feature is only enabled when: 1) the current query is not an extended query (cursors, Bind messages, etc.), and 2) there exists a gang in which all QEs are writer QEs (note that writer QE and writer gang are two different concepts).
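Conceptually, the QD-side guard could look like the sketch below. This is an illustration under stated assumptions, not the code in this PR: gang_is_all_writers() is a hypothetical helper, while List/ListCell and the Gang type come from the PostgreSQL and GPDB/Cloudberry headers.

```c
/*
 * Hypothetical sketch of the enablement check; not the actual code in
 * this PR. gang_is_all_writers() is an assumed helper.
 */
#include "postgres.h"
#include "nodes/pg_list.h"
#include "cdb/cdbgang.h"

extern bool gang_is_all_writers(const Gang *gang);	/* hypothetical */

static bool
shared_plan_dispatch_allowed(bool is_extended_query, List *gangs)
{
	ListCell   *lc;

	/* Prerequisite 1: plain queries only, no cursors or Bind messages. */
	if (is_extended_query)
		return false;

	/* Prerequisite 2: at least one gang must consist solely of writer QEs. */
	foreach(lc, gangs)
	{
		Gang	   *gang = (Gang *) lfirst(lc);

		if (gang_is_all_writers(gang))
			return true;
	}

	return false;
}
```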
Prerequisite 1
Prerequisite 1 is a hard limit due to the way extended queries (equery for short) work. During an equery there is always a live writer gang in which every QE is a writer (gang W). First, a command set gp_write_shared_snapshot=true is dispatched to gang W to force a shared snapshot sync; then the actual gang is created, in which every QE is a reader (gang R), and the actual query is dispatched to it. You can immediately see that when the actual query is dispatched, no writer QE receives the plan (because all writer QEs are in gang W), so no one is responsible for shared query plan synchronization.
Prerequisite 2
Prerequisite 2 is a tradeoff. Consider the following query plan:
In this plan, seg0 in slice1 is a writer that should receive the full query text, but in slice2, seg0 is a reader that should receive the slim query text (a slimQueryText is a query text without the query plan and ddesc), while seg1 and seg2 are writer QEs. This means that when dispatching to gang2, seg0 should receive the slim query text (because the full plan can be synced to it from seg0 in slice1, which is a writer), but seg1 and seg2 should receive the full query text. This poses a challenge because, on the QD side, the current cdb dispatcher interface only supports dispatching on a per-gang basis (cdbdisp_dispatchToGang), and the plan cannot vary from segment to segment within a gang. For the same reason, the reference count of a DSM segment cannot be dispatched directly from the QD, because it may differ from QE to QE even within the same gang (consider a plan that has a singleton reader).
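To make the limitation concrete, a per-segment decision would have to look roughly like the sketch below. Everything here is hypothetical (the helpers and the flattened per-segment arrays are illustration only); the point is that the choice between slim and full query text is per segment, which a per-gang dispatch interface cannot express.

```c
/*
 * Illustration only, with hypothetical helpers: cdbdisp_dispatchToGang
 * builds a single query text for the whole gang, so a per-segment
 * decision like this one is not expressible through today's interface.
 */
#include <stdbool.h>

static void use_slim_query_text(int segindex);	/* hypothetical */
static void use_full_query_text(int segindex);	/* hypothetical */

static void
choose_query_text_per_segment(int nsegs,
							  const bool *is_writer_in_this_gang,
							  const bool *has_writer_in_another_slice)
{
	for (int i = 0; i < nsegs; i++)
	{
		if (!is_writer_in_this_gang[i] && has_writer_in_another_slice[i])
			use_slim_query_text(i);		/* reader: the plan reaches it via DSM */
		else
			use_full_query_text(i);		/* writer: needs the full serialized plan */
	}
}
```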
We could surely work around this with a more thorough refactor of the cdb dispatcher interfaces, but I don't think that is worth it, though that is certainly debatable.
Other Caveats
Updatable Views
It may seem that the following invariant holds for any given query:
For any segment that a plan touches, there's always a writer QE exists on that segment.
This is indeed true for many common queries, but unfortunately not all of them. Below is a counterexample:
InitPlan
If there's an InitPlan at the root of a plan, there can be two sets of writer gangs created and two rounds of dispatching for the same query:
This is why we limit the reference calculation to the "same root" (https://github.com/Ray-Eldath/cloudberrydb/blob/dispatch-by-shmem/organized-and-unlogged/src/backend/utils/time/sharedqueryplan.c#L119-L121). Note that an InitPlan doesn't necessarily have to be at the root; it can be deep down the plan tree as well.
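For illustration, a minimal sketch of that same-root restriction is below. The flattened slice arrays and the names are assumptions for this example, not the layout used in sharedqueryplan.c.

```c
/*
 * Hypothetical sketch of the "same root" restriction in the reference
 * calculation. slice_root[] records which plan root each slice was
 * dispatched under (a root-level InitPlan produces a second root and a
 * second dispatch round); only reader QEs on this segment whose slice
 * hangs off the writer's own root are counted toward the DSM refcount.
 */
static int
count_readers_under_same_root(int nslices,
							  const int *slice_root,
							  const int *readers_on_this_seg,
							  int my_root)
{
	int			refcount = 0;

	for (int i = 0; i < nslices; i++)
	{
		if (slice_root[i] != my_root)
			continue;			/* belongs to the other dispatch round */

		refcount += readers_on_this_seg[i];
	}

	return refcount;
}
```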
Possible Outcome
All in all, though many requirements need to be met for this feature to take effect, it is still enabled for most common queries (see the tests for examples). That is good news. On the bad side, I doubt whether this PR can make any noticeable performance improvement at all: on the QD we cannot completely get rid of the libpq connections for now, and query dispatch is already pipelined to hide the interconnect cost anyway. In the long run, if we are to "decentralize" the QE by reassigning tasks (such as creating reader QEs, keepalive, etc.) to the writer QE, this feature is a very good pathfinder and also a mandatory prerequisite. But if that's not the case, I doubt whether this feature alone is worth the risk.
else if (Gp_role == GP_ROLE_EXECUTE)
{
if (Gp_is_writer)
{
addSharedSnapshot("Writer qExec", gp_session_id);
}
else
{
/*
* NOTE: This assumes that the Slot has already been
* allocated by the writer. Need to make sure we
* always allocate the writer qExec first.
*/
lookupSharedSnapshot("Reader qExec", "Writer qExec", gp_session_id);
}
}
Can we reuse the SharedSnapshot logic and code?
> On the bad side, I doubt whether this PR can make any noticeable performance improvement at all. [...] In the long run, if we are to "decentralize" the QE by reassigning tasks (such as creating reader QEs, keepalive, etc.) to the writer QE, this feature is a very good pathfinder and also a mandatory prerequisite. But if that's not the case, I doubt whether this feature alone is worth the risk.
Agreed, we have a long way to go, but this PR is a good start! Thanks for your work and the rich description.
> Can we reuse the SharedSnapshot logic and code?
The original implementation used SharedSnapshot, but I quickly gave it up. SharedSnapshot serves a totally different purpose, and its lifetime is tied to a transaction, not a query. When the dtx context is AUTO_COMMIT_IMPLICIT the transaction lifetime equals the query lifetime, but when you explicitly use a transaction with a BEGIN statement etc., they differ. We could of course make this work, but I didn't go down that path. I can try again if you think that's better.
Also, I could never figure out why they use an array to store slots discriminated by gp_session_id. Using a ShmemSharedHTAB, as this PR and parallel do, leads to a far simpler implementation. Do you have a theory on this? @yjhjstz
> The original implementation used SharedSnapshot, but I quickly gave it up. SharedSnapshot serves a totally different purpose, and its lifetime is tied to a transaction, not a query. [...]
I found they have the same logic except for the lifetime.
Closing the PR since there has been no response for a long time.