
Truncate operation interleaved between END_OF_RECOVERY and CHECKPOINT may panic for AO table.

Open haolinw opened this issue 3 years ago • 7 comments

Bug Report

Greenplum version or build

6X

OS version and uname -a

Linux

autoconf options used ( config.status --config )

Installation information ( pg_config )

Expected behavior

No panic.

Actual behavior

Panic happens on the promoting mirror.

```
2021-10-20 22:14:14.114115 CST,,,p5666,th-1398155200,,,,0,,,seg0,,,,,"PANIC","58P01","could not fsync file ""base/16384/57361.1"" (is_ao: 1): No such file or directory",,,,,,,0,,"md.c",1331,"Stack trace:
1    0xae230c postgres errstart (elog.c:557)
2    0x96d6ff postgres mdsync (md.c:1328)
3    0x96f4b3 postgres smgrsync (smgr.c:638)
4    0x9315de postgres CheckPointBuffers (bufmgr.c:2008)
5    0x55aac0 postgres (xlog.c:9238)
6    0x55a278 postgres CreateCheckPoint (xlog.c:8962)
7    0x8d2d08 postgres CheckpointerMain (checkpointer.c:524)
8    0x5af5da postgres AuxiliaryProcessMain (bootstrap.c:453)
9    0x8e48fe postgres (postmaster.c:5846)
10   0x8e423e postgres (postmaster.c:5506)
11   0x7f0fab582630 libpthread.so.0 + 0xab582630
12   0x7f0faad82a13 libc.so.6 __select + 0x13
13   0x8df572 postgres (postmaster.c:1892)
14   0x8dec6d postgres PostmasterMain (postmaster.c:1518)
15   0x7d70f5 postgres (main.c:264)
16   0x7f0faacaf555 libc.so.6 __libc_start_main + 0xf5
17   0x48d4d9 postgres + 0x48d4d9
```

Step to reproduce the behavior

turn on fsync

```sql
CREATE TABLE t1(id int) WITH (appendonly=true);
CHECKPOINT;
INSERT INTO t1 SELECT generate_series(1, 100);
-- kill the primary segment
SELECT pg_ctl(datadir, 'stop', 'immediate') FROM gp_segment_configuration WHERE role='p' AND content = 0;
-- truncate the table, loop until success
TRUNCATE t1;
TRUNCATE t1;
TRUNCATE t1;
CHECKPOINT;   -- <--- PANIC at this step
```

haolinw avatar Oct 20 '21 07:10 haolinw

On 6X, if a TRUNCATE interleaves between END_OF_RECOVERY and its following CHECKPOINT, the TRUNCATE (after COMMIT_PREPARED) unlinks the old AO segment file, but the later CHECKPOINT still applies the fsync request that was queued for this now-removed AO file, resulting in the PANIC. On 6X the fsync request for an AO table is not forgotten before the unlink in active (non-recovery) mode (https://github.com/greenplum-db/gpdb/blob/6X_STABLE/src/backend/storage/smgr/md.c#L446), while this seems to be handled on the master branch.
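
For context, here is a minimal sketch of the forget-before-unlink pattern md.c applies to heap relations, which is the step being skipped for AO segment files. This is illustrative only, not the actual 6X code; the helper name `unlink_segment_safely` is invented.

```c
/* Illustrative sketch, not the actual GPDB 6X code. */
#include "postgres.h"

#include <unistd.h>

#include "storage/smgr.h"		/* ForgetRelationFsyncRequests() */

static void
unlink_segment_safely(RelFileNode rnode, ForkNumber forknum, const char *path)
{
	/*
	 * Tell the checkpointer to drop any fsync request it has queued for
	 * this relation, so a later mdsync() never tries to fsync a file that
	 * no longer exists.
	 */
	ForgetRelationFsyncRequests(rnode, forknum);

	/* Only now is it safe to remove the segment file. */
	if (unlink(path) < 0 && errno != ENOENT)
		ereport(WARNING,
				(errcode_for_file_access(),
				 errmsg("could not remove file \"%s\": %m", path)));
}
```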

haolinw avatar Oct 20 '21 08:10 haolinw

Reopening this: it is a different issue, and #12699 cannot fix it.

haolinw avatar Oct 21 '21 11:10 haolinw

After more investigation: the primary performs the insert fsync for an AO segment in the backend rather than in the checkpointer, while the mirror handles the insert fsync during redo by enqueueing an fsync request to the checkpointer. The problem happens when the mirror has just been promoted and a truncate comes in between end_of_recovery and the subsequent checkpoint. At that moment the insert fsync has already been enqueued to the checkpointer while in mirror mode, but the segment is now in primary mode, so there is no ForgetFsync to tell the checkpointer to ignore the previously queued fsync before the AO segment file is unlinked, resulting in a PANIC.
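
To make that split concrete, here is a hypothetical sketch of the gating described above. The helpers `register_ao_fsync`, `RememberAOSegmentFsync`, and `FsyncAOSegmentFile` are invented names for illustration; only the `IsStandbyMode() || InRecovery` condition corresponds to what is discussed in this thread.

```c
/* Hypothetical sketch; helper names below are invented for illustration. */
#include "postgres.h"

#include "access/xlog.h"			/* InRecovery */
#include "storage/relfilenode.h"	/* RelFileNode */

static void
register_ao_fsync(RelFileNode rnode, int segno)
{
	if (IsStandbyMode() || InRecovery)
	{
		/* Mirror / recovery: hand the fsync off to the checkpointer. */
		RememberAOSegmentFsync(rnode, segno);
	}
	else
	{
		/*
		 * Acting primary: fsync directly in this backend.  Nothing is queued
		 * with the checkpointer, and nothing tells it to forget requests
		 * that were queued while this segment was still a mirror.
		 */
		FsyncAOSegmentFile(rnode, segno);
	}
}
```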

It seems to be a corner case with no persistent impact.

The issue is not found on the master branch.

haolinw avatar Oct 22 '21 03:10 haolinw

The condition used to decide whether fsync work for an AO table is offloaded to (or cleared from) the checkpointer versus handled in the backend is mainly based on IsStandbyMode() || InRecovery. Let's lay out a timeline to make the window clear and then assess the impact (a simplified sketch of this promotion sequence follows the list).

  1. Mirror is promoted.
  2. Replay of all existing WAL records completes <-- this step involves AO operations queueing fsync requests with the checkpointer.
  3. The END_OF_RECOVERY record is written to WAL.
  4. InRecovery = false is set <--- this flag matters only for the startup process, not for regular backends; after this step AO operations will no longer queue or clear fsync requests.
  5. The server opens for connections, mainly by clearing xlogctl->SharedRecoveryInProgress, which allows WAL to be written by incoming connections.
  6. CHECKPOINT.
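
For reference, a simplified outline of how the tail of StartupXLOG() maps to the steps above. This is based on the upstream fast-promotion code this area derives from, not the exact 6X source, and is an annotated excerpt rather than compilable code.

```c
/*
 * Simplified outline of the fast-promotion tail of StartupXLOG(),
 * annotated with the timeline steps above (upstream-derived sketch).
 */

/* step 2: the redo loop has finished replaying all available WAL */

if (fast_promoted)
	CreateEndOfRecoveryRecord();		/* step 3: END_OF_RECOVERY record */
else
	CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);

InRecovery = false;						/* step 4: startup process only */

/* ... CLOG/subtrans startup, prepared-transaction recovery, etc. ... */

XLogCtl->SharedRecoveryInProgress = false;	/* step 5: backends may now write WAL */

if (fast_promoted)
	RequestCheckpoint(CHECKPOINT_FORCE);	/* step 6: requested asynchronously;
											 * the vulnerable window is everything
											 * between step 5 and this checkpoint
											 * actually completing */
```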

If a TRUNCATE or DROP TABLE happens before the next checkpoint, the fsync will fail because the file is missing. I agree it is "a corner case and no persistent impact": it is a one-time crash that can only happen during mirror promotion, not a rolling PANIC, since during crash recovery things are handled fine. In crash recovery, InRecovery = false is set only after the end-of-recovery checkpoint has been written.

A PANIC is not good, especially on the current acting primary, so we should continue to explore how to avoid it even if it is a corner case. On a busy system the window between step 5 and step 6 can be large, since the checkpoint happens asynchronously, so we will have to solve this.

Possible Solutions:

A. Don't do fast promotion, which ensures a checkpoint is performed before opening connections, similar to crash recovery. Seems reasonable, though it delays failover slightly and defeats the purpose for which fast promote was introduced.

B. Always queue fsync clear requests for AO tables, irrespective of IsStandbyMode() || InRecovery. This means unnecessary queuing of fsync requests on the primary for AO tables and filling up the fsync queue.

C. Add a mechanism to know the window between promotion and the first checkpoint after it (step 5 to step 6), and continue to forward requests to the checkpointer only during that window. An optimized version of solution B. The mechanism would need to be conveyed via some shared-memory state between the checkpointer process and the backends (a hypothetical sketch follows the discussion below).

D. As part of CreateEndOfRecoveryRecord(), perform the fsync operations for AO tables and clear the fsync queue for AO tables. No such functionality exists, as it needs coordination between the startup and checkpointer processes (only the checkpointer should be clearing the fsync queue, since it owns the pendingOpsTable).

I don't like B due to the performance impact and the complexity it adds to regular primary work, and C and D due to code complexity and unnecessary further divergence from upstream. Solution A seems simplest and most straightforward to implement, but comes with a slight impact on failover times to cover this corner case. Posting these thoughts to capture the current snapshot of my thinking; no outcome yet. Let's continue to think about how to take this forward.
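
To make option C slightly more concrete, here is a hypothetical sketch of the shared-memory state it would need; all names are invented for illustration, and this is not proposed code. The idea is that backends keep forwarding AO fsync/forget work to the checkpointer until the first post-promotion checkpoint completes.

```c
/* Hypothetical sketch of option C; all names are invented. */
#include "postgres.h"

#include "access/xlog.h"		/* InRecovery */
#include "storage/spin.h"		/* slock_t, SpinLockAcquire/Release */

typedef struct PromotionFsyncWindow
{
	slock_t		mutex;
	bool		forward_ao_fsync;	/* true from promotion until the first checkpoint */
} PromotionFsyncWindow;

/* Lives in shared memory; allocation and SpinLockInit() at startup omitted. */
static PromotionFsyncWindow *promoWindow;

/* Called by backends deciding where AO fsync work should go. */
static bool
ShouldForwardAOFsyncToCheckpointer(void)
{
	bool		forward;

	if (IsStandbyMode() || InRecovery)
		return true;			/* existing behavior: mirror/recovery forwards */

	SpinLockAcquire(&promoWindow->mutex);
	forward = promoWindow->forward_ao_fsync;
	SpinLockRelease(&promoWindow->mutex);

	return forward;				/* stay in "forward" mode until the flag is cleared */
}

/* Called by the checkpointer after the first checkpoint following promotion. */
static void
ClosePromotionFsyncWindow(void)
{
	SpinLockAcquire(&promoWindow->mutex);
	promoWindow->forward_ao_fsync = false;
	SpinLockRelease(&promoWindow->mutex);
}
```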

ashwinstar avatar Oct 22 '21 21:10 ashwinstar

Both option A and option B introduce behavior (or impact) different from the original design, although it seems to be harmless.

Option C looks like it sits between them, limiting the difference to a small window. If we can find a simple and clean way to identify this small window, that might be the appropriate direction to move forward?

As far as I can see, if we want to keep the fast_promoted functionality, we have to handle ForgetFsync-before-unlink, since any truncate-like operation may come in between the last pair of records (end_of_recovery & checkpoint).

Option D seems to introduce more complexity; it is not on my list currently.

haolinw avatar Nov 01 '21 07:11 haolinw

It seems to be a corner case with no persistent impact.

The issue is not found on the master branch.

Closing this issue as won't-fix for now. It is an extremely rare corner case, the impact is very low, and the possible solutions are complex. Not worth fixing for now.

ashwinstar avatar Dec 14 '22 19:12 ashwinstar

Reopened, as someone has now hit a similar issue.

haolinw avatar Apr 10 '24 09:04 haolinw