Truncate operation interleaving between END_OF_RECOVERY and CHECKPOINT may panic for AO table.
Bug Report
Greenplum version or build
6X
OS version and uname -a
Linux
autoconf options used ( config.status --config )
Installation information ( pg_config )
Expected behavior
No panic.
Actual behavior
Panic happens on the promoting mirror.
2021-10-20 22:14:14.114115 CST,,,p5666,th-1398155200,,,,0,,,seg0,,,,,"PANIC","58P01","could not fsync file ""base/16384/57361.1"" (is_ao: 1): No such file or directory",,,,,,,0,,"md.c",1331,"Stack trace:
1 0xae230c postgres errstart (elog.c:557)
2 0x96d6ff postgres mdsync (md.c:1328)
3 0x96f4b3 postgres smgrsync (smgr.c:638)
4 0x9315de postgres CheckPointBuffers (bufmgr.c:2008)
5 0x55aac0 postgres
Step to reproduce the behavior
turn on fsync
CREATE TABLE t1(id int) WITH (appendonly=true);
CHECKPOINT;
INSERT INTO t1 SELECT generate_series(1, 100);
-- kill the primary segment
SELECT pg_ctl(datadir, 'stop', 'immediate') FROM gp_segment_configuration WHERE role='p' AND content = 0;
-- truncate the table, loop until success
TRUNCATE t1;
TRUNCATE t1;
TRUNCATE t1;
CHECKPOINT; <--- PANIC at this step
On 6X, if a TRUNCATE interleaves between END_OF_RECOVERY and its following CHECKPOINT, the old AO segment file will be unlinked after COMMIT_PREPARED, but the later CHECKPOINT will still queue and apply the fsync to this removed AO file, resulting in the PANIC. 6X does not forget the fsync request for an AO table before unlinking it in active mode (https://github.com/greenplum-db/gpdb/blob/6X_STABLE/src/backend/storage/smgr/md.c#L446), while it seems to be handled on the master branch.
Reopening this: it is a different issue, and #12699 cannot fix it.
After more investigation: the primary performs the insert fsync for an AO segment via the backend rather than the checkpointer, while the mirror handles the insert fsync during redo by enqueueing an fsync request with the checkpointer. The problem happens when the mirror has just been promoted and a TRUNCATE comes in between END_OF_RECOVERY and the subsequent checkpoint. At this moment, the insert fsync was already enqueued with the checkpointer while in mirror mode, but the node is now in primary mode, so there is no ForgetFsync to tell the checkpointer to ignore the previous fsync before the AO segment file is unlinked, resulting in a PANIC.
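To make the window concrete, here is a minimal standalone C sketch of the decision described above. It is not GPDB code, and every name in it (NodeState, forward_to_checkpointer, ao_fsync_request, ao_unlink) is hypothetical; it only models the shape of the IsStandbyMode() || InRecovery condition, where both the fsync request and the corresponding forget go through the checkpointer only while that condition holds.

/*
 * Toy model (not GPDB code): how an fsync request for an AO segment is
 * handled depending on the node's mode.  All names are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    bool is_standby_mode;   /* mirror replaying WAL from the primary */
    bool in_recovery;       /* startup process still in recovery */
} NodeState;

/* Mirrors the shape of the 6X condition: IsStandbyMode() || InRecovery */
static bool
forward_to_checkpointer(const NodeState *s)
{
    return s->is_standby_mode || s->in_recovery;
}

static void
ao_fsync_request(const NodeState *s, const char *path)
{
    if (forward_to_checkpointer(s))
        printf("enqueue fsync of %s with the checkpointer\n", path);
    else
        printf("backend fsyncs %s directly; pending table untouched\n", path);
}

static void
ao_unlink(const NodeState *s, const char *path)
{
    /*
     * The forget request is gated on the same condition, so after promotion
     * it is skipped and an entry enqueued during replay stays in the
     * checkpointer's table even though the file is about to disappear.
     */
    if (forward_to_checkpointer(s))
        printf("forget pending fsync of %s, then unlink\n", path);
    else
        printf("unlink %s without forgetting the earlier request\n", path);
}

int
main(void)
{
    NodeState mirror = { .is_standby_mode = true, .in_recovery = true };
    NodeState promoted = { .is_standby_mode = false, .in_recovery = false };

    ao_fsync_request(&mirror, "base/16384/57361.1");  /* redo of the INSERT */
    ao_unlink(&promoted, "base/16384/57361.1");       /* TRUNCATE after promotion */
    /* The next CHECKPOINT fsyncs the stale entry and hits ENOENT -> PANIC. */
    return 0;
}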
It seems to be a corner case with no persistent impact.
The issue is not found on the master branch.
The condition used to offload or clear fsync work to the checkpointer or backend for AO tables is mainly based on IsStandbyMode() || InRecovery. Let's put down the timeline to make the window clear and then assess the impact.
1. Mirror is promoted.
2. All existing WAL records are replayed <-- this step involves AO operations that queue fsync requests with the checkpointer.
3. The END_OF_RECOVERY record is written to WAL.
4. InRecovery = false is set <--- this flag is useful only for the startup process and not for regular backends; after this step AO operations will not queue or clear fsync requests.
5. Connections are accepted, mainly by setting xlogctl->SharedRecoveryInProgress, which allows WAL to be written by incoming connections.
6. CHECKPOINT.
If a TRUNCATE or DROP TABLE happens before the next checkpoint comes, the fsync will fail with the file missing. Agreed that it is "a corner case and no persistent impact", as it will be a one-time crash which can happen only during mirror promotion, not a rolling PANIC case; on crash recovery things will be handled fine, since during crash recovery InRecovery = false is set only after the end-of-recovery checkpoint is written.
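The failing step itself can be reproduced outside the database with a small standalone C program. It is purely illustrative: the pending[] array merely stands in for the checkpointer's pendingOpsTable, and the segment file name is made up to echo the log above.

/*
 * Standalone illustration (not GPDB code) of the failure window described
 * above: a pending fsync entry outlives the file it points to.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_PENDING 8

static const char *pending[MAX_PENDING];  /* stand-in for pendingOpsTable */
static int npending;

/* Step 2 of the timeline: redo of the INSERT remembers an fsync request. */
static void
remember_fsync(const char *path)
{
    pending[npending++] = path;
}

/* Step 6 of the timeline: the first checkpoint after promotion. */
static void
checkpoint(void)
{
    for (int i = 0; i < npending; i++)
    {
        int fd = open(pending[i], O_RDWR);

        if (fd < 0)
        {
            /* md.c escalates ENOENT here to PANIC "could not fsync file" */
            printf("PANIC: could not fsync file \"%s\": %s\n",
                   pending[i], strerror(errno));
            return;
        }
        fsync(fd);
        close(fd);
        printf("fsynced %s\n", pending[i]);
    }
    npending = 0;
}

int
main(void)
{
    const char *aoseg = "57361.1";               /* hypothetical AO segment */

    close(open(aoseg, O_CREAT | O_RDWR, 0600));  /* file exists during replay */
    remember_fsync(aoseg);                       /* enqueued while in mirror mode */

    /* Promotion happens here; recovery is over, backends run in primary mode. */

    unlink(aoseg);   /* TRUNCATE removes the old segment without a forget */
    checkpoint();    /* fsync of the stale entry fails with ENOENT */
    return 0;
}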
A PANIC is not good, especially when it happens on the current acting primary, and hence we should continue to explore how to avoid it even if it is a corner case. On a busy system the window between step 5 and step 6 can be big, as the checkpoint happens in an asynchronous fashion, and hence we will have to solve it.
Possible solutions:
A. Don't do fast promotion, which will make sure a checkpoint is performed before opening connections, similar to crash recovery. Seems reasonable, though it delays the failover slightly and defeats the purpose for which fast promote was introduced.
B. Always queue fsync clear requests for AO tables, irrespective of IsStandbyMode() || InRecovery. Downside: unnecessary queuing of fsync requests for AO tables on the primary, filling the fsync queue.
C. Add a mechanism to know the window until the first checkpoint after promotion is performed (step 5 to step 6) and continue to forward requests to the checkpointer only for that window; an optimized version of solution B (see the sketch after this list). It needs to convey that mechanism via some shared memory state between the checkpointer process and the backends.
D. As part of CreateEndOfRecoveryRecord(), perform the fsync operations for AO tables and clear the fsync queue for AO tables. No such functionality exists, as it needs coordination between the startup and checkpointer processes (only the checkpointer should be clearing the fsync queue, as it owns the pendingOpsTable).
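For option C, referenced above as "see the sketch after this list", here is a rough standalone sketch that uses a C11 atomic flag as a stand-in for the shared memory state; all names are hypothetical and this is only a sketch of the idea under those assumptions, not a proposed patch.

/*
 * Toy sketch of option C (hypothetical names, not GPDB code): track the
 * window between promotion and the first checkpoint, and keep forwarding
 * AO forget/fsync requests to the checkpointer only inside that window.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Would live in shared memory (e.g. next to xlogctl) in a real change. */
static atomic_bool promotion_checkpoint_pending;

/* Startup process: fast promotion finished recovery and wrote
 * END_OF_RECOVERY, but no checkpoint has been performed yet (step 5). */
static void
end_of_recovery_without_checkpoint(void)
{
    atomic_store(&promotion_checkpoint_pending, true);
}

/* Checkpointer: the first checkpoint after promotion closes the window (step 6). */
static void
checkpoint_done(void)
{
    atomic_store(&promotion_checkpoint_pending, false);
}

/* Backend: decide whether an AO unlink must still send a forget request. */
static bool
must_forget_before_unlink(bool is_standby_mode, bool in_recovery)
{
    return is_standby_mode || in_recovery ||
           atomic_load(&promotion_checkpoint_pending);
}

int
main(void)
{
    end_of_recovery_without_checkpoint();
    printf("inside window, forget before unlink: %d\n",
           (int) must_forget_before_unlink(false, false));  /* 1: still forwarded */

    checkpoint_done();
    printf("after first checkpoint, forget before unlink: %d\n",
           (int) must_forget_before_unlink(false, false));  /* 0: back to today's behavior */
    return 0;
}

Option B, in this toy model, would simply be the same check with the window flag always treated as true, i.e. the forget request always forwarded.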
I don't like B due to the performance impact and complexity added to regular primary work, and C and D due to code complexity and unnecessary further divergence from upstream code. Solution A seems the simplest and most straightforward to implement, but comes with a slight impact on failover time to cover this corner case. Posting these thoughts to capture a current snapshot of my thinking; no outcome yet. Let's continue to think about how to take this forward.
Both options A and B introduce a different behavior (or impact) compared with the original design, although it seems to be harmless.
Option C looks like it sits between them, limiting the difference to a small window. If we can find a simple and clean way to identify this small window, that might be the appropriate direction to move forward.
As far as I can see, if we want to keep the fast_promoted functionality, we have to handle ForgetFsync-before-unlink, since a truncate-like operation may come in between the last couple of records (END_OF_RECOVERY and the following checkpoint).
Option D seems to introduce more complexity, so it is not on my list currently.
It seems to be a corner case with no persistent impact, and the issue is not found on the master branch. Closing this as won't-fix for now: it is an extremely rare corner case, the impact is very low, and the possible solutions to fix it are complex. Not worth fixing for now.
Reopened, as someone has now hit a similar issue.