5X: could not fsync segment N of relation N/N/N: No such file or directory
Greenplum version or build
OS version and uname -a
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Linux ... 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Installation information
pg_config
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/lib/gpdb/include
CFLAGS = -O3 -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -fno-aggressive-loop-optimizations -Wno-unused-but-set-variable -Wno-address -I/usr/lib/gpdb/include
CFLAGS_SL = -fpic
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/usr/lib/gpdb/lib',--enable-new-dtags -L/usr/lib/gpdb/lib
LDFLAGS_SL =
LIBS = -lpgport -lgpopt -lnaucrates -lgpdbcost -lgpos -lxerces-c -lxslt -lxml2 -lrt -lkrb5 -lcom_err -lgssapi_krb5 -lzstd -lcrypt -ldl -lm -L/usr/lib/gpdb/lib
VERSION = PostgreSQL 8.3.23
The build is without asserts and without debug information.
Actual behaviour
Two mirrors fail in a cluster of ~500 segments (including mirrors).
gprecoverseg is then run. The utility itself completes successfully; however, the resync process fails with the following errors:
- Primary
2020-09-20 00:00:00.278247 MSK,,,p705,th657053824,,,,0,,,seg-1,,,,,"ERROR","58P01","could not fsync segment 0 of relation 16734/16384/15730654: No such file or directory",,,,,,,0,,"md.c",2058,"Stack trace:
1 0x935f06 postgres errstart + 0x1f6
2 0x812f99 postgres mdsync + 0x159
3 0x7f5c1b postgres CheckPointBuffers + 0xbb
4 0x527afd postgres CreateCheckPoint + 0x47d
5 0x7b1d21 postgres CheckpointerMain + 0x2c1
6 0x580125 postgres AuxiliaryProcessMain + 0x475
7 0x7ba610 postgres <symbol not found> + 0x7ba610
8 0x7ba8b2 postgres <symbol not found> + 0x7ba8b2
9 0x7bca8d postgres <symbol not found> + 0x7bca8d
10 0x7c3312 postgres <symbol not found> + 0x7c3312
11 0x7c4702 postgres PostmasterMain + 0xc62
12 0x4cc1fd postgres main + 0x39d
13 0x7f53229533d5 libc.so.6 __libc_start_main + 0xf5
14 0x4cc7ef postgres <symbol not found> + 0x4cc7ef
"
- Mirror
2020-09-19 01:01:16.006443 MSK,,,p56738,th1478842496,,,,0,,,seg-1,,,,,"LOG","00000","'unlink failed identifier '/data2/mirror/gpseg0/pg_xlog/' errno '21' ', mirroring role 'mirror role' mirroring state 'resync' segment state 'up and running' filerep state 'up and running' process name(pid) 'mirror consumer process(56738)' 'cdbfilerepmirror.c' 'L2490' 'FileRepMirror_DropFilesFromDir'",,,,,,,0,,"cdbfilerep.c",1824,
2020-09-19 01:01:17.884089 MSK,,,p56737,th1478842496,,,2000-01-01 03:00:00 MSK,0,,,seg-1,,,,,"WARNING","XX000","receive close on connection: Success (cdbfilerepconnserver.c:343)",,,,,,,0,,"cdbfilerepconnserver.c",343,
Solution
The error originates from https://github.com/greenplum-db/gpdb/blob/eb3677e9921499cb403e288c83074df37cf704c0/src/backend/storage/smgr/md.c#L2052
Its direct cause is the presence of an entry
https://github.com/greenplum-db/gpdb/blob/eb3677e9921499cb403e288c83074df37cf704c0/src/backend/storage/smgr/md.c#L1947
in the pendingOpsTable that requests an fsync for a relfilenode which has already been deleted (the "faulty" relfilenode). There is also no relation in pg_class with the "faulty" relfilenode.
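For reference, below is a condensed sketch of the data structure and code path involved. It is based on upstream PostgreSQL 8.3 md.c, from which the 5X file linked above is derived, so the names and the exact control flow are assumptions and may differ in GPDB; the point is only to show why a stale, non-canceled entry for an already-unlinked file is fatal to the checkpoint.

typedef struct
{
	RelFileNode rnode;          /* the targeted relation */
	BlockNumber segno;          /* which 1 GB segment file of it */
} PendingOperationTag;

typedef struct
{
	PendingOperationTag tag;    /* hash key of pendingOpsTable (must be first) */
	bool        canceled;       /* true => request canceled, not yet removed */
	CycleCtr    cycle_ctr;      /* mdsync_cycle_ctr when the request was made */
} PendingOperationEntry;

/* Condensed from the mdsync() retry loop, for an entry that is not canceled: */
for (failures = 0;; failures++)
{
	reln = smgropen(entry->tag.rnode);
	seg = _mdfd_getseg(reln,
	                   entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
	                   false, EXTENSION_RETURN_NULL);
	if (seg != NULL && FileSync(seg->mdfd_vfd) >= 0)
		break;                  /* fsync succeeded */

	/*
	 * A vanished file (ENOENT) is tolerated exactly once, in the hope that
	 * AbsorbFsyncRequests() below brings in a cancellation for it.  If the
	 * entry is still present and not canceled on the second failure, the
	 * checkpoint dies with the ERROR seen in the primary's log above.
	 */
	if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
		ereport(ERROR,
		        (errcode_for_file_access(),
		         errmsg("could not fsync segment %u of relation %u/%u/%u: %m",
		                entry->tag.segno,
		                entry->tag.rnode.spcNode,
		                entry->tag.rnode.dbNode,
		                entry->tag.rnode.relNode)));

	AbsorbFsyncRequests();      /* may mark the entry canceled */
	if (entry->canceled)
		break;                  /* treated as if the fsync were done;
		                         * the entry is removed afterwards */
}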
To resolve the problem, the memory of the primaries' checkpointer processes was manipulated directly with a debugger (gdb): entry->canceled was set to true for the "faulty" entry. After this, CHECKPOINT was run cluster-wide and completed successfully, and the subsequent gprecoverseg succeeded.
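The exact debugger session is not recorded here. As a rough sketch only, assuming the checkpointer PID of an affected primary and symbol-level access to mdsync()'s local variables (which the production build described above does not actually provide, since it lacks debug information, so in practice this meant working with raw addresses and structure offsets), the manipulation would look roughly like:

gdb -p <checkpointer_pid>        # attach to the affected primary's checkpointer
(gdb) break md.c:2052            # just before the ereport(ERROR, ...) in mdsync()
(gdb) continue                   # now run CHECKPOINT on that segment
(gdb) print *entry               # confirm this is the "faulty" relfilenode
(gdb) set variable entry->canceled = 1
(gdb) continue                   # mdsync() skips the entry and the checkpoint completes
(gdb) detach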
The success of this solution implies that:
- There was no "faulty" entry in shared buffers
- The problem could have been resolved by a cluster restart (this was not attempted because the cause of the problem was not fully understood)
Cluster load
The cluster logs show that no ERRORs occurred in the 5 minutes (the checkpoint timeout on this cluster) preceding the mirror failure, except for this one:
...,cmd44,seg249,slice11,dx15653942,x19290166,sx1,"ERROR","XX000","Canceling query because of high VMEM usage. current group id is 16776, group memory usage 9472 MB, group shared memory quota is 3476 MB, slot memory quota is 336 MB, global freechunks memory is 480 MB, global safe memory threshold is 484 MB (runaway_cleaner.c:197)",,,,,,"...
set optimizer = on;
truncate table edw_stg_dm_efftp.tfct_oo_eff_tp_1_prt_p000153; -- Issue NOTE: This is a TRUNCATE of a single partition of a partitioned AO table; see its properties below
INSERT INTO edw_stg_dm_efftp.tfct_oo_eff_tp_1_prt_p000153 ...",0,,"runaway_cleaner.c",197,"Stack trace:
1 0x935f06 postgres errstart + 0x1f6
2 0x96b960 postgres RunawayCleaner_StartCleanup + 0x130
3 0x688f63 postgres ExecProcNode + 0x73
4 0x6a9c1a postgres ExecHashJoin + 0x1da
5 0x6892d1 postgres ExecProcNode + 0x3e1
6 0x6a9c1a postgres ExecHashJoin + 0x1da
7 0x6892d1 postgres ExecProcNode + 0x3e1
8 0x6a9c1a postgres ExecHashJoin + 0x1da
9 0x6892d1 postgres ExecProcNode + 0x3e1
10 0x6ba7b9 postgres ExecMotion + 0x39
11 0x6893d1 postgres ExecProcNode + 0x4e1
12 0x68643a postgres <symbol not found> + 0x68643a
13 0x686bfa postgres ExecutorRun + 0x2fa
14 0x82184a postgres <symbol not found> + 0x82184a
15 0x821c66 postgres <symbol not found> + 0x821c66
16 0x822cbe postgres PortalRun + 0x2ae
17 0x81b175 postgres <symbol not found> + 0x81b175
18 0x820706 postgres PostgresMain + 0x2246
19 0x7c2d16 postgres <symbol not found> + 0x7c2d16
20 0x7c4702 postgres PostmasterMain + 0xc62
21 0x4cc1fd postgres main + 0x39d
22 0x7fb3531fc3d5 libc.so.6 __libc_start_main + 0xf5
23 0x4cc7ef postgres <symbol not found> + 0x4cc7ef
"
\d+ edw_stg_dm_efftp.tfct_oo_eff_tp_1_prt_p000153:
Compression Type: zlib
Compression Level: 3
Block Size: 32768
Checksum: t
Check constraints:
"tfct_oo_eff_tp_1_prt_p000153_check" CHECK (placeholderA = 153::smallint)
Inherits: edw_stg_dm_efftp.tfct_oo_eff_tp
Has OIDs: no
Options: orientation=row, appendonly=true, compresstype=zlib, compresslevel=3
Distributed by: (placeholderB, placeholderC)
Tablespace: "default_hdd"
Unfortunately, the error does not reproduce consistently.
However, this is not the first time it has been encountered: a similar issue happened on the same cluster a few months ago. This is what was seen in the logs shortly before the ERROR first appeared:
11:18:32 MSK,0,,,seg-1,,,,,"LOG","00000","PrimaryMirrorTransitionRequest: to primary/mirror mode PrimarySegment, data state InResync, host ng-gp-prod-1-s-17.dh.rt.ru, port 11002, peer ng-gp-prod-1-s-1.dh.rt.ru, peerPort 11502, force full resync",,,,,,,0,,"primary_mirror_mode.c",1196,
...
11:18:33.641685,,,p47468,th1282709632,,,,0,,,seg-1,,,,,"LOG","00000","CHANGETRACKING: found an MD record. is full resync 0, last lsn (881/-377257680) is transition to resync completed 0, is transition to insync completed 0",,,,,,,0,,"cdbresynchronizechangetracking.c",2529,
11:18:33.641713,,,p47468,th1282709632,,,,0,,,seg-1,,,,,"LOG","00000","'full resync 'false' resync lsn '371/E9838130(881/3917709616)' ', mirroring role 'primary role' mirroring state 'resync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary resync manager process(47468)' 'cdbfilerepresyncmanager.c' 'L356' 'FileRepResyncManager_SetEndResyncLSN'",,,,,,,0,,"cdbfilerep.c",1824,
At that time, the issue was resolved by a full cluster restart, but it did not receive proper attention.
The problem
The cause of this issue is unclear; namely, it is not clear how the "faulty" entry finds its way into the pendingOpsTable and stays there.
Related information
I have done some research into the GPDB 5.26 code related to the issue (primarily md.c); however, I only found that fsync requests are submitted by backends and processed asynchronously by the checkpointer, while TRUNCATE results in an immediate unlink(); see the sketch below.
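To illustrate that asymmetry, here is a condensed view of the two sides as they appear in upstream 8.3 code (names are taken from upstream; the 5X code wraps them in mirroring and persistent-table logic, and append-only relations follow a different path, so this is only an upstream-level illustration). The existence of the "faulty" entry suggests that the cancellation step did not take effect for that relfilenode, for reasons that remain unclear.

/*
 * Backend side (register_dirty_segment, condensed): after writing out a
 * dirty block, the backend does not fsync the file itself; the request is
 * queued for the process that performs checkpoints (the checkpointer in 5X)
 * and ends up in its pendingOpsTable.
 */
if (pendingOpsTable)
	RememberFsyncRequest(reln->smgr_rnode, seg->mdfd_segno);   /* we are that process */
else if (!ForwardFsyncRequest(reln->smgr_rnode, seg->mdfd_segno))
	FileSync(seg->mdfd_vfd);    /* shared queue full: fsync here (error check omitted) */

/*
 * Unlink side (mdunlink, condensed): the physical file is removed
 * immediately, but any pending fsync requests for it are supposed to be
 * forgotten first; that is the request which cancels or removes the
 * corresponding pendingOpsTable entry in the checkpointer.
 */
ForgetRelationFsyncRequests(rnode);
path = relpath(rnode);
unlink(path);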
This issue may be related to https://github.com/greenplum-db/gpdb/issues/9034.
https://github.com/greenplum-db/gpdb/pull/9425 seems to fix the issue, but the code there is for master and cannot be backported to 5X, as md.c has undergone huge changes in Postgres 9.0 as well as in recent GPDB versions. The test introduced in that PR does not work on 5X either.
Hi @leskin-in ,
We will close this issue because GP5 is reaching end of life and it is time to migrate to 6X. Please feel free to reopen it if this issue still affects you. Thanks!