5X: could not fsync segment N of relation N/N/N: No such file or directory

leskin-in opened this issue 5 years ago · 1 comment

Greenplum version or build

5.26

OS version and uname -a

Red Hat Enterprise Linux Server release 7.6 (Maipo)

Linux ... 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Installation information

pg_config

CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/lib/gpdb/include
CFLAGS = -O3 -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -fno-aggressive-loop-optimizations -Wno-unused-but-set-variable -Wno-address -I/usr/lib/gpdb/include
CFLAGS_SL = -fpic
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/usr/lib/gpdb/lib',--enable-new-dtags -L/usr/lib/gpdb/lib
LDFLAGS_SL = 
LIBS = -lpgport -lgpopt -lnaucrates -lgpdbcost -lgpos -lxerces-c -lxslt -lxml2 -lrt -lkrb5 -lcom_err -lgssapi_krb5 -lzstd -lcrypt -ldl -lm -L/usr/lib/gpdb/lib
VERSION = PostgreSQL 8.3.23

The build is without asserts or debug information.

Actual behaviour

Two mirrors fail in a cluster of ~500 segments (including mirrors).

gprecoverseg is then run. The utility itself completes successfully; however, the resync process fails with the following errors:

  • Primary
2020-09-20 00:00:00.278247 MSK,,,p705,th657053824,,,,0,,,seg-1,,,,,"ERROR","58P01","could not fsync segment 0 of relation 16734/16384/15730654: No such file or directory",,,,,,,0,,"md.c",2058,"Stack trace:
1    0x935f06 postgres errstart + 0x1f6
2    0x812f99 postgres mdsync + 0x159
3    0x7f5c1b postgres CheckPointBuffers + 0xbb
4    0x527afd postgres CreateCheckPoint + 0x47d
5    0x7b1d21 postgres CheckpointerMain + 0x2c1
6    0x580125 postgres AuxiliaryProcessMain + 0x475
7    0x7ba610 postgres <symbol not found> + 0x7ba610
8    0x7ba8b2 postgres <symbol not found> + 0x7ba8b2
9    0x7bca8d postgres <symbol not found> + 0x7bca8d
10   0x7c3312 postgres <symbol not found> + 0x7c3312
11   0x7c4702 postgres PostmasterMain + 0xc62
12   0x4cc1fd postgres main + 0x39d
13   0x7f53229533d5 libc.so.6 __libc_start_main + 0xf5
14   0x4cc7ef postgres <symbol not found> + 0x4cc7ef
"
  • Mirror
2020-09-19 01:01:16.006443 MSK,,,p56738,th1478842496,,,,0,,,seg-1,,,,,"LOG","00000","'unlink failed identifier '/data2/mirror/gpseg0/pg_xlog/' errno '21' ', mirroring role 'mirror role' mirroring state 'resync' segment state 'up and running' filerep state 'up and running' process name(pid) 'mirror consumer process(56738)' 'cdbfilerepmirror.c' 'L2490' 'FileRepMirror_DropFilesFromDir'",,,,,,,0,,"cdbfilerep.c",1824,
2020-09-19 01:01:17.884089 MSK,,,p56737,th1478842496,,,2000-01-01 03:00:00 MSK,0,,,seg-1,,,,,"WARNING","XX000","receive close on connection: Success (cdbfilerepconnserver.c:343)",,,,,,,0,,"cdbfilerepconnserver.c",343,
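
For context: errno 21 on Linux is EISDIR, i.e. the unlink() in the first message was given a directory path ('/data2/mirror/gpseg0/pg_xlog/'). A minimal stand-alone illustration (made-up names, not GPDB code):

```c
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    /* unlink() refuses to remove a directory; on Linux it sets EISDIR. */
    mkdir("demo_dir", 0700);
    if (unlink("demo_dir") != 0)    /* fails with errno 21 (EISDIR) */
        printf("unlink failed: errno %d (%s)\n", errno, strerror(errno));
    rmdir("demo_dir");
    return 0;
}
```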

Solution

The error originates from https://github.com/greenplum-db/gpdb/blob/eb3677e9921499cb403e288c83074df37cf704c0/src/backend/storage/smgr/md.c#L2052

Its direct cause is the presence of an entry (https://github.com/greenplum-db/gpdb/blob/eb3677e9921499cb403e288c83074df37cf704c0/src/backend/storage/smgr/md.c#L1947) in the pendingOpsTable that requests an fsync for a relfilenode which has already been deleted (the "faulty" relfilenode). There is also no relation in pg_class with the "faulty" relfilenode.
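
The message text follows directly from this: to honor a pending request, the checkpointer must (re)open the segment file before it can fsync it, and for an already-deleted relfilenode that open fails with ENOENT. A minimal stand-alone illustration of this generic failure mode (made-up file name, plain C, not md.c code):

```c
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* A backend creates and dirties a segment file; an fsync request
     * for it is queued to be handled at the next checkpoint. */
    int fd = open("segfile.0", O_RDWR | O_CREAT, 0600);
    if (fd >= 0 && write(fd, "data", 4) != 4)
        perror("write");
    close(fd);

    /* The relation is then deleted: the file is unlinked immediately,
     * while the queued fsync request (erroneously) survives. */
    unlink("segfile.0");

    /* Checkpoint time: reopening the file for the fsync fails. */
    fd = open("segfile.0", O_RDWR);
    if (fd < 0)     /* prints: ... No such file or directory */
        printf("could not fsync segment: %s\n", strerror(errno));
    return 0;
}
```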

To resolve the problem, the memory of the primaries' checkpointer processes was manipulated directly with a debugger (gdb): entry->canceled was set to true. After this, CHECKPOINT was run cluster-wide and finished successfully, and gprecoverseg then succeeded.
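
For reference, the entry whose flag was flipped looks roughly like this in 8.3-era md.c (a paraphrase, with stand-in typedefs so the sketch compiles on its own); when canceled is true, mdsync() drops the request instead of erroring out:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL typedefs, to keep the sketch self-contained. */
typedef uint32_t Oid;
typedef uint32_t BlockNumber;
typedef uint16_t CycleCtr;

/* On-disk identity of a relation; in the error above: 16734/16384/15730654. */
typedef struct RelFileNode
{
    Oid         spcNode;        /* tablespace OID */
    Oid         dbNode;         /* database OID */
    Oid         relNode;        /* relfilenode */
} RelFileNode;

typedef struct
{
    RelFileNode rnode;          /* which relation to fsync */
    BlockNumber segno;          /* which 1 GB segment of it ("segment 0") */
} PendingOperationTag;

typedef struct
{
    PendingOperationTag tag;    /* pendingOpsTable hash key */
    bool        canceled;       /* true => skip the request (set via gdb) */
    CycleCtr    cycle_ctr;      /* sync cycle in which the request arrived */
} PendingOperationEntry;
```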

The success of this solution implies that:

  • There was no "faulty" entry in shared buffers
  • The problem could have been resolved by a cluster restart (this was not attempted because the cause of the problem was not fully understood)

Cluster load

The cluster logs show that no ERRORs occurred in the 5 minutes (the checkpoint timeout on this cluster) preceding the mirror failure, except for one:

...,cmd44,seg249,slice11,dx15653942,x19290166,sx1,"ERROR","XX000","Canceling query because of high VMEM usage. current group id is 16776, group memory usage 9472 MB, group shared memory quota is 3476 MB, slot memory quota is 336 MB, global freechunks memory is 480 MB, global safe memory threshold is 484 MB (runaway_cleaner.c:197)",,,,,,"...
set optimizer = on;

truncate table edw_stg_dm_efftp.tfct_oo_eff_tp_1_prt_p000153;  -- Issue NOTE: This is a TRUNCATE of a single partition of a partitioned AO table; see its properties below

INSERT INTO edw_stg_dm_efftp.tfct_oo_eff_tp_1_prt_p000153 ...",0,,"runaway_cleaner.c",197,"Stack trace:
1    0x935f06 postgres errstart + 0x1f6
2    0x96b960 postgres RunawayCleaner_StartCleanup + 0x130
3    0x688f63 postgres ExecProcNode + 0x73
4    0x6a9c1a postgres ExecHashJoin + 0x1da
5    0x6892d1 postgres ExecProcNode + 0x3e1
6    0x6a9c1a postgres ExecHashJoin + 0x1da
7    0x6892d1 postgres ExecProcNode + 0x3e1
8    0x6a9c1a postgres ExecHashJoin + 0x1da
9    0x6892d1 postgres ExecProcNode + 0x3e1
10   0x6ba7b9 postgres ExecMotion + 0x39
11   0x6893d1 postgres ExecProcNode + 0x4e1
12   0x68643a postgres <symbol not found> + 0x68643a
13   0x686bfa postgres ExecutorRun + 0x2fa
14   0x82184a postgres <symbol not found> + 0x82184a
15   0x821c66 postgres <symbol not found> + 0x821c66
16   0x822cbe postgres PortalRun + 0x2ae
17   0x81b175 postgres <symbol not found> + 0x81b175
18   0x820706 postgres PostgresMain + 0x2246
19   0x7c2d16 postgres <symbol not found> + 0x7c2d16
20   0x7c4702 postgres PostmasterMain + 0xc62
21   0x4cc1fd postgres main + 0x39d
22   0x7fb3531fc3d5 libc.so.6 __libc_start_main + 0xf5
23   0x4cc7ef postgres <symbol not found> + 0x4cc7ef
"

\d+ edw_stg_dm_efftp.tfct_oo_eff_tp_1_prt_p000153:

Compression Type: zlib
Compression Level: 3
Block Size: 32768
Checksum: t
Check constraints:
    "tfct_oo_eff_tp_1_prt_p000153_check" CHECK (placeholderA = 153::smallint)
Inherits: edw_stg_dm_efftp.tfct_oo_eff_tp
Has OIDs: no
Options: orientation=row, appendonly=true, compresstype=zlib, compresslevel=3
Distributed by: (placeholderB, placeholderC)
Tablespace: "default_hdd"

Unfortunately, the error cannot be reproduced consistently.

However, this is not the first time it has been encountered: a similar issue happened on the same cluster a few months ago. This is what was seen in the logs shortly before the ERROR first appeared:

11:18:32 MSK,0,,,seg-1,,,,,"LOG","00000","PrimaryMirrorTransitionRequest: to primary/mirror mode PrimarySegment, data state InResync, host ng-gp-prod-1-s-17.dh.rt.ru, port 11002, peer ng-gp-prod-1-s-1.dh.rt.ru, peerPort 11502, force full resync",,,,,,,0,,"primary_mirror_mode.c",1196,
...
11:18:33.641685,,,p47468,th1282709632,,,,0,,,seg-1,,,,,"LOG","00000","CHANGETRACKING: found an MD record. is full resync 0, last lsn (881/-377257680) is transition to resync completed 0, is transition to insync completed 0",,,,,,,0,,"cdbresynchronizechangetracking.c",2529,
11:18:33.641713,,,p47468,th1282709632,,,,0,,,seg-1,,,,,"LOG","00000","'full resync 'false' resync lsn '371/E9838130(881/3917709616)' ', mirroring role 'primary role' mirroring state 'resync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary resync manager process(47468)' 'cdbfilerepresyncmanager.c' 'L356' 'FileRepResyncManager_SetEndResyncLSN'",,,,,,,0,,"cdbfilerep.c",1824,

At that time, the issue was resolved by a full cluster restart, but it did not receive proper attention.

The problem

The cause of this issue is unclear; namely, it is not clear how the "faulty" entry finds its way into the pendingOpsTable and remains there.

Related information

I have done some research into the GPDB 5.26 code related to the issue (primarily md.c); however, I only found that fsync requests are submitted by backends and processed asynchronously by the checkpointer, while TRUNCATE results in an immediate unlink().
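
To illustrate the suspected window: the TRUNCATE-side unlink() is supposed to be paired with a "forget" (ForgetRelationFsyncRequests() in md.c) that cancels any queued fsync request for the relfilenode; if that cancellation is lost, the checkpointer later fails exactly as reported. A toy stand-alone model of this timing (plain C, not GPDB code; struct pending is a made-up stand-in for a pendingOpsTable entry):

```c
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <stdbool.h>
#include <fcntl.h>
#include <unistd.h>

/* Toy stand-in for a single pendingOpsTable entry. */
struct pending
{
    const char *path;
    bool        canceled;
};

/* Checkpointer side: handle one queued request, roughly as mdsync() does. */
static void checkpoint(struct pending *req)
{
    if (req->canceled)
        return;                         /* request was forgotten in time */
    int fd = open(req->path, O_RDWR);   /* must reopen before fsyncing */
    if (fd < 0)
    {
        printf("could not fsync %s: %s\n", req->path, strerror(errno));
        return;
    }
    fsync(fd);
    close(fd);
}

int main(void)
{
    /* Backend: dirties a segment file and queues an fsync request. */
    close(open("seg.0", O_RDWR | O_CREAT, 0600));
    struct pending req = { "seg.0", false };

    /* TRUNCATE path: unlink happens immediately; the matching "forget"
     * should flip req.canceled at (or before) this point. */
    unlink("seg.0");
    /* req.canceled = true;   <- the step that apparently goes missing */

    checkpoint(&req);   /* -> could not fsync seg.0: No such file or directory */
    return 0;
}
```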

This issue may be related to https://github.com/greenplum-db/gpdb/issues/9034.

https://github.com/greenplum-db/gpdb/pull/9425 seems to fix the issue, but the code there is for master and cannot be backported to 5X, as md.c has undergone huge changes in PostgreSQL 9.0 as well as in recent GPDB versions. The test introduced in that PR does not work on 5X either.

leskin-in · Oct 06 '20 12:10

Hi @leskin-in,

We will close this issue because GP5 is reaching end of life and it is time to migrate to 6X. Please feel free to reopen it if this issue still affects you. Thanks!

lij55 · Feb 15 '22 09:02