gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

ERROR: |58P01|could not open file "base/863243/t_40704489": No such file or directory

Open cobolbaby opened this issue 2 years ago • 9 comments

Bug Report

Greenplum version or build

  • 6.19.1

OS version and uname -a

autoconf options used ( config.status --config )

Installation information ( pg_config )

Expected behavior

Actual behavior

$ gpssh -f /opt/greenplum/config/seg_hosts -e 'source /usr/local/greenplum-db/greenplum_path.sh ; gplogfilter -t /disk3/gpdata/gpsegment/primary/*/pg_log/gpdb-2022-07-16_090208.csv' >> seg_error.log
$ cat seg_error.log
...
[gp6sdw3] 2022-07-16 11:08:39.518195 CST|||p248|th1436305536||||0|||seg10|||||ERROR: |58P01|could not open file "base/863243/t_40704489": No such file or directory|||||writing block 1 of relation base/863243/t_40704489||0||md.c|665|Stack trace:
[gp6sdw3] 1    0xbf112c postgres errstart (elog.c:557)
[gp6sdw3] 2    0xa7ee33 postgres <symbol not found> (md.c:663)
[gp6sdw3] 3    0xa817f7 postgres mdwrite (md.c:1937)
[gp6sdw3] 4    0xa4f40a postgres <symbol not found> (bufmgr.c:2193)
[gp6sdw3] 5    0xa4fdb7 postgres BgBufferSync (bufmgr.c:2105)
[gp6sdw3] 6    0x9ff483 postgres BackgroundWriterMain (bgwriter.c:289)
[gp6sdw3] 7    0x785953 postgres AuxiliaryProcessMain (bootstrap.c:389)
[gp6sdw3] 8    0xa0d535 postgres <symbol not found> (postmaster.c:5854)
[gp6sdw3] 9    0xa0dba5 postgres <symbol not found> (postmaster.c:3262)
[gp6sdw3] 10   0x7dd152fbb630 libpthread.so.0 <symbol not found> + 0x52fbb630
[gp6sdw3] 11   0x7dd152434983 libc.so.6 __select + 0x13
[gp6sdw3] 12   0x6ad905 postgres <symbol not found> (postmaster.c:1894)
[gp6sdw3] 13   0xa0f982 postgres PostmasterMain (postmaster.c:1523)
[gp6sdw3] 14   0x6b24a1 postgres main (main.c:205)
[gp6sdw3] 15   0x7dd152361555 libc.so.6 __libc_start_main + 0xf5
[gp6sdw3] 16   0x6be1bc postgres <symbol not found> + 0x6be1bc
...

seg_error.log

Step to reproduce the behavior

cobolbaby avatar Jul 16 '22 03:07 cobolbaby

Hi, @cobolbaby

Thanks for reporting the issue.

Would you please paste the following SQL result (run them in the same session) here?

set gp_select_invisible = 1;
select * from gp_dist_random('pg_class') where relfilenode = 40704489;

kainwen avatar Jul 16 '22 09:07 kainwen

In order to reduce the impact on online business, I took seg10 offline first. After switching to mirror, the error does not happen again.

pg_ctl -D /disk3/gpdata/gpsegment/primary/gpseg10 stop -m fast

In addition, I executed the following command before I offline the seg10, but no result was returned. Maybe gp_dist_random works, but I didn't get the idea at that time. Next time I encounter this problem I will record it.

SELECT datname FROM pg_database WHERE oid = 863243;

\c xxx;

SELECT c.relname, c.relkind, n.nspname 
FROM pg_class c
	JOIN pg_namespace n on n.oid = c.relnamespace
WHERE c.relfilenode = 40704489;

cobolbaby avatar Jul 17 '22 02:07 cobolbaby

@cobolbaby do you set gp_select_invisible?

kainwen avatar Jul 17 '22 06:07 kainwen

do you set gp_select_invisible?

No. I have not set this parameter before.

cobolbaby avatar Jul 17 '22 07:07 cobolbaby

do you set gp_select_invisible?

No. I have not set this parameter before.

So would you please collect the result as I mentioned in https://github.com/greenplum-db/gpdb/issues/13809#issuecomment-1186130642

set gp_select_invisible = 1;
select * from gp_dist_random('pg_class') where relfilenode = 40704489;
FROM pg_class c
	JOIN pg_namespace n on n.oid = c.relnamespace
WHERE c.relfilenode = 40704489;

kainwen avatar Jul 17 '22 07:07 kainwen

It should be an intermediate temporary table, which will be destroyed when it is used up.

F6_BDC=# set gp_select_invisible = 1;
SET
F6_BDC=# select * from gp_dist_random('pg_class') where relfilenode = 40704489;
F6_BDC=# 
F6_BDC=# 
F6_BDC=# select count(1) from gp_dist_random('pg_class') where relfilenode = 40704489;
 count 
-------
     0
(1 行记录)

Same Question:

2022-07-17 19:01:03.039771 CST,,,p2287108,th448833664,,,,0,,,seg2,,,,,"ERROR","58P01","could not open file ""base/863243/t_40811755"": No such file or directory",,,,,"writing block 63 of relation base/863243/t_40811755",,0,,"md.c",665,"Stack trace:
1    0xbf112c postgres errstart (elog.c:557)
2    0xa7ee33 postgres <symbol not found> (md.c:663)
3    0xa817f7 postgres mdwrite (md.c:1937)
4    0xa4f40a postgres <symbol not found> (bufmgr.c:2193)
5    0xa4fdb7 postgres BgBufferSync (bufmgr.c:2105)
6    0x9ff483 postgres BackgroundWriterMain (bgwriter.c:289)
7    0x785953 postgres AuxiliaryProcessMain (bootstrap.c:389)
8    0xa0d535 postgres <symbol not found> (postmaster.c:5854)
9    0xa0e3c7 postgres <symbol not found> (postmaster.c:5523)
10   0x78f918201630 libpthread.so.0 <symbol not found> + 0x18201630
11   0x78f91767a983 libc.so.6 __select + 0x13
12   0x6ad905 postgres <symbol not found> (postmaster.c:1894)
13   0xa0f982 postgres PostmasterMain (postmaster.c:1523)
14   0x6b24a1 postgres main (main.c:205)
15   0x78f9175a7555 libc.so.6 __libc_start_main + 0xf5
16   0x6be1bc postgres <symbol not found> + 0x6be1bc
"
F6_BDC=# set gp_select_invisible = 1;
SET
F6_BDC=# select * from gp_dist_random('pg_class') where relfilenode = 40811755;
WARNING:  could not write block 33 of base/863243/t_40811755  (seg2 slice1 10.13.0.31:40002 pid=245910)
描述:  Multiple failures --- write error might be permanent.
ERROR:  could not open file "base/863243/t_40811755": No such file or directory  (seg2 slice1 10.13.0.31:40002 pid=245910)
背景:  writing block 33 of relation base/863243/t_40811755
F6_BDC=# -- retry...
F6_BDC=# select * from gp_dist_random('pg_class') where relfilenode = 40811755;
            relname            | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | reltoastrelid | relhasindex | relisshared | relpersistence | relkind | relstorage | relnatts | relchecks | relhasoids | relhaspkey | relhasrules | relhastriggers | relhassubclass | relispopulated | relreplident | relfrozenxid | relminmxid | relacl | reloptions 
-------------------------------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+------------+----------+-----------+------------+------------+-------------+----------------+----------------+----------------+--------------+--------------+------------+--------+------------
 idx_temp_mes_detail_group_key |     85946806 |       0 |         0 |    17133 |   403 |    40811755 |             0 |      135 |     65369 |             0 |             0 | f           | f           | t              | i       | h          |        1 |         0 | f          | f          | f           | f              | f              | t              | n            |            0 |          0 |        | 
(1 行记录)

This problem seems to be caused by gp_toolkit.gp_size_of_schema_disk?

2022-07-17 15:54:40.133290 CST,"bdctester","F6_BDC",p541320,th-1260595072,"192.168.2.12","57288",2022-07-16 09:33:54 CST,0,con18637748,cmd5528,seg-1,,,,sx1,"ERROR","58P01","could not open file ""base/863243/t_40786969"": No such file or directory  (seg10 10.13.0.31:50002 pid=4191267)",,,,,"writing block 5 of relation base/863243/t_40786969","SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk
",0,,"md.c",665,
2022-07-17 15:54:40.133349 CST,"bdctester","F6_BDC",p541320,th-1260595072,"192.168.2.12","57288",2022-07-16 09:33:54 CST,0,con18637748,cmd5528,seg-1,,,,sx1,"LOG","00000","An exception was encountered during the execution of statement: SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk
",,,,,,"SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk
",0,,"elog.c",2116,
2022-07-17 16:10:42.477387 CST,,,p568,th-1260595072,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
WARNING:  could not write block 4 of base/863243/t_40786969  (seg10 slice1 10.13.0.31:50002 pid=24170)
DETAIL:  Multiple failures --- write error might be permanent.
WARNING:  could not write block 5 of base/863243/t_40786969  (seg10 slice1 10.13.0.31:50002 pid=24205)
DETAIL:  Multiple failures --- write error might be permanent.
",,,,,,,,"SysLoggerMain","syslogger.c",742,

F6_BDC=# SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk;
WARNING:  could not write block 4 of base/863243/t_40786969  (seg10 10.13.0.31:50002 pid=139685)
描述:  Multiple failures --- write error might be permanent.
ERROR:  could not open file "base/863243/t_40786969": No such file or directory  (seg10 10.13.0.31:50002 pid=139685)
背景:  writing block 4 of relation base/863243/t_40786969
F6_BDC=# select count(1) from gp_dist_random('pg_class') where relfilenode = 40786969;
 count 
-------
     0
(1 行记录)

F6_BDC=# SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk;
ERROR:  non-MVCC snapshots are not supported in index-only scans (nodeIndexonlyscan.c:136)

cobolbaby avatar Jul 17 '22 09:07 cobolbaby

The direct reason seems that there is a bad buffer in shared buffer, a buffer is dirty and point to the not-exists path. Then any SQL in theory might lead to the error shown in the logs: if the SQL decide to switch out just this dirty page.

But what is the root cause leading to such case is still unknown.

Do you still have the XLOGs in the errored segments?

kainwen avatar Jul 17 '22 12:07 kainwen

But what is the root cause leading to such case is still unknown.

@kainwen The same problem happened to another index. I suspect that gp_toolkit.gp_size_of_schema_disk is causing these problems when calculating the index, because I have never met such a problem before executing SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk; regularly.

[gp6sdw1] 2022-07-18 09:46:56.764211 CST|||p262646|th-1860507520||||0|||seg2|||||ERROR: |58P01|could not open file "base/863243/t_40830378": No such file or directory|||||writing block 6 of relation base/863243/t_40830378||0||md.c|665|Stack trace:
[gp6sdw1] 1    0xbf112c postgres errstart (elog.c:557)
[gp6sdw1] 2    0xa7ee33 postgres <symbol not found> (md.c:663)
[gp6sdw1] 3    0xa817f7 postgres mdwrite (md.c:1937)
[gp6sdw1] 4    0xa4f40a postgres <symbol not found> (bufmgr.c:2193)
[gp6sdw1] 5    0xa4fdb7 postgres BgBufferSync (bufmgr.c:2105)
[gp6sdw1] 6    0x9ff483 postgres BackgroundWriterMain (bgwriter.c:289)
[gp6sdw1] 7    0x785953 postgres AuxiliaryProcessMain (bootstrap.c:389)
[gp6sdw1] 8    0xa0d535 postgres <symbol not found> (postmaster.c:5854)
[gp6sdw1] 9    0xa0dba5 postgres <symbol not found> (postmaster.c:3262)
[gp6sdw1] 10   0x73088e7a5630 libpthread.so.0 <symbol not found> + 0x8e7a5630
[gp6sdw1] 11   0x73088dc1e983 libc.so.6 __select + 0x13
[gp6sdw1] 12   0x6ad905 postgres <symbol not found> (postmaster.c:1894)
[gp6sdw1] 13   0xa0f982 postgres PostmasterMain (postmaster.c:1523)
[gp6sdw1] 14   0x6b24a1 postgres main (main.c:205)
[gp6sdw1] 15   0x73088db4b555 libc.so.6 __libc_start_main + 0xf5
[gp6sdw1] 16   0x6be1bc postgres <symbol not found> + 0x6be1bc
[gp6sdw1]
F6_BDC=# set gp_select_invisible = 1;
SET
F6_BDC=# select gp_segment_id,* from gp_dist_random('pg_class') where relfilenode = 40830378;
F6_BDC=# select gp_segment_id,relname from gp_dist_random('pg_class') where relfilenode = 40830378;
 gp_segment_id |                  relname                  
---------------+-------------------------------------------
            12 | temp_pca_model
            12 | temp_pca_model
            10 | pca_pca_stage_type
             9 | temp_sn
             2 | idx_temp_sn_location_orig_mcbsno_location
            15 | pg_toast_85999883
(6 行记录)

Do you still have the XLOGs in the errored segments?

How to analyze XLOGs?

2022-07-17 21-04-53屏幕截图

cobolbaby avatar Jul 17 '22 12:07 cobolbaby

@cobolbaby You can grep xlog by

find . -name '000000030*' | xargs -i pg_xlogdump {} | grep 40704489

Please provide the outputs, thanks.

interma avatar Jul 23 '22 11:07 interma

When will the temporary table relfile not exist, but shared_buffer has dirty buffer, Exclude manual deletion of relfile

chufei333 avatar Feb 18 '23 14:02 chufei333

This https://github.com/greenplum-db/gpdb/pull/14949 explains the scenario and also has test to demonstrate the same

ashwinstar avatar Feb 18 '23 15:02 ashwinstar

Close the issue as a quick fix https://github.com/greenplum-db/gpdb/pull/15035 was merged.

haolinw avatar Mar 01 '23 00:03 haolinw