gpdb
gpdb copied to clipboard
ERROR: |58P01|could not open file "base/863243/t_40704489": No such file or directory
Bug Report
Greenplum version or build
- 6.19.1
OS version and uname -a
autoconf options used ( config.status --config )
Installation information ( pg_config )
Expected behavior
Actual behavior
$ gpssh -f /opt/greenplum/config/seg_hosts -e 'source /usr/local/greenplum-db/greenplum_path.sh ; gplogfilter -t /disk3/gpdata/gpsegment/primary/*/pg_log/gpdb-2022-07-16_090208.csv' >> seg_error.log
$ cat seg_error.log
...
[gp6sdw3] 2022-07-16 11:08:39.518195 CST|||p248|th1436305536||||0|||seg10|||||ERROR: |58P01|could not open file "base/863243/t_40704489": No such file or directory|||||writing block 1 of relation base/863243/t_40704489||0||md.c|665|Stack trace:
[gp6sdw3] 1 0xbf112c postgres errstart (elog.c:557)
[gp6sdw3] 2 0xa7ee33 postgres <symbol not found> (md.c:663)
[gp6sdw3] 3 0xa817f7 postgres mdwrite (md.c:1937)
[gp6sdw3] 4 0xa4f40a postgres <symbol not found> (bufmgr.c:2193)
[gp6sdw3] 5 0xa4fdb7 postgres BgBufferSync (bufmgr.c:2105)
[gp6sdw3] 6 0x9ff483 postgres BackgroundWriterMain (bgwriter.c:289)
[gp6sdw3] 7 0x785953 postgres AuxiliaryProcessMain (bootstrap.c:389)
[gp6sdw3] 8 0xa0d535 postgres <symbol not found> (postmaster.c:5854)
[gp6sdw3] 9 0xa0dba5 postgres <symbol not found> (postmaster.c:3262)
[gp6sdw3] 10 0x7dd152fbb630 libpthread.so.0 <symbol not found> + 0x52fbb630
[gp6sdw3] 11 0x7dd152434983 libc.so.6 __select + 0x13
[gp6sdw3] 12 0x6ad905 postgres <symbol not found> (postmaster.c:1894)
[gp6sdw3] 13 0xa0f982 postgres PostmasterMain (postmaster.c:1523)
[gp6sdw3] 14 0x6b24a1 postgres main (main.c:205)
[gp6sdw3] 15 0x7dd152361555 libc.so.6 __libc_start_main + 0xf5
[gp6sdw3] 16 0x6be1bc postgres <symbol not found> + 0x6be1bc
...
Step to reproduce the behavior
Hi, @cobolbaby
Thanks for reporting the issue.
Would you please paste the following SQL result (run them in the same session) here?
set gp_select_invisible = 1;
select * from gp_dist_random('pg_class') where relfilenode = 40704489;
In order to reduce the impact on online business, I took seg10 offline first. After switching to mirror, the error does not happen again.
pg_ctl -D /disk3/gpdata/gpsegment/primary/gpseg10 stop -m fast
In addition, I executed the following command before I offline the seg10, but no result was returned. Maybe gp_dist_random
works, but I didn't get the idea at that time. Next time I encounter this problem I will record it.
SELECT datname FROM pg_database WHERE oid = 863243;
\c xxx;
SELECT c.relname, c.relkind, n.nspname
FROM pg_class c
JOIN pg_namespace n on n.oid = c.relnamespace
WHERE c.relfilenode = 40704489;
@cobolbaby do you set gp_select_invisible?
do you set gp_select_invisible?
No. I have not set this parameter before.
do you set gp_select_invisible?
No. I have not set this parameter before.
So would you please collect the result as I mentioned in https://github.com/greenplum-db/gpdb/issues/13809#issuecomment-1186130642
set gp_select_invisible = 1;
select * from gp_dist_random('pg_class') where relfilenode = 40704489;
FROM pg_class c
JOIN pg_namespace n on n.oid = c.relnamespace
WHERE c.relfilenode = 40704489;
It should be an intermediate temporary table, which will be destroyed when it is used up.
F6_BDC=# set gp_select_invisible = 1;
SET
F6_BDC=# select * from gp_dist_random('pg_class') where relfilenode = 40704489;
F6_BDC=#
F6_BDC=#
F6_BDC=# select count(1) from gp_dist_random('pg_class') where relfilenode = 40704489;
count
-------
0
(1 行记录)
Same Question:
2022-07-17 19:01:03.039771 CST,,,p2287108,th448833664,,,,0,,,seg2,,,,,"ERROR","58P01","could not open file ""base/863243/t_40811755"": No such file or directory",,,,,"writing block 63 of relation base/863243/t_40811755",,0,,"md.c",665,"Stack trace:
1 0xbf112c postgres errstart (elog.c:557)
2 0xa7ee33 postgres <symbol not found> (md.c:663)
3 0xa817f7 postgres mdwrite (md.c:1937)
4 0xa4f40a postgres <symbol not found> (bufmgr.c:2193)
5 0xa4fdb7 postgres BgBufferSync (bufmgr.c:2105)
6 0x9ff483 postgres BackgroundWriterMain (bgwriter.c:289)
7 0x785953 postgres AuxiliaryProcessMain (bootstrap.c:389)
8 0xa0d535 postgres <symbol not found> (postmaster.c:5854)
9 0xa0e3c7 postgres <symbol not found> (postmaster.c:5523)
10 0x78f918201630 libpthread.so.0 <symbol not found> + 0x18201630
11 0x78f91767a983 libc.so.6 __select + 0x13
12 0x6ad905 postgres <symbol not found> (postmaster.c:1894)
13 0xa0f982 postgres PostmasterMain (postmaster.c:1523)
14 0x6b24a1 postgres main (main.c:205)
15 0x78f9175a7555 libc.so.6 __libc_start_main + 0xf5
16 0x6be1bc postgres <symbol not found> + 0x6be1bc
"
F6_BDC=# set gp_select_invisible = 1;
SET
F6_BDC=# select * from gp_dist_random('pg_class') where relfilenode = 40811755;
WARNING: could not write block 33 of base/863243/t_40811755 (seg2 slice1 10.13.0.31:40002 pid=245910)
描述: Multiple failures --- write error might be permanent.
ERROR: could not open file "base/863243/t_40811755": No such file or directory (seg2 slice1 10.13.0.31:40002 pid=245910)
背景: writing block 33 of relation base/863243/t_40811755
F6_BDC=# -- retry...
F6_BDC=# select * from gp_dist_random('pg_class') where relfilenode = 40811755;
relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | reltoastrelid | relhasindex | relisshared | relpersistence | relkind | relstorage | relnatts | relchecks | relhasoids | relhaspkey | relhasrules | relhastriggers | relhassubclass | relispopulated | relreplident | relfrozenxid | relminmxid | relacl | reloptions
-------------------------------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+------------+----------+-----------+------------+------------+-------------+----------------+----------------+----------------+--------------+--------------+------------+--------+------------
idx_temp_mes_detail_group_key | 85946806 | 0 | 0 | 17133 | 403 | 40811755 | 0 | 135 | 65369 | 0 | 0 | f | f | t | i | h | 1 | 0 | f | f | f | f | f | t | n | 0 | 0 | |
(1 行记录)
This problem seems to be caused by gp_toolkit.gp_size_of_schema_disk?
2022-07-17 15:54:40.133290 CST,"bdctester","F6_BDC",p541320,th-1260595072,"192.168.2.12","57288",2022-07-16 09:33:54 CST,0,con18637748,cmd5528,seg-1,,,,sx1,"ERROR","58P01","could not open file ""base/863243/t_40786969"": No such file or directory (seg10 10.13.0.31:50002 pid=4191267)",,,,,"writing block 5 of relation base/863243/t_40786969","SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk
",0,,"md.c",665,
2022-07-17 15:54:40.133349 CST,"bdctester","F6_BDC",p541320,th-1260595072,"192.168.2.12","57288",2022-07-16 09:33:54 CST,0,con18637748,cmd5528,seg-1,,,,sx1,"LOG","00000","An exception was encountered during the execution of statement: SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk
",,,,,,"SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk
",0,,"elog.c",2116,
2022-07-17 16:10:42.477387 CST,,,p568,th-1260595072,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
WARNING: could not write block 4 of base/863243/t_40786969 (seg10 slice1 10.13.0.31:50002 pid=24170)
DETAIL: Multiple failures --- write error might be permanent.
WARNING: could not write block 5 of base/863243/t_40786969 (seg10 slice1 10.13.0.31:50002 pid=24205)
DETAIL: Multiple failures --- write error might be permanent.
",,,,,,,,"SysLoggerMain","syslogger.c",742,
F6_BDC=# SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk;
WARNING: could not write block 4 of base/863243/t_40786969 (seg10 10.13.0.31:50002 pid=139685)
描述: Multiple failures --- write error might be permanent.
ERROR: could not open file "base/863243/t_40786969": No such file or directory (seg10 10.13.0.31:50002 pid=139685)
背景: writing block 4 of relation base/863243/t_40786969
F6_BDC=# select count(1) from gp_dist_random('pg_class') where relfilenode = 40786969;
count
-------
0
(1 行记录)
F6_BDC=# SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk;
ERROR: non-MVCC snapshots are not supported in index-only scans (nodeIndexonlyscan.c:136)
The direct reason seems that there is a bad buffer in shared buffer, a buffer is dirty and point to the not-exists path. Then any SQL in theory might lead to the error shown in the logs: if the SQL decide to switch out just this dirty page.
But what is the root cause leading to such case is still unknown.
Do you still have the XLOGs in the errored segments?
But what is the root cause leading to such case is still unknown.
@kainwen The same problem happened to another index. I suspect that gp_toolkit.gp_size_of_schema_disk is causing these problems when calculating the index, because I have never met such a problem before executing SELECT current_database() datname, sosdnsp AS nspname, sosdschematablesize AS table_size_bytes, sosdschemaidxsize AS idx_size_bytes FROM gp_toolkit.gp_size_of_schema_disk;
regularly.
[gp6sdw1] 2022-07-18 09:46:56.764211 CST|||p262646|th-1860507520||||0|||seg2|||||ERROR: |58P01|could not open file "base/863243/t_40830378": No such file or directory|||||writing block 6 of relation base/863243/t_40830378||0||md.c|665|Stack trace:
[gp6sdw1] 1 0xbf112c postgres errstart (elog.c:557)
[gp6sdw1] 2 0xa7ee33 postgres <symbol not found> (md.c:663)
[gp6sdw1] 3 0xa817f7 postgres mdwrite (md.c:1937)
[gp6sdw1] 4 0xa4f40a postgres <symbol not found> (bufmgr.c:2193)
[gp6sdw1] 5 0xa4fdb7 postgres BgBufferSync (bufmgr.c:2105)
[gp6sdw1] 6 0x9ff483 postgres BackgroundWriterMain (bgwriter.c:289)
[gp6sdw1] 7 0x785953 postgres AuxiliaryProcessMain (bootstrap.c:389)
[gp6sdw1] 8 0xa0d535 postgres <symbol not found> (postmaster.c:5854)
[gp6sdw1] 9 0xa0dba5 postgres <symbol not found> (postmaster.c:3262)
[gp6sdw1] 10 0x73088e7a5630 libpthread.so.0 <symbol not found> + 0x8e7a5630
[gp6sdw1] 11 0x73088dc1e983 libc.so.6 __select + 0x13
[gp6sdw1] 12 0x6ad905 postgres <symbol not found> (postmaster.c:1894)
[gp6sdw1] 13 0xa0f982 postgres PostmasterMain (postmaster.c:1523)
[gp6sdw1] 14 0x6b24a1 postgres main (main.c:205)
[gp6sdw1] 15 0x73088db4b555 libc.so.6 __libc_start_main + 0xf5
[gp6sdw1] 16 0x6be1bc postgres <symbol not found> + 0x6be1bc
[gp6sdw1]
F6_BDC=# set gp_select_invisible = 1;
SET
F6_BDC=# select gp_segment_id,* from gp_dist_random('pg_class') where relfilenode = 40830378;
F6_BDC=# select gp_segment_id,relname from gp_dist_random('pg_class') where relfilenode = 40830378;
gp_segment_id | relname
---------------+-------------------------------------------
12 | temp_pca_model
12 | temp_pca_model
10 | pca_pca_stage_type
9 | temp_sn
2 | idx_temp_sn_location_orig_mcbsno_location
15 | pg_toast_85999883
(6 行记录)
Do you still have the XLOGs in the errored segments?
How to analyze XLOGs?
@cobolbaby You can grep xlog by
find . -name '000000030*' | xargs -i pg_xlogdump {} | grep 40704489
Please provide the outputs, thanks.
When will the temporary table relfile not exist, but shared_buffer has dirty buffer, Exclude manual deletion of relfile
This https://github.com/greenplum-db/gpdb/pull/14949 explains the scenario and also has test to demonstrate the same
Close the issue as a quick fix https://github.com/greenplum-db/gpdb/pull/15035 was merged.