[6X] Add GUC ignore_invalid_pages
Cherry-pick of https://github.com/postgres/postgres/commit/41c184bc642b25f67fb1d8ee290f28805fa5a0b4
Here are some reminders before you submit the pull request
- [ ] Add tests for the change
- [ ] Document changes
- [ ] Communicate in the mailing list if needed
- [ ] Pass make installcheck
- [ ] Review a PR in return to support the community
Sure, helpful to backport. Just curious what prompted it - did you encounter the invalid-page case, and is there any possibility of sharing the scenario and whether the underlying cause can be fixed as well? Misuse of this GUC is my only concern: users not reporting issues and going forward with regular business after setting it. It also seems it doesn't provide guidance on taking it out of the config file once set, and such.
This setting allows making a partial restore of a table or database (for example in https://github.com/wal-g/wal-g). We plan to turn it on for the recovery period and then turn it off.
Should open up a PR against GPDB main branch as well to make things consistent.
> This setting allows making a partial restore of a table or database (for example in https://github.com/wal-g/wal-g). We plan to turn it on for the recovery period and then turn it off.
Please, can you elaborate with an example? Thanks.
Also, given the GUC is generically added to the code base, we should add text around its setting/resetting aspects in the GUC definition (I understand upstream should have done a better job).
> Please, can you elaborate with an example? Thanks.
Problem:
We have a backup of the whole Greenplum cluster, with WAL files, in S3. We want to restore only a specific table (say, table1), but we don't want to download the whole backup.
Current solution without ignore_invalid_pages:
- Run the command
wal-g backup-fetch 'some_backup' --restore-only=tbl1
which downloads and copies to the /base directory only the files related to table1. This command also sets up the recovery.conf files like this:
restore_command = '/usr/bin/wal-g seg wal-fetch "%f" "%p" --content-id=-1 --config /etc/wal-g/wal-g.yaml'
recovery_target_name = 'some_backup'
recovery_target_timeline = latest
- Run
gpstart -a
and sometimes(1) get this error:
UTC,,,p10640,th-1769571392,,,,0,,,seg0,,,,,"PANIC","XX000","WAL contains references to invalid pages",,,,,,,0,,"xlogutils.c",247,"Stack trace:
1 0x555e35f56091 postgres errstart + 0x1f1
2 0x555e35f58e1e postgres elog_finish + 0xce
3 0x555e35a791e6 postgres <symbol not found> + 0x35a791e6
4 0x555e35a67b0d postgres <symbol not found> + 0x35a67b0d
5 0x555e35a6ff0d postgres StartupXLOG + 0x14cd
6 0x555e35d56969 postgres StartupProcessMain + 0x139
7 0x555e35ab08cc postgres AuxiliaryProcessMain + 0x55c
8 0x555e35d52149 postgres <symbol not found> + 0x35d52149
9 0x555e35d56106 postgres PostmasterMain + 0x1186
10 0x555e359d61aa postgres main + 0x4aa
11 0x7ff393e53c87 libc.so.6 __libc_start_main + 0xe7
12 0x555e359e229a postgres _start + 0x2a
startup failed!
Solution with ignore_invalid_pages:
- We run the same command
wal-g backup-fetch 'some_backup' --restore-only=tbl1
but now it inserts ignore_invalid_pages = on into each segment's postgresql.conf
- Start cluster: no PANICs!
- Remove
ignore_invalid_pages = on
- Restart the cluster
- Use:
psql -p 6000 -c 'select * from table1'
Partial recovery was successful!
(1) This happens when we create a backup under load and there are some changes between pg_start_backup() and pg_stop_backup() to tables other than table1.
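For reference, a condensed sketch of the sequence above as one script; the segment data directory glob is a hypothetical placeholder, and the coordinator/mirror directories are omitted for brevity:

# Fetch only the relfiles for table1 (plus system relfiles); this also
# writes the recovery.conf files shown above.
wal-g backup-fetch 'some_backup' --restore-only=tbl1
# Let recovery tolerate references to the files we skipped.
for seg in /data/primary/gpseg*; do
    echo 'ignore_invalid_pages = on' >> "$seg/postgresql.conf"
done
gpstart -a    # recovery now completes instead of PANICking
# Drop the override once recovery is done, then restart.
for seg in /data/primary/gpseg*; do
    sed -i '/^ignore_invalid_pages/d' "$seg/postgresql.conf"
done
gpstop -ar
psql -p 6000 -c 'select * from table1'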
> Also, given the GUC is generically added to the code base, we should add text around its setting/resetting aspects in the GUC definition (I understand upstream should have done a better job).
Sorry, I didn't see the "Set classification" section right away. I think, to minimize the difference with postgres, we should use POSTMASTER as the GucContext. From this it follows that you need to choose restart instead of reload. From using restart it follows: system instead of session. Since this is a dangerous setting, we should be able to enable it only on the right segments, so I think we need to choose local instead of master.
I will update guc-list.html.md today
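To illustrate the local + restart classification, a rough sketch (not from this PR) of enabling the GUC on a single segment only; the data directory path is a hypothetical placeholder:

# PGC_POSTMASTER means the value is read only at server start, so a
# config reload is not enough - a restart is required.
echo 'ignore_invalid_pages = on' >> /data/primary/gpseg0/postgresql.conf
gpstop -ar
# Check the effective value on the coordinator and segments.
gpconfig -s ignore_invalid_pages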
Thanks a lot @nikifkon for all those clarifications - better understand the purpose and usage for this GUC.
Interesting... reads fancy, though I have my doubts about how much of this is practically possible. I am guessing there is no skipping of catalog/system tables possible - so you will have to identify all those relfiles and replay the contents for them; I am not sure how such identification is done. Also, resolving all other related table dependencies and replaying the things corresponding to them would be needed.
> I am guessing there is no skipping of catalog/system tables possible
In the wal-g implementation we download all system relfiles/databases (with oid <= 16384) and the tables from the pg_aoseg namespace. Wal-g also replays all WAL; as far as I know, it is replayed correctly on existing files and somehow on skipped ones.
> Also, resolving all other related table dependencies and replaying the things corresponding to them would be needed.
Yeah. Nice idea for a todo.
Curious to learn how wal-g traces out the relfilenodes corresponding to the system and pg_aoseg tables and passes such info to the restore side.
> Curious to learn how wal-g traces out the relfilenodes corresponding to the system and pg_aoseg tables and passes such info to the restore side.
If by trace out you mean selecting what to download from S3:
Logic for system tables: we consider a relfilenode as system if (1) or (2) holds (see the sketch after this list):
- it is from the system catalog (its database dirname is <= 16384)
- the file name has a system prefix
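A rough sketch of that selection rule as I read it; the helper name is hypothetical, and I am assuming "system prefix" means the relfilenode itself is in the reserved range (<= 16384):

# Decide whether base/<dbdir>/<relfile> belongs to the system set.
is_system_relfile() {
    local dbdir=$1 relfile=$2
    # (1) the whole database directory is a system one
    [ "$dbdir" -le 16384 ] && return 0
    # (2) the relfilenode itself is in the reserved range (assumption)
    [ "${relfile%%.*}" -le 16384 ] && return 0
    return 1
}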
To select the tables from the pg_aoseg namespace, we have the following structure in S3 associated with the backup:
"DatabasesByNames": {
"db": {
"oid": 24576,
"tables": {
"pg_aoseg.pg_aocsseg_24601": 24587,
"pg_aoseg.pg_aocsseg_24608": 24591,
"pg_aoseg.pg_aoseg_24587": 24579,
"pg_aoseg.pg_aoseg_24594": 24583,
"pg_aoseg.pg_aovisimap_24587": 24580,
"pg_aoseg.pg_aovisimap_24587_index": 24581,
"pg_aoseg.pg_aovisimap_24594": 24584,
"pg_aoseg.pg_aovisimap_24594_index": 24585,
"pg_aoseg.pg_aovisimap_24601": 24588,
"pg_aoseg.pg_aovisimap_24601_index": 24589,
"pg_aoseg.pg_aovisimap_24608": 24592,
"pg_aoseg.pg_aovisimap_24608_index": 24593,
"public.ao_to_restore": 24578,
"public.ao_to_skip": 24582,
"public.co_to_restore": 24586,
"public.co_to_skip": 24590,
"public.heap_to_restore": 24576,
"public.heap_to_skip": 24577
}
},
"gpadmin": {
"oid": 16384
},
"postgres": {
"oid": 12812,
"tables": {
"public.gp_stat_archiver": 16384
}
},
"template1": {
"oid": 1
}
}
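For example, on the restore side one could pull the pg_aoseg relfilenodes for database db out of that structure like this (backup_meta.json is a hypothetical name for wherever this JSON is stored):

jq '.DatabasesByNames.db.tables
    | to_entries[]
    | select(.key | startswith("pg_aoseg."))
    | .value' backup_meta.json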
Interesting, fancy, thanks. It seems part of the catalog information is exported in a dump file to be used during such partial restores. (It isn't clear whether the numbers against the table names are OIDs or relfilenodes - guessing from the usage, they are relfilenodes.)
Can someone please tell me which test has failed?
Folks, let's get this merged :)
+1
Real needed functionality, please merge