gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

[6X] Add GUC ignore_invalid_pages

Open nikifkon opened this issue 2 years ago • 16 comments
trafficstars

Cherry-pick of https://github.com/postgres/postgres/commit/41c184bc642b25f67fb1d8ee290f28805fa5a0b4

Here are some reminders before you submit the pull request

  • [ ] Add tests for the change
  • [ ] Document changes
  • [ ] Communicate in the mailing list if needed
  • [ ] Pass make installcheck
  • [ ] Review a PR in return to support the community

nikifkon avatar Jul 07 '23 13:07 nikifkon

Sure helpful to backport. Just curious what promoted - did you encounter the invalid page case and any possibility of sharing the scenario and if underlying cause can be fixed as well. Misuse of this GUC is only concern where users not reporting issues and going forward with regular business after setting it. Seems it doesn't provide guidance on taking out of config file once set and such.

ashwinstar avatar Jul 07 '23 13:07 ashwinstar

This setting allows to make a partial restore of a table or database (for example in https://github.com/wal-g/wal-g). We plan to turn it on for the recovery period and then turn it off.

nikifkon avatar Jul 07 '23 14:07 nikifkon

Should open up a PR against GPDB main branch as well to make things consistent.

jimmyyih avatar Jul 08 '23 00:07 jimmyyih

This setting allows to make a partial restore of a table or database (for example in https://github.com/wal-g/wal-g). We plan to turn it on for the recovery period and then turn it off.

Please can you elaborate with example. Thanks.

Also, given the GUC is generically added to code base should add text around its setting resetting aspects in GUC definition (understand upstream should have done better job).

ashwinstar avatar Jul 08 '23 08:07 ashwinstar

Please can you elaborate with example. Thanks.

Problem:

We have backup of the whole greenplum cluster with wal files in s3. We want to restore only specific table (suppose table1). But we don't want to download the whole backup.

Current solution without ignore_invalid_pages:

  1. Run command
wal-g backup-fetch 'some_backup' --restore-only=tbl1

that download and copy to /base directory only files related to table1. Also this command setup recovery.conf files like this:

restore_command = '/usr/bin/wal-g seg wal-fetch "%f" "%p" --content-id=-1 --config /etc/wal-g/wal-g.yaml'
recovery_target_name = 'some_backup'
recovery_target_timeline = latest
  1. Run gpstart -a and sometimes(1) get this error:
UTC,,,p10640,th-1769571392,,,,0,,,seg0,,,,,"PANIC","XX000","WAL contains references to invalid pages",,,,,,,0,,"xlogutils.c",247,"Stack trace:
1    0x555e35f56091 postgres errstart + 0x1f1
2    0x555e35f58e1e postgres elog_finish + 0xce
3    0x555e35a791e6 postgres <symbol not found> + 0x35a791e6
4    0x555e35a67b0d postgres <symbol not found> + 0x35a67b0d
5    0x555e35a6ff0d postgres StartupXLOG + 0x14cd
6    0x555e35d56969 postgres StartupProcessMain + 0x139
7    0x555e35ab08cc postgres AuxiliaryProcessMain + 0x55c
8    0x555e35d52149 postgres <symbol not found> + 0x35d52149
9    0x555e35d56106 postgres PostmasterMain + 0x1186
10   0x555e359d61aa postgres main + 0x4aa
11   0x7ff393e53c87 libc.so.6 __libc_start_main + 0xe7
12   0x555e359e229a postgres _start + 0x2a

startup failed!

Solution with ignore_invalid_pages:

  1. we run the same command
wal-g backup-fetch 'some_backup' --restore-only=tbl1

But now it insert ignore_invalid_pages = on into postgresql.conf's of each segment

  1. Start cluster: no PANICs!
  2. Remove ignore_invalid_pages = on
  3. Restart cluster
  4. Use: psql -p 6000 -c 'select * from table1'

partial recovery was successful!


(1) when we create a backup under load and there are some changes between pg_start_backup() and pg_stop_backup(), to table other than table1

nikifkon avatar Jul 08 '23 13:07 nikifkon

Also, given the GUC is generically added to code base should add text around its setting resetting aspects in GUC definition (understand upstream should have done better job).

Sorry I didn't see the "Set classification" section right away. I thinks to minimize difference with postgres, we should use POSTMASTER as GucContext. From this it follows that you need to choose restart instead of reload. From using restart it follows - system instead of session. Since this is a dangerous settings, we should be able to enable it only on the right segments, so I think we need to choose the local instead of the master.

I will update guc-list.html.md today

nikifkon avatar Jul 10 '23 07:07 nikifkon

Thanks a lot @nikifkon for all those clarifications - better understand the purpose and usage for this GUC.

ashwinstar avatar Jul 10 '23 08:07 ashwinstar

Please can you elaborate with example. Thanks.

Problem:

We have backup of the whole greenplum cluster with wal files in s3. We want to restore only specific table (suppose table1). But we don't want to download the whole backup.

Current solution without ignore_invalid_pages:

  1. Run command
wal-g backup-fetch 'some_backup' --restore-only=tbl1

that download and copy to /base directory only files related to table1. Also this command setup recovery.conf files like this:

restore_command = '/usr/bin/wal-g seg wal-fetch "%f" "%p" --content-id=-1 --config /etc/wal-g/wal-g.yaml'
recovery_target_name = 'some_backup'
recovery_target_timeline = latest
  1. Run gpstart -a and sometimes(1) get this error:
UTC,,,p10640,th-1769571392,,,,0,,,seg0,,,,,"PANIC","XX000","WAL contains references to invalid pages",,,,,,,0,,"xlogutils.c",247,"Stack trace:
1    0x555e35f56091 postgres errstart + 0x1f1
2    0x555e35f58e1e postgres elog_finish + 0xce
3    0x555e35a791e6 postgres <symbol not found> + 0x35a791e6
4    0x555e35a67b0d postgres <symbol not found> + 0x35a67b0d
5    0x555e35a6ff0d postgres StartupXLOG + 0x14cd
6    0x555e35d56969 postgres StartupProcessMain + 0x139
7    0x555e35ab08cc postgres AuxiliaryProcessMain + 0x55c
8    0x555e35d52149 postgres <symbol not found> + 0x35d52149
9    0x555e35d56106 postgres PostmasterMain + 0x1186
10   0x555e359d61aa postgres main + 0x4aa
11   0x7ff393e53c87 libc.so.6 __libc_start_main + 0xe7
12   0x555e359e229a postgres _start + 0x2a

startup failed!

Solution with ignore_invalid_pages:

  1. we run the same command
wal-g backup-fetch 'some_backup' --restore-only=tbl1

But now it insert ignore_invalid_pages = on into postgresql.conf's of each segment

  1. Start cluster: no PANICs!
  2. Remove ignore_invalid_pages = on
  3. Restart cluster
  4. Use: psql -p 6000 -c 'select * from table1'

partial recovery was successful!

(1) when we create a backup under load and there are some changes between pg_start_backup() and pg_stop_backup(), to table other than table1

Interesting... reads fancy. How much of this is practically possible I have my doubts about it. I am guessing there is no skipping of catalog/system tables possible - so will have to identify all those relfiles and replay contents for them. Not sure how such identifications is done. Please resolving all other related tables dependencies and also replaying things corresponding to it.

ashwinstar avatar Jul 10 '23 08:07 ashwinstar

I am guessing there is no skipping of catalog/system tables possible

In wal-g implementation we download all system (with oid <= 16384) relfiles/databases and tables from pg_aoseg. namespace. Wal-g also replay all wal, and as far as I know it is replayed correctly on existence file and some how on skipped.

Please resolving all other related tables dependencies and also replaying things corresponding to it.

Yeah. Nice idea todo.

nikifkon avatar Jul 10 '23 09:07 nikifkon

I am guessing there is no skipping of catalog/system tables possible

In wal-g implementation we download all system (with oid <= 16384) relfiles/databases and tables from pg_aoseg. namespace. Wal-g also replay all wal, and as far as I know it is replayed correctly on existence file and some how on skipped.

Curious to learn how wal-g traces out the relfilenodes corresponding to the system and pg_aoseg tables and passes such info to restore side.

ashwinstar avatar Jul 10 '23 16:07 ashwinstar

Curious to learn how wal-g traces out the relfilenodes corresponding to the system and pg_aoseg tables and passes such info to restore side.

If by trace out you mean select what to download from s3

Logic for system tables: we consider relfilenode as a system if 1 or 2

  1. it is from system catalog (with dirname <= 16384)
  2. file has system prefix

To select tables from namespace pg_aoseg we have the following structure in s3 associated with backup:

    "DatabasesByNames": {
        "db": {
            "oid": 24576,
            "tables": {
                "pg_aoseg.pg_aocsseg_24601": 24587,
                "pg_aoseg.pg_aocsseg_24608": 24591,
                "pg_aoseg.pg_aoseg_24587": 24579,
                "pg_aoseg.pg_aoseg_24594": 24583,
                "pg_aoseg.pg_aovisimap_24587": 24580,
                "pg_aoseg.pg_aovisimap_24587_index": 24581,
                "pg_aoseg.pg_aovisimap_24594": 24584,
                "pg_aoseg.pg_aovisimap_24594_index": 24585,
                "pg_aoseg.pg_aovisimap_24601": 24588,
                "pg_aoseg.pg_aovisimap_24601_index": 24589,
                "pg_aoseg.pg_aovisimap_24608": 24592,
                "pg_aoseg.pg_aovisimap_24608_index": 24593,
                "public.ao_to_restore": 24578,
                "public.ao_to_skip": 24582,
                "public.co_to_restore": 24586,
                "public.co_to_skip": 24590,
                "public.heap_to_restore": 24576,
                "public.heap_to_skip": 24577
            }
        },
        "gpadmin": {
            "oid": 16384
        },
        "postgres": {
            "oid": 12812,
            "tables": {
                "public.gp_stat_archiver": 16384
            }
        },
        "template1": {
            "oid": 1
        }
    }

nikifkon avatar Jul 11 '23 06:07 nikifkon

Curious to learn how wal-g traces out the relfilenodes corresponding to the system and pg_aoseg tables and passes such info to restore side.

If by trace out you mean select what to download from s3

Logic for system tables: we consider relfilenode as a system if 1 or 2

  1. it is from system catalog (with dirname <= 16384)
  2. file has system prefix

To select tables from namespace pg_aoseg we have the following structure in s3 associated with backup:

    "DatabasesByNames": {
        "db": {
            "oid": 24576,
            "tables": {
                "pg_aoseg.pg_aocsseg_24601": 24587,
                "pg_aoseg.pg_aocsseg_24608": 24591,
                "pg_aoseg.pg_aoseg_24587": 24579,
                "pg_aoseg.pg_aoseg_24594": 24583,
                "pg_aoseg.pg_aovisimap_24587": 24580,
                "pg_aoseg.pg_aovisimap_24587_index": 24581,
                "pg_aoseg.pg_aovisimap_24594": 24584,
                "pg_aoseg.pg_aovisimap_24594_index": 24585,
                "pg_aoseg.pg_aovisimap_24601": 24588,
                "pg_aoseg.pg_aovisimap_24601_index": 24589,
                "pg_aoseg.pg_aovisimap_24608": 24592,
                "pg_aoseg.pg_aovisimap_24608_index": 24593,
                "public.ao_to_restore": 24578,
                "public.ao_to_skip": 24582,
                "public.co_to_restore": 24586,
                "public.co_to_skip": 24590,
                "public.heap_to_restore": 24576,
                "public.heap_to_skip": 24577
            }
        },
        "gpadmin": {
            "oid": 16384
        },
        "postgres": {
            "oid": 12812,
            "tables": {
                "public.gp_stat_archiver": 16384
            }
        },
        "template1": {
            "oid": 1
        }
    }

Interesting, Fancy, thanks. Seems, part of catalog information is exported in dump file to be used during such partial restores. (Isn't clear the number against the table names are OIDs or relfilenodes - guessing based on usage its relfilenodes).

ashwinstar avatar Jul 11 '23 11:07 ashwinstar

Can someone tell me please which test has failed?

nikifkon avatar Jul 13 '23 08:07 nikifkon

Folks, let's get this merged :)

x4m avatar Mar 21 '24 08:03 x4m

+1

Sanikadze avatar Mar 26 '24 08:03 Sanikadze

Real needed functionality, please merge

diarworld avatar Apr 02 '24 05:04 diarworld