pgcopydb icon indicating copy to clipboard operation
pgcopydb copied to clipboard

Implement support for filtering materialized view and data (refresh).

Open arajkumar opened this issue 10 months ago • 9 comments

Currently, there is no way to filter materialized view or materialized view data(refresh).

pg_dump supports it with following option,

  • pg_dump --exclude-table filters materialized view altogether.
  • pg_dump --exclude-table-data filters REFRESH MATERIALIZED VIEW.

Solution: Incorporate materialized view filtering into existing filtering framework. [exclude-table], [include-table] can be used filter the materialized view and [exclude-table-data] can be used to filter materialized view data(refresh).

arajkumar avatar Apr 09 '24 08:04 arajkumar

Tests are failing with the following message.,

The action 'Run a test' has timed out after 5 minutes.

arajkumar avatar Apr 09 '24 09:04 arajkumar

I'm now wondering about using --exclude-table-data. Sorry didn't think about it before.

Would it make more sense to filter REFRESH commands in the Archive Table of Contents instead? Can we reduce the option for only the materialized views otherwise? We have users with so many table (millions) that we want to avoid cluttering the command line if possible.

dimitri avatar Apr 15 '24 15:04 dimitri

@dimitri

I'm now wondering about using --exclude-table-data. Sorry didn't think about it before.

Yeah, I thought of it, but I felt the approach used in this PR is quite easy to implement than filtering Archive Table of Contents.

Can we reduce the option for only the materialized views otherwise?

Could you please explain more?

Do you mean to introduce a dedicated filtering section only for materialized views? e.g.

[exclude-materialized-view-refresh]
schema1.view1
schema2.view2

or

Do you mean to introduce a flag which automatically skips all matview refresh? e.g.

pgcopydb clone --skip-materialized-view-refresh

We have users with so many table (millions) that we want to avoid cluttering the command line if possible.

I'm wondering this would be an issue only when the user wants to exclude copying table data for lots(millions) of tables right? How common is that case?

arajkumar avatar Apr 16 '24 05:04 arajkumar

I'm now wondering about using --exclude-table-data. Sorry didn't think about it before.

Yeah, I thought of it, but I felt the approach used in this PR is quite easy to implement than filtering Archive Table of Contents.

I see. For filtering at pg_restore level using the Archive Table of Contents we need a filter in copydb_write_restore_list for ARCHIVE_TAG_REFRESH_MATERIALIZED_VIEW entries. We need to see if these entries contain the OID of the MATVIEW only, or also the nspname and relname, and we need to match that name to the filtering setup.

So in terms of complexity it's just another lookup in that loop. In terms of code to write, we might need a lot of infrastructure, maybe even a new SQLite catalog table (s_matviews) with the corresponding Postgres catalogs query and SQLite lookup function.

Can we reduce the option for only the materialized views otherwise?

Could you please explain more?

What I mean is that if the [exclude-table-data] filter contains many tables and a few matview, is there a way to only output the pg_dump command line arguments for the matviews in there, skipping the plain tables?

There is a Unix limit on the length of a command line and it's not very large (use getconf ARG_MAX on Unix systems to discover that limit in bytes).

dimitri avatar Apr 16 '24 09:04 dimitri

We need to see if these entries contain the OID of the MATVIEW only, or also the nspname and relname, and we need to match that name to the filtering setup.

@dimitri Archive TOC contains both oid and nspname + relname.

6712; 0 93394 MATERIALIZED VIEW DATA public metrics_count tsdbadmin

Do you think we could simply filter them out by comparing nspname + relname with the [exclude-table-data] list and ignore them instead of going through the complex filtering setup?

arajkumar avatar Apr 29 '24 07:04 arajkumar

We need to see if these entries contain the OID of the MATVIEW only, or also the nspname and relname, and we need to match that name to the filtering setup.

@dimitri Archive TOC contains both oid and nspname + relname.

6712; 0 93394 MATERIALIZED VIEW DATA public metrics_count tsdbadmin

Do you think we could simply filter them out by comparing nspname + relname with the [exclude-table-data] list and ignore them instead of going through the complex filtering setup?

Well apparently, if I read correctly, the OID is zero, which means we can't lookup by OID here. We have to lookup by restore_list_name, which unfortunately isn't machine friendly. That's because characters such as \n are replaced with spaces, and spaces are not escaped, no quoting is used, and the separator between the three fields is also a space.

With that in mind, I think the best approach here is to fill-in a new s_matviews catalog in the sourceDB, where we store the list of materialized views from the source database, including the computed restore_list_name field. We can then add that to the filterDB filter table and would not need to change the lookup bits I believe.

What do you think?

dimitri avatar Apr 29 '24 10:04 dimitri

6712; 0 93394 MATERIALIZED VIEW DATA public metrics_count tsdbadmin

@dimitri IIUC, catalog id is 0 and oid is 93394. The below is a result of querying pg_class using the oid.

select * from pg_class where oid=93394;

-[ RECORD 1 ]-------+--------------
oid                 | 93394
relname             | metrics_count
relnamespace        | 2200
reltype             | 93396
reloftype           | 0
relowner            | 16418
relam               | 2
relfilenode         | 93418
reltablespace       | 0
relpages            | 0
reltuples           | -1
relallvisible       | 0
reltoastrelid       | 0
relhasindex         | f
relisshared         | f
relpersistence      | p
relkind             | m
relnatts            | 1
relchecks           | 0
relhasrules         | t
relhastriggers      | f
relhassubclass      | f
relrowsecurity      | f
relforcerowsecurity | f
relispopulated      | t
relreplident        | d
relispartition      | f
relrewrite          | 0
relfrozenxid        | 44467302
relminmxid          | 1586679
relacl              |
reloptions          |
relpartbound        |

arajkumar avatar Apr 29 '24 14:04 arajkumar

@dimitri IIUC, catalog id is 0 and oid is 93394. The below is a result of querying pg_class using the oid.

Ah nice, okay, we can indeed do a lookup by OID then. I know you wanted to reduce the amount of code to write, but I still think that the robust approach is to implement that filtering the same way as we implement the table filtering, with s_matviews and then just an OID lookup in the filter table in filterDB.

dimitri avatar Apr 29 '24 15:04 dimitri

@dimitri Modified the implementation to use filter catalog to filter matview and refresh.

arajkumar avatar May 04 '24 04:05 arajkumar

LGTM. Added a small comment to refactor if into switch, other than that it feels ready to merge. Will mark as such, please send a last round of update with the stylistic change if you agree with it!

@dimitri I really liked the way this PR shaped up :) Thanks a ton for your guidance.

arajkumar avatar May 29 '24 16:05 arajkumar

Uh oh, because of merge ordering you now have a conflict to merge in schema.c

dimitri avatar May 29 '24 17:05 dimitri

@dimitri I really liked the way this PR shaped up :) Thanks a ton for your guidance.

Thanks for all the work @arajkumar ; again! And as Tom Lane said one in pgsql-hackers: committers don't grow on trees ;-)

dimitri avatar May 30 '24 10:05 dimitri