cloudberry icon indicating copy to clipboard operation
cloudberry copied to clipboard

[Feature] Dynamic Table.

Open avamingli opened this issue 1 year ago • 13 comments

Dynamic Table is a an auto-refreshing materialized view which could be constructed by base tables, external tables, materialized views and dynamic tables. And it could be used to answer query by AQUMV. As normal tables in CBDB, dynamic tables could also have distribution keys.

The purpose of Dynamic Tables is to solve the problem often raised by customers who are big fans of a lakehouse architecture: how can we run queries on external tables as fast as internal tables?

See details in discussion #706 and more cases in tests.

Create Dynamic Table:

CREATE DYNAMIC TABLE dt0 SCHEDULE '5 * * * *' AS
  SELECT a, b, sum(c) FROM t1 GROUP BY a, b WITH NO DATA DISTRIBUTED BY(b);
CREATE DYNAMIC TABLE

\d
                 List of relations
 Schema | Name |     Type      |  Owner  | Storage
--------+------+---------------+---------+---------
 public | dt0  | dynamic table | gpadmin | heap
 public | t1   | table         | gpadmin | heap
(2 rows)

CREATE DYNAMIC TABLE xxx AS Query The Query allows any valid SELECT SQL of Materialized Views: from single or multiple relations, base tables, materialized views, and dynamic tables as well, joins, subquery, aggregation, group by and etc. However, if you want to use it to Answer Query, that is limited by AQUMV: currently we allow Select from single base table, aggregation on it or aggregation SQL replace directly #705

SCHEDULE:

A string used to schedule background job which auto-refreshes the dynamic table. We follow the valid string of pg_cron extension which supports linux crontab, refer https://crontab.guru/ .

 ┌───────────── min (0 - 59)
 │ ┌────────────── hour (0 - 23)
 │ │ ┌─────────────── day of month (1 - 31) or last day of the month ($)
 │ │ │ ┌──────────────── month (1 - 12)
 │ │ │ │ ┌───────────────── day of week (0 - 6) (0 to 6 are Sunday to
 │ │ │ │ │                  Saturday, or use names; 7 is also Sunday)
 │ │ │ │ │
 │ │ │ │ │
 * * * * *

You can also use '[1-59] seconds' to schedule a job based on an interval. The example creates a cron job refreshing the dynamic table at minute 5 of each hour.

User don't need to consider the auto-refresh job, however query on pg_task catalog if we want to see the task:

SELECT * FROM pg_task WHERE jobname LIKE 'gp_dynamic_table_refresh%' AND command LIKE '%dt0';
 jobid | schedule  |             command              | nodename  | nodeport | database | username | active |
   jobname
-------+-----------+----------------------------------+-----------+----------+----------+----------+--------+---------
-----------------------
 17398 | 5 * * * * | REFRESH DYNAMIC TABLE public.dt0 | 127.0.0.1 |     9000 | gpadmin  | gpadmin  | t      | gp_dynam
ic_table_refresh_17394
(1 row)

As Snowflake, Dynamic Tables should always have a auto-refresh process. However, for convenience, I make SCHEDULE optional. If user didn't specific it, a default schedule is provided: maybe at every 5th minute(snowflake limit at most 5 minutes for dynamic table auto-refresh, not sure)?

WITH NO DATA:

Same as Materialized View, will create an empty Dynamic Table if specified.

DISTRIBUTED BY:

Same as normal tables in CBDB, Dynamic Tables could support distribution keys as materialized views. Use \d+ to see the distribution keys and the Query SQL of Dynamic Tables.

\d+ dt0;
                                       Dynamic table "public.dt0"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
 a      | integer |           |          |         | plain   |             |              |
 b      | integer |           |          |         | plain   |             |              |
 sum    | bigint  |           |          |         | plain   |             |              |
View definition:
 SELECT t1.a,
    t1.b,
    sum(t1.c) AS sum
   FROM t1
  GROUP BY t1.a, t1.b;
Distributed by: (b)
Access method: heap

Refresh Dynamic Table

As seen in pg_task, we put a command to auto-refresh dynamic tables. However, if users want to do a REFRESH manually, exec command REFRESH DYNAMIC TABLE is also supported.

REFRESH DYNAMIC TABLE dt0;
REFRESH DYNAMIC TABLE

REFRESH WITH NO DATA;

Same as Materialized Views, Refresh with no data will truncate the Dynamic Table and make it unpopulated status.

REFRESH DYNAMIC TABLE dt0 WITH NO DATA;
REFRESH DYNAMIC TABLE

Drop Dynamic Table:

DROP DYNAMIC TABLE dt0;
DROP DYNAMIC TABLE

Drop a Dynamic Table will drop its scheduler job automatically.

SELECT * FROM pg_task WHERE jobname LIKE 'gp_dynamic_table_refresh%' AND command LIKE '%dt0';
 jobid | schedule | command | nodename | nodeport | database | username | active | jobname
-------+----------+---------+----------+----------+----------+----------+--------+---------
(0 rows)

Privileges

Same as Materialized Views in CBDB:

\z
                                Access privileges
 Schema | Name |     Type      | Access privileges | Column privileges | Policies
--------+------+---------------+-------------------+-------------------+----------
 public | dt1  | dynamic table |                   |                   |
 public | t1   | table         |                   |                   |
(2 rows)

Use Dynamic Tables to answer query

Like Materialized Views, Dynamic Tables could be used to answer query too:

CREATE DYNAMIC TABLE dt1  AS
  SELECT * FROM t1 WHERE a = 1 DISTRIBUTED BY(b);
ANALYZE dt1;
SELECT 2
SET enable_answer_query_using_materialized_views = ON;
EXPLAIN(COSTS OFF, VERBOSE)
SELECT * FROM t1 WHERE a = 1;
SELECT * FROM t1 WHERE a = 1;
SET
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   Output: a, b, c
   ->  Seq Scan on public.dt1
         Output: a, b, c
 Settings: enable_answer_query_using_materialized_views = 'on', optimizer = 'off'
 Optimizer: Postgres query optimizer
(6 rows)

 a | b | c
---+---+---
 1 | 2 | 3
 1 | 2 | 3
(2 rows)

Authored-by: Zhang Mingli [email protected]

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • [ ] Bug fix (non-breaking change)
  • [ ] New feature (non-breaking change)
  • [ ] Breaking change (fix or feature with breaking changes)
  • [ ] Documentation update

Breaking Changes

Test Plan

  • [ ] Unit tests added/updated
  • [ ] Integration tests added/updated
  • [ ] Passed make installcheck
  • [ ] Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

⚠️ To skip CI: Add [skip ci] to your PR title. Only use when necessary! ⚠️


avamingli avatar Nov 21 '24 07:11 avamingli

Hi! some initial review:

  1. Patch miss tab-completion for DYNAMIC feature.

When trying to modify dynamic table

reshke=# insert into td values(1, 1);
ERROR:  cannot change materialized view "td"

Maybe some better error message here? we can check for relisdynamic and provide "cannot directly change dynamic table " with some errdetail, like "DETAIL: DYNAMIC TABLE data is automatically populated from its source query."

  1. Im not sure if this a problem, but example:
reshke=# alter materialized view td rename TO tdd;
ALTER MATERIALIZED VIEW

td here is dynamic table. This all comes from fact that dynamic table relkind is 'm' (mat. view). I'm not entirely sure if this design is good. Although the implementation is undoubtedly simpler in this manner, it is at least perplexing that changing a dynamic table requires M.V. SQL syntax. Another case here:

reshke=# drop dynamic table tdd;
DROP DYNAMIC TABLE
reshke=# drop materialized view tdd ;
DROP MATERIALIZED VIEW

Both succeeds.

Maybe we should create this relation as CREATE DYNAMIC MATERIALIZED VIEW not TABLE? This should fix a lot of SQL ambiguity problems.

In any case, we should add this ALTER pattern to regression tests and have document them in some form.

I will take another look later.

reshke avatar Nov 21 '24 10:11 reshke

This all comes from fact that dynamic table relkind is 'm' (mat. view). I'm not entirely sure if this design is good. Although the implementation is undoubtedly simpler in this manner, it is at least perplexing that changing a dynamic table requires M.V. SQL syntax. Another case here:

reshke=# drop dynamic table tdd;
DROP DYNAMIC TABLE
reshke=# drop materialized view tdd ;
DROP MATERIALIZED VIEW

Both succeeds.

Yes, it's a materialized view actually.

Dynamic Table is a an auto-refreshing materialized view

Dynamic Table have all operations from Materialized Views, and it must do. They are same as GPDB's external table and foreign table, if you have a try:

create external table ext1;
drop external table ext1;
drop foreign table ext1;

Both will succeed, external table is actually a foreign table.

We provide gram sugar for that CREATE/DROP, but not all command or all message infos, if users want to use Materialized Views command, that's no problem.

Maybe we should create this relation as CREATE DYNAMIC MATERIALIZED VIEW not TABLE? This should fix a lot of SQL ambiguity problems.

We follow Snowflake, customers want to have Snowflake Dynamic Tables in CBDB, better use the same word.

Please refer to Snowflake https://docs.snowflake.com/en/user-guide/dynamic-tables-intro and our discussion #706 for more details.

avamingli avatar Nov 21 '24 10:11 avamingli

In any case, we should add this ALTER pattern to regression tests and have document them in some form.

Request for discuss: https://github.com/apache/cloudberry/discussions/706#discussioncomment-11320344

avamingli avatar Nov 21 '24 10:11 avamingli

  1. Patch miss tab-completion for DYNAMIC feature.

Oo, seems you'r big fans of tab-completion. It's ok to add them. I'm not planing to add this right now, or at least not in this pr.
Maybe someone have interest could help.

avamingli avatar Nov 21 '24 11:11 avamingli

Dynamic Table have all operations from Materialized Views, and it must do. They are same as GPDB's external table and foreign table, if you have a try:

create external table ext1;
drop external table ext1;
drop foreign table ext1;

Both will succeed, external table is actually a foreign table.

exmaple:

CREATE READABLE EXTERNAL TABLE ext_r(id int)
    LOCATION('demoprot://dynamic_table_text_file.txt')
FORMAT 'text';
CREATE EXTERNAL TABLE
 \d
                 List of relations
 Schema | Name  |     Type      |  Owner  | Storage
--------+-------+---------------+---------+---------
 public | ext_r | foreign table | gpadmin |
(1 row)

 drop external table ext_r;
DROP FOREIGN TABLE
CREATE READABLE EXTERNAL TABLE ext_r(id int)
    LOCATION('demoprot://dynamic_table_text_file.txt')
FORMAT 'text';
CREATE EXTERNAL TABLE
 \d
                 List of relations
 Schema | Name  |     Type      |  Owner  | Storage
--------+-------+---------------+---------+---------
 public | ext_r | foreign table | gpadmin |
(1 row)

drop foreign table ext_r;
DROP FOREIGN TABLE

In summary, we'r doing the similar thing like external table with foreign table, for dynamic tables and materialized views. We provide some command for it, and all valid commands on matview should work on dynamic tables too. I'm not planing to copy lot's of duplicated codes from matview for just several tiny adjustment, that's pointless and the worse is: when postgres have some updated on matview command, we won't catch up and need copy the new changes too. Since they are actually the same thing, let it be.

avamingli avatar Nov 21 '24 11:11 avamingli

I'm not planing to copy lot's of duplicated codes from matview

Im not suggesting this, the thing i propose is to change CREATE DYNAMIC TABLE syntax to CREATE DYNAMIC MATERIALIZED VIEW for convenience

reshke avatar Nov 21 '24 11:11 reshke

I'm not planing to copy lot's of duplicated codes from matview

Im not suggesting this, the thing i propose is to change CREATE DYNAMIC TABLE syntax to CREATE DYNAMIC MATERIALIZED VIEW for convenience

How about we keep CREATE DYNAMIC TABLE? Since snowflake has this feature already and we follow the same term?.

my-ship-it avatar Nov 26 '24 03:11 my-ship-it

Will provide DOC soon.

avamingli avatar Nov 26 '24 03:11 avamingli

Will provide DOC soon.

Done, see doc/src/sgml/ref/create_dynamic_table.sgml, doc/src/sgml/ref/drop_dynamic_table.sgml, doc/src/sgml/ref/refresh_dynamic_table.sgml

avamingli avatar Nov 26 '24 08:11 avamingli

With catalog_change in play, I am assuming we are going to bump the version string to 2.0.0?

edespino avatar Nov 26 '24 09:11 edespino

With catalog_change in play, I am assuming we are going to bump the version string to 2.0.0?

Woo..that will make CBDB version grow up quickly, as we have lot's of codes to be open source with catalog changes. In fact, we guarantee catalog no changed in patch versions in commercial ones.

Ask @my-ship-it for help about this topic, that should be another discussion outside of this PR.

When developers develop codes, they don't need to consider what the version should be, a catalog_change label is necessary if there was. That will help release team to decide version numbers and don't need to sync with authors which PR changed the catalog.

avamingli avatar Nov 26 '24 09:11 avamingli

With catalog_change in play, I am assuming we are going to bump the version string to 2.0.0?

Yes, it is.

my-ship-it avatar Nov 26 '24 09:11 my-ship-it

When mirror is promoted, should we launch auto task for Dynamic Table?

my-ship-it avatar Nov 26 '24 11:11 my-ship-it

When mirror is promoted, should we launch auto task for Dynamic Table?

Synced with my colleague who is familiar with pg_task extension, and confirmed that pg_task could work on mirrors.

avamingli avatar Nov 27 '24 04:11 avamingli

Patch miss tab-completion for DYNAMIC feature.

According to various feedback, I will enable this. Perhaps within pg_dump patch of Dynamic Table.

avamingli avatar Dec 04 '24 10:12 avamingli

Patch miss tab-completion for DYNAMIC feature.

According to various feedback, I will enable this. Perhaps within pg_dump patch of Dynamic Table.

Tab-complete added in 80bfc73fbec9e680dd68afa27aee4f953fbc7b87. pg_dump added in 0b99585b373eafbab5272d6f43a34d3555e73cf2.

avamingli avatar Dec 04 '24 15:12 avamingli

Add commit: https://github.com/apache/cloudberry/pull/725/commits/7a13989d24192bb471ff668a1754ae489e538ef5 to give an example of Dynamic Table use case: how to query on external tables as fast as internal tables?

avamingli avatar Dec 04 '24 16:12 avamingli

Rebased and fixed: Bump catversion in dadeb84c0266d0117fd77cfa5ab8719d19b878a3

avamingli avatar Dec 06 '24 05:12 avamingli

Update: two commits added to resolve the missed SCHEDULE clause when pg_dump DYNAMIC TABLE.

b215893a5dc367462d956a5113c2723cb090a8c5 Add it when a Dynamic Table is dumped.

Dynamic Table's SCHEDULE clause is stored in pg_task jobs. Since the SCHEDULE clause is optional, there would be no error if we forget it when dump a Dynamic Table info. And a default SCHEDULE is added with the value of Macro: DYNAMIC_TABLE_DEFAULT_REFRESH_INTERVAL

17c9fedb96049380a80477047c41f678375c7cd5 provides a function pg_get_dynamic_table_schedule() to get SCHEDULE info for pg_dump. And, users also could use it to see SCHEDULE info of Dynamic Table.

avamingli avatar Dec 06 '24 12:12 avamingli