materialize
materialize copied to clipboard
[Epic] System table revamp
In no particular order:
To Rename
- [x] https://github.com/MaterializeInc/materialize/issues/2186
- [x]
mz_source_info
needs a better name. "info" is a pet peeve of mine. Removed in dac048f69c37e1f0b9295c9e10541b09ab66ddd4. - [ ] https://github.com/MaterializeInc/materialize/issues/14313
- [ ] https://github.com/MaterializeInc/materialize/issues/14312
- [x] #14392
- [x] #13851
To Remove
- [x] #14389
- [ ] #14390
To Redesign / Larger Efforts
- [x] #7172
- [x] #7173
- [x] #14350.
- [x] #14237
- [x] #5129
- [ ] #11231
- [ ] #14724
- [ ] #14738
Changes to Tables Generated from Compute Logs/Introspection Sources
- [x] #14394
- [x] #14396
- [x] #14397
- [x] #14398
- [x] #14407
Oooh, while we're at it - can we name the "id" columns on each of these tables such that NATURAL JOIN
(or at least JOIN ... USING (id_field)
) becomes possible?
That's a great idea too!
mz_source_info needs a better name. "info" is a pet peeve of mine.
#7230 is a +1 for this, plus possibly layering a view on top of mz_sources/mz_source_info if that gives us a friendlier UX for the common case.
Something that came up in a discussion with @vmarcos: The mz_sources
table currently contains both Materialize-external sources (Kafka, Postgres), and introspection sources. Those are fundamentally different things so it might make sense to separate them in the system tables too.
We talked about removing the old index-based introspection sources and moving the new storage-based ones into a separate schema, to declutter the catalog and make it clear that user should not rely on these sources to be stable. So maybe this issue is already solved, I'm not up-to-date on the state of discussions here. (cc @lluki)
We have observed in connection with PR #14031 a few more naming inconsistencies that could be considered as part of this issue:
-
mz_materialization_dependencies
has a column termeddataflow
that refers to the ID of the index or materialized view that is backed by a given dataflow. So technically, the column name might not be correct, since it does not match the meaning for a dataflow ID expressed in other catalog relations (e.g.,mz_dataflow_names
). - the catalog relations
mz_materializations
,mz_materialization_frontiers
,mz_worker_materialization_frontiers
,mz_materialization_source_frontiers
,mz_worker_materialization_source_frontiers
, andmz_worker_materialization_delays
all use the termglobal_id
for the same concept. However, it is unclear ifglobal_id
is a meaningful term or one that we wish to expose to users of the catalog.
mz_records_per_dataflow_global
returns only the dataflows in the cluster running the command. It requires the user to switch between all the clusters (SET CLUSTER = ...
) to know all the running dataflows.
A cluster_id
column like in mz_indexes
would be great to avoid switching between clusters.
@joacoc mz_records_per_dataflow_global
is an introspection view, i.e. a view that depends on compute introspection sources (whereas mz_indexes
is a system table). As such it can only return information from a single replica. For single-replica clusters, it simply returns the information from the single replica, for multi-replica clusters you have to select a replica using SET CLUSTER_REPLICA = ...
. Not that you can get different responses when querying an introspection source/view against different replicas, even in the same cluster.
Adding cluster_id
and replica_id
columns would be possible, but it wouldn't help much, because you'd still need to manually switch between replicas. What we really want is a way to union introspection streams from all replicas and keep that union updated as replicas are added and removed. This is difficult to implement and, AFAIK, is not planned for GA.
(cc @lluki who has more context than me.)
@joacoc We also think that adding a replica/cluster id column would be the most idiomatic SQL, but we have pushed this out for now as it is a bit tricky to get it working correctly in all scenarios (see the summary of the discussion here ).
Instead we opted for now, to add a postfixed variant of all introspection sources. That is you can get the data of a specific cluster id by using SELECT * FROM mz_records_per_dataflow_global_1
(where 1 is the cluster id).
In terms of naming, "global" seems indeed a bit overloaded. Is it global per environmentd? Or just per cluster?
Issue for offering unified introspection data: #14164
Thank you! @teskje @lluki
Is the following assumption correct?:
- Most of the views from the mz_catalog (
SHOW VIEWS FROM mz_catalog
) are cluster-dependent and identifiable by thecluster-id
at the end of the name. E.g:mz_catalog.{type}-{cluster_id}
. - All the tables from the mz_catalog (
SHOW TABLES FROM mz_catalog
) are cluster-independent.
Most of the views from the mz_catalog (SHOW VIEWS FROM mz_catalog) are cluster-dependent and identifiable by the cluster-id at the end of the name. E.g: mz_catalog.{type}-{cluster_id}.
Not quite: Views the number suffix of these views (e.g. mz_records_per_dataflow_1
) is the replica ID, not the cluster ID! Those views are replica-dependent. Since each replica is part of a cluster, that makes them also cluster-dependent.
Apart from the suffixed-views, which are based on the new persist-based introspection sources, there are also views with the same names but without suffix (e.g. mz_records_per_dataflow
). These are based on the old arrangement-based introspection sources. Those are also replica-dependent, but the replica they depend on is selected through the CLUSTER_REPLICA
session variable (or defaults to the only replica in the current cluster, for single-replica clusters).
All other views (e.g. mz_objects
) are based on system tables and therefore replica/cluster-independent.
All the tables from the mz_catalog (SHOW TABLES FROM mz_catalog) are cluster-independent.
That's right, these are all system tables.
@benesch I found another one - should mz_peek_active
be mz_active_peeks
or mz_peeks_active
at least?
@benesch I found another one - should
mz_peek_active
bemz_active_peeks
ormz_peeks_active
at least?
mz_active_peeks
, yes!
🎉 This is done!
YES!
On Wed, Oct 5, 2022 at 5:53 PM Cara Haas @.***> wrote:
🎉 This is done!
— Reply to this email directly, view it on GitHub https://github.com/MaterializeInc/materialize/issues/7174#issuecomment-1269020197, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXSIGACZTBC2PUOJSSICLWBX2DXANCNFSM47JHBJYQ . You are receiving this because you were mentioned.Message ID: @.***>