materialize icon indicating copy to clipboard operation
materialize copied to clipboard

[Epic] System table revamp

Open benesch opened this issue 3 years ago • 13 comments

In no particular order:

To Rename

  • [x] https://github.com/MaterializeInc/materialize/issues/2186
  • [x] mz_source_info needs a better name. "info" is a pet peeve of mine. Removed in dac048f69c37e1f0b9295c9e10541b09ab66ddd4.
  • [ ] https://github.com/MaterializeInc/materialize/issues/14313
  • [ ] https://github.com/MaterializeInc/materialize/issues/14312
  • [x] #14392
  • [x] #13851

To Remove

  • [x] #14389
  • [ ] #14390

To Redesign / Larger Efforts

  • [x] #7172
  • [x] #7173
  • [x] #14350.
  • [x] #14237
  • [x] #5129
  • [ ] #11231
  • [ ] #14724
  • [ ] #14738

Changes to Tables Generated from Compute Logs/Introspection Sources

  • [x] #14394
  • [x] #14396
  • [x] #14397
  • [x] #14398
  • [x] #14407

benesch avatar Jun 25 '21 06:06 benesch

Oooh, while we're at it - can we name the "id" columns on each of these tables such that NATURAL JOIN (or at least JOIN ... USING (id_field)) becomes possible?

antifuchs avatar Jun 25 '21 13:06 antifuchs

That's a great idea too!

benesch avatar Jun 25 '21 18:06 benesch

mz_source_info needs a better name. "info" is a pet peeve of mine.

#7230 is a +1 for this, plus possibly layering a view on top of mz_sources/mz_source_info if that gives us a friendlier UX for the common case.

elindsey avatar Aug 19 '21 14:08 elindsey

Something that came up in a discussion with @vmarcos: The mz_sources table currently contains both Materialize-external sources (Kafka, Postgres), and introspection sources. Those are fundamentally different things so it might make sense to separate them in the system tables too.

We talked about removing the old index-based introspection sources and moving the new storage-based ones into a separate schema, to declutter the catalog and make it clear that user should not rely on these sources to be stable. So maybe this issue is already solved, I'm not up-to-date on the state of discussions here. (cc @lluki)

teskje avatar Aug 01 '22 08:08 teskje

We have observed in connection with PR #14031 a few more naming inconsistencies that could be considered as part of this issue:

  • mz_materialization_dependencies has a column termed dataflow that refers to the ID of the index or materialized view that is backed by a given dataflow. So technically, the column name might not be correct, since it does not match the meaning for a dataflow ID expressed in other catalog relations (e.g., mz_dataflow_names).
  • the catalog relations mz_materializations, mz_materialization_frontiers, mz_worker_materialization_frontiers, mz_materialization_source_frontiers, mz_worker_materialization_source_frontiers, and mz_worker_materialization_delays all use the term global_id for the same concept. However, it is unclear if global_id is a meaningful term or one that we wish to expose to users of the catalog.

vmarcos avatar Aug 05 '22 16:08 vmarcos

mz_records_per_dataflow_global returns only the dataflows in the cluster running the command. It requires the user to switch between all the clusters (SET CLUSTER = ...) to know all the running dataflows.

A cluster_id column like in mz_indexes would be great to avoid switching between clusters.

joacoc avatar Aug 28 '22 21:08 joacoc

@joacoc mz_records_per_dataflow_global is an introspection view, i.e. a view that depends on compute introspection sources (whereas mz_indexes is a system table). As such it can only return information from a single replica. For single-replica clusters, it simply returns the information from the single replica, for multi-replica clusters you have to select a replica using SET CLUSTER_REPLICA = .... Not that you can get different responses when querying an introspection source/view against different replicas, even in the same cluster.

Adding cluster_id and replica_id columns would be possible, but it wouldn't help much, because you'd still need to manually switch between replicas. What we really want is a way to union introspection streams from all replicas and keep that union updated as replicas are added and removed. This is difficult to implement and, AFAIK, is not planned for GA.

(cc @lluki who has more context than me.)

teskje avatar Aug 29 '22 07:08 teskje

@joacoc We also think that adding a replica/cluster id column would be the most idiomatic SQL, but we have pushed this out for now as it is a bit tricky to get it working correctly in all scenarios (see the summary of the discussion here ).

Instead we opted for now, to add a postfixed variant of all introspection sources. That is you can get the data of a specific cluster id by using SELECT * FROM mz_records_per_dataflow_global_1 (where 1 is the cluster id).

In terms of naming, "global" seems indeed a bit overloaded. Is it global per environmentd? Or just per cluster?

lluki avatar Aug 29 '22 07:08 lluki

Issue for offering unified introspection data: #14164

lluki avatar Aug 29 '22 08:08 lluki

Thank you! @teskje @lluki

Is the following assumption correct?:

  • Most of the views from the mz_catalog (SHOW VIEWS FROM mz_catalog) are cluster-dependent and identifiable by the cluster-id at the end of the name. E.g: mz_catalog.{type}-{cluster_id}.
  • All the tables from the mz_catalog (SHOW TABLES FROM mz_catalog) are cluster-independent.

joacoc avatar Aug 29 '22 10:08 joacoc

Most of the views from the mz_catalog (SHOW VIEWS FROM mz_catalog) are cluster-dependent and identifiable by the cluster-id at the end of the name. E.g: mz_catalog.{type}-{cluster_id}.

Not quite: Views the number suffix of these views (e.g. mz_records_per_dataflow_1) is the replica ID, not the cluster ID! Those views are replica-dependent. Since each replica is part of a cluster, that makes them also cluster-dependent.

Apart from the suffixed-views, which are based on the new persist-based introspection sources, there are also views with the same names but without suffix (e.g. mz_records_per_dataflow). These are based on the old arrangement-based introspection sources. Those are also replica-dependent, but the replica they depend on is selected through the CLUSTER_REPLICA session variable (or defaults to the only replica in the current cluster, for single-replica clusters).

All other views (e.g. mz_objects) are based on system tables and therefore replica/cluster-independent.

All the tables from the mz_catalog (SHOW TABLES FROM mz_catalog) are cluster-independent.

That's right, these are all system tables.

teskje avatar Aug 30 '22 08:08 teskje

@benesch I found another one - should mz_peek_active be mz_active_peeks or mz_peeks_active at least?

chaas avatar Sep 06 '22 21:09 chaas

@benesch I found another one - should mz_peek_active be mz_active_peeks or mz_peeks_active at least?

mz_active_peeks, yes!

benesch avatar Sep 07 '22 05:09 benesch

🎉 This is done!

chaas avatar Oct 05 '22 21:10 chaas

YES!

On Wed, Oct 5, 2022 at 5:53 PM Cara Haas @.***> wrote:

🎉 This is done!

— Reply to this email directly, view it on GitHub https://github.com/MaterializeInc/materialize/issues/7174#issuecomment-1269020197, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXSIGACZTBC2PUOJSSICLWBX2DXANCNFSM47JHBJYQ . You are receiving this because you were mentioned.Message ID: @.***>

benesch avatar Oct 05 '22 21:10 benesch