[Epic] Refresh options for materialized views

Open ggevay opened this issue 2 years ago • 2 comments

Product outcome

TLDR: Users should be able to configure materialized views to compute result changes less frequently, but more cheaply.

Copied from the design doc.

Materialize keeps materialized views fresh by running a dataflow continuously. These dataflows consume significant resources, but the resource usage is worth it for a large class of use cases. Therefore, the use cases that are the main focus of Materialize are the ones that can derive value from always fresh computation results, termed Operational Data Warehouse (ODW) use cases.

ODW use cases typically focus on the most recent input data (hot data). However, there is often a significantly larger amount of older data lying around (cold data). The cold data is often of the same kind as the hot data, and therefore we postulate that users will often want to perform the same or similar computations on it. In this case, it would be convenient for users to simply use Materialize for both hot and cold computations, and not complicate their architectures by implementing the same computations in both Materialize and a different, traditional data warehouse.

The problem is that Materialize is currently not very cost-effective for running a materialized view on cold data, because

  • there is significantly more cold data than hot data, so a much larger compute instance is needed to keep compute state in memory, and
  • keeping results always fresh is not so valuable for cold data.

This currently prevents us from capturing these cold use cases, even though they feel very close to our hot, core ODW use cases.

(Old product repo issue: https://github.com/MaterializeInc/product/issues/231)

Discovery

We discussed various alternatives. The main alternative is to implement nothing new and instead tell customers to use tables and update them with externally scheduled jobs. However, this would be quite complicated for customers to manage, and would thus take away from the simplicity that is an important feature of Materialize. See the design doc for more discussion.

We also implemented a prototype with Jan, validating, most importantly, that frontiers move as we need and expect them to.

Work items

Private Preview

- [x] Prototype. Done: https://github.com/ggevay/materialize/commit/f1f279cad786f4b3cb96ac4e376fa1bd90e160ea
- [ ] https://github.com/MaterializeInc/materialize/pull/23870
- [ ] https://github.com/MaterializeInc/materialize/pull/23819
- [ ] https://github.com/MaterializeInc/materialize/pull/22776
- [x] Put behind a feature flag
- [ ] https://github.com/MaterializeInc/materialize/pull/24219
- [x] Freshness plans: product + FE testing (Steffen tested it, so I've checked this box.)
- [ ] https://github.com/MaterializeInc/materialize/issues/24469
- [ ] https://github.com/MaterializeInc/materialize/issues/24481
- [ ] https://github.com/MaterializeInc/materialize/issues/24288
- [ ] https://github.com/MaterializeInc/materialize/issues/24591
- [ ] https://github.com/MaterializeInc/materialize/issues/24244
- [ ] https://github.com/MaterializeInc/materialize/issues/23035
- [ ] https://github.com/MaterializeInc/materialize/issues/24966
- [ ] https://github.com/MaterializeInc/materialize/issues/25279
- [ ] https://github.com/MaterializeInc/materialize/issues/25278

Simple Version of Automated Replica Management

- [ ] Automated Replica Management, as discussed here: https://www.notion.so/materialize/Compute-meeting-on-automatic-cluster-scheduling-ce353b8af52e449d8784241c4a1c0585 (Policy v2)

Observability

- [ ] https://github.com/MaterializeInc/materialize/issues/25333
- [ ] https://github.com/MaterializeInc/materialize/issues/24051
- [ ] https://github.com/MaterializeInc/console/issues/1198
- [ ] https://github.com/MaterializeInc/console/issues/1419

Public Preview

- [ ] Freshness plans: docs updates
- [ ] https://github.com/MaterializeInc/www/issues/927

Other

- [ ] Adjust index `since` in bootstrapping. Index `since` would currently be 1 sec before the next refresh, which is probably in the future. A fix would be to bring the since to the current system time if it would be in the future (if the current system time is valid).
- [ ] Implement cron schedule. Useful e.g. for skipping refreshes on weekends. (A big user wants to do this.) Also useful to get around daylight saving time issues.
- [ ] https://github.com/MaterializeInc/materialize/issues/23179
- [ ] https://github.com/MaterializeInc/product/issues/231
- [ ] https://github.com/MaterializeInc/materialize/issues/25127
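
The index `since` fix proposed in the first item above can be sketched as a simple clamp. This is only an illustration of the described rule, not the actual bootstrapping code: the function name, the millisecond timestamps, and the use of `None` to model an invalid system time are all assumptions.

```python
from typing import Optional


def adjusted_index_since(planned_since_ms: int, now_ms: Optional[int]) -> int:
    """Hypothetical helper: pull an index `since` back to the current time.

    planned_since_ms: the `since` bootstrapping would pick today
        (about 1 sec before the next refresh, so possibly in the future).
    now_ms: current system time, or None if it is not valid (assumed modeling).
    """
    if now_ms is None:
        # No valid system time: leave the planned `since` untouched.
        return planned_since_ms
    # Only move the `since` if it would otherwise lie in the future.
    return min(planned_since_ms, now_ms)
```

With a next refresh far in the future, `adjusted_index_since(2_000, 1_500)` yields the current time `1_500`, while a `since` that is already in the past is returned unchanged.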

Decision log

  • January 2, 2024. Updated release date to 1/18. Added tasks for private preview release.
  • December 14, 2023. Updated release date to 1/4.
  • November 21, 2023. Updated release date to 12/14. (First milestone, without automated replica management.)
  • November 15, 2023. Added task https://github.com/MaterializeInc/materialize/issues/23179
  • November 14, 2023. REFRESH NEVER epic merged into this one as a task. (https://github.com/MaterializeInc/materialize/issues/23012)
  • November 8, 2023. Further design discussions initiated after it turned out that performing the initial refresh at the initial since would yield weird behavior.
  • November 1, 2023. Frank hits approve on the design doc. (some comments still to be addressed)
  • October 30, 2023. Design Doc PR submitted.
  • October 27, 2023. Prototype with Jan to validate the approach suggested by Frank. An interesting observation was that the Adapter is actually holding sinces near the current time, so we don't need changes in the Compute Controller to prevent sinces from jumping forward to the next refresh.
  • October 25, 2023. After lots of discussions, Frank suggests in the Timely office hours that we should implement the tricky part simply by rounding up timestamps in the Persist sink. (Instead of e.g., launching new single-time dataflows at every refresh.)
  • October 24, 2023. We pitch the table workaround to the customer. Customer thoroughly dislikes it.
  • October 24, 2023. From discussions with the customer, we realize that satisfying what later becomes Success criterion 3 in the design doc is critical. This is because the customer would like to provide a unified view of cold and hot data in PowerBI, but PowerBI can't union the two views, so we need to union them in Materialize.
  • October 2023. Lots of internal discussions, coming up with many alternatives.
  • August 2023. Need arises at a big customer to perform the same computation on both hot and cold data. The cold data is orders of magnitude bigger, but results for the cold data need to be updated much less often.
  • July 21, 2022. An issue that covers the feature in this epic, but is strictly broader, is opened by Frank: https://github.com/MaterializeInc/materialize/issues/13762
  • May 3, 2021. A related issue is opened by Frank: https://github.com/MaterializeInc/materialize/issues/6745 (mentioning the timestamp round-up trick)
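
The timestamp round-up trick mentioned in the October 25 entry can be illustrated with a small sketch. It assumes refreshes happen at fixed intervals from some alignment point, and that an update landing exactly on a refresh time stays at that time. The function and parameter names are hypothetical and this is not the Persist sink code; it only shows the arithmetic of mapping each update timestamp to the next refresh time.

```python
def round_up_to_refresh(ts_ms: int, aligned_to_ms: int, interval_ms: int) -> int:
    """Return the smallest refresh time that is >= ts_ms.

    Refresh times are assumed to be aligned_to_ms + k * interval_ms
    for integer k (hypothetical model of a fixed-interval schedule).
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    # Ceiling division via negated floor division: ceil(x) == -floor(-x).
    k = -((aligned_to_ms - ts_ms) // interval_ms)
    return aligned_to_ms + k * interval_ms


# Updates arriving at different times within one interval all become
# visible at the same refresh time.
update_times = [1, 999, 1000, 1001]
rounded = [round_up_to_refresh(t, aligned_to_ms=0, interval_ms=1000) for t in update_times]
```

Because every update within an interval is written at the same rounded-up time, a single continuously running dataflow only produces output changes at refresh times, which is the alternative to launching a new single-time dataflow at every refresh.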

ggevay avatar Nov 02 '23 10:11 ggevay

@ggevay, how will this work be reflected in the system catalog? Will there be a new system table that logs metadata about each refresh? A new type column in mz_materialized_views (like in mz_sources)? Thinking about e.g. console work that will depend on telling these materialized views apart from those not configured for refreshes, surfacing status and interval, and so on. It'd be great to also include system catalog-related details in the design doc or an issue!

morsapaes avatar Nov 16 '23 01:11 morsapaes

@morsapaes,

> Will there be a new system table that logs metadata about each refresh?

The replicas coming and going automatically will be recorded in mz_audit_events. This also has a freeform JSON field, so there we could record additional metadata about how the refresh happened.

> A new type column in mz_materialized_views (like in mz_sources)?

Good idea!

> Thinking about e.g. console work that will depend on telling these materialized views apart from those not configured for refreshes, surfacing status and interval, and so on.

@hlburak started a Notion doc about Console / UX work.

> It'd be great to also include system catalog-related details in the design doc or an issue!

Will do!

ggevay avatar Nov 16 '23 12:11 ggevay

Closing this, as the Private Preview section is now complete. After a discussion with @lfest, I moved the parts of this issue that were not in the "Private Preview" section to a new issue (https://github.com/MaterializeInc/materialize/issues/26010). See decision log for more details.

ggevay avatar Mar 14 '24 10:03 ggevay